Natural Language Processing (NLP) Zero to Mastery Part I: Foundations
ChenDataBytes · Follow
10 min read · Dec 28, 2023
https://medium.com/@chenycy/natural-language-processing-nlp-zero-to-mastery-part-i-foundations-d87331c737fb 1/22
Regarding the foundational aspects of NLP, we will delve into ten essential
topics: lemmatization, stemming, part-of-speech tagging, stop words,
pattern matching, sentence segmentation, named entity recognition,
tokenization, word embedding and bag-of-words.
NLP concepts
Linguistic Basics
1. Stemming
Stemming is a rule-based linguistic method that reduces words to a base or root form by stripping letters from the word's end. Its objective is to simplify words by disregarding tense, pluralization, and other grammatical variations. The Porter stemming algorithm applies a collection of predetermined rules and heuristics to remove common English suffixes, converting words into their corresponding stems. spaCy doesn't have a built-in implementation of the Porter stemmer, so we use NLTK for this exercise.
from nltk.stem.porter import PorterStemmer

p_stemmer = PorterStemmer()
words = ["runner", "running", "ran"]
for word in words:
    print(word + ' --> ' + p_stemmer.stem(word))
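Because stemming is purely rule-based, the result is not always a dictionary word. A quick sketch with NLTK's Porter stemmer on a few more words:

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
# rule-based suffix stripping can produce stems that are not real words
for word in ["running", "easily", "studies"]:
    print(word, '-->', stemmer.stem(word))
```

Note that "easily" becomes "easili" and "studies" becomes "studi", neither of which is an English word; this is the trade-off that lemmatization, covered next, addresses.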
2. Lemmatization
In contrast to stemming, lemmatization is a more sophisticated linguistic
process that aims to reduce a word to its base or dictionary form, known as a
lemma. Lemmatization takes into account factors such as part-of-speech
(POS) tags and contextual understanding to ensure accurate and meaningful
transformations.
import spacy

nlp = spacy.load('en_core_web_sm')
doc1 = nlp(u"The dedicated runner, after running for hours, finally ran across the finish line.")
for token in doc1:
    print(token.text, '-->', token.lemma_)
3. Part of speech
Part of speech refers to the grammatical category of a word in a sentence,
such as noun, verb, adjective, adverb, pronoun, preposition, conjunction, or
interjection. Part of speech tagging can be used for various purposes,
including identifying named entities and speech recognition. The
probabilities of part of speech tags occurring near one another can be used
to generate the most reasonable output.
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u"The dedicated runner, after running for hours, finally ran across the finish line.")
# print POS for "after"
print(doc[4].text, doc[4].pos_, doc[4].tag_, spacy.explain(doc[4].tag_))
4. Stop words
Common words such as "a" and "the" occur so frequently in text that they often carry little meaning compared to nouns, verbs, and modifiers. These commonly occurring words are referred to as stop words and can be excluded or filtered out during text processing. spaCy provides a built-in list of approximately 305 English stop words that can be readily used.
from spacy.lang.en.stop_words import STOP_WORDS

sentences = "The dedicated runner, after running for hours, finally ran across the finish line."

def remove_stopwords(sentence):
    sentence = sentence.lower()
    words = sentence.split()
    # keep only the words that are not in spaCy's stop-word list
    filtered = [word for word in words if word not in STOP_WORDS]
    return " ".join(filtered)
print(remove_stopwords(sentences))
5. Pattern Match
Pattern matching entails the identification and extraction of linguistic
patterns or structural information from text. This process involves searching
for specific sequences of words, phrases, or syntactic structures that
conform to predefined patterns or rules.
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'The dedicated runner, after running for hours, finally ran across the finish line.')

matcher = PhraseMatcher(nlp.vocab)
# the matcher matches nothing until at least one pattern is registered
matcher.add("RUN_PHRASE", [nlp("running for hours")])
matches = matcher(doc)
for match_id, start, end in matches:
    print(nlp.vocab.strings[match_id], doc[start:end].text)
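Beyond exact phrases, spaCy's token-based Matcher can express patterns over token attributes. A small sketch with a hypothetical two-token pattern (a blank pipeline is enough here, since no trained model is needed for attribute matching):

```python
import spacy
from spacy.matcher import Matcher

# a blank English pipeline: tokenizer only, no trained model required
nlp = spacy.blank("en")
doc = nlp("The dedicated runner, after running for hours, finally ran across the finish line.")

matcher = Matcher(nlp.vocab)
# hypothetical pattern: the lowercase token "running" followed by "for"
pattern = [{"LOWER": "running"}, {"LOWER": "for"}]
matcher.add("RUN_FOR", [pattern])

spans = [doc[start:end].text for _, start, end in matcher(doc)]
print(spans)
```

Token patterns can also match on part-of-speech tags, lemmas, or regular expressions, which makes them more flexible than exact phrase lists.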
6. Sentence Segmentation
Sentence segmentation refers to the process of dividing a document or a
textual piece into individual sentences. In natural language processing,
accurately identifying sentence boundaries is crucial for various text
analysis and language understanding tasks.
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'This is the first sentence. This is another sentence. This is the last sentence.')
for sent in doc.sents:
print(sent)
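When no trained model is available (or speed matters), spaCy's rule-based sentencizer component offers a lightweight alternative that splits on punctuation:

```python
import spacy

# a blank pipeline plus the rule-based sentencizer: no trained model needed
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

doc = nlp("This is the first sentence. This is another sentence.")
sents = list(doc.sents)
for sent in sents:
    print(sent.text)
```

The trained model's parser handles trickier boundaries (abbreviations, decimal numbers) better, so the sentencizer is best for clean, simply punctuated text.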
7. Named Entity Recognition
Named entity recognition (NER) locates and classifies entities mentioned in text, such as people, organizations, locations, dates, and quantities.
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'The dedicated runner, after running for hours, finally ran across the finish line.')
for ent in doc.ents:
    print(ent.text, ent.start, ent.end, ent.start_char, ent.end_char, ent.label_)
Text Representation
8. Tokenization
Tokenization involves breaking down the original text into smaller
components known as tokens. These tokens can be created based on
contiguous sequences of characters or words. The utilization of spacy for
tokenization is demonstrated below.
import spacy

nlp = spacy.load('en_core_web_sm')
mystring = '"The dedicated runner, after running for hours, finally ran across the finish line."'
doc = nlp(mystring)

# Counting tokens
print("\nToken count:", len(doc))
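Tokenization itself does not require a trained model; a blank spaCy pipeline's tokenizer already splits punctuation and contractions into separate tokens:

```python
import spacy

# tokenizer-only pipeline: no trained model needed for tokenization
nlp = spacy.blank("en")
doc = nlp("Don't stop running!")
tokens = [token.text for token in doc]
print(tokens)
```

Note how the contraction "Don't" is split into "Do" and "n't", and the exclamation mark becomes its own token; this behavior is what downstream components such as the tagger and lemmatizer rely on.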
9. Word Embedding
Word embeddings map words, sentences, or documents to dense numeric vectors so that semantically similar text lies close together in vector space. Unlike simple count-based representations, the vectors are learned from data and capture meaning beyond exact word overlap.
The code provided demonstrates two ways of adding the embedding layer in
a TensorFlow model. The first method is to use the Embedding layer. The
second method is the use of TensorFlow Hub to build a neural network
model using the Universal Sentence Encoder as a pre-trained embedding
layer. By setting trainable=True, the layer parameters can be fine-tuned
during training.
#embedding method 1
import tensorflow as tf

# num_words, embedding_dim and max_length are assumed to be defined earlier
model = tf.keras.Sequential([
    # Embedding layer parameters:
    #   input_dim: size of the vocabulary, i.e. maximum integer index + 1
    #   output_dim: dimension of the dense embedding
    #   input_length: length of input sequences, when it is constant
    # Input: 2D tensor with shape (batch_size, input_length)
    # Output: 3D tensor with shape (batch_size, input_length, output_dim)
    tf.keras.layers.Embedding(input_dim=num_words, output_dim=embedding_dim, input_length=max_length),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(265, activation='relu'),
    tf.keras.layers.Dense(64, activation='relu'),
    # one sigmoid unit for binary classification (softmax on a single unit always outputs 1)
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
#embedding method 2
import tensorflow as tf
import tensorflow_hub as hub

model = tf.keras.Sequential([
    hub.KerasLayer("https://tfhub.dev/google/universal-sentence-encoder/4",
                   trainable=True, dtype=tf.string, input_shape=[]),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid")
])
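The point of dense vectors is that similarity between embedded texts can be measured geometrically, typically with cosine similarity. A minimal sketch with hypothetical 3-dimensional toy vectors (real embeddings usually have hundreds of dimensions):

```python
import numpy as np

def cosine_similarity(a, b):
    # cosine of the angle between two vectors: 1.0 means same direction
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# toy 3-d vectors standing in for learned word embeddings
run = np.array([0.9, 0.1, 0.0])
jog = np.array([0.8, 0.2, 0.1])
cat = np.array([0.0, 0.1, 0.9])

print(cosine_similarity(run, jog))  # close to 1: similar words
print(cosine_similarity(run, cat))  # close to 0: unrelated words
```

In a trained embedding space, words used in similar contexts end up with high cosine similarity, which is what the pooling and dense layers above exploit.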
10. Bag-of-Words/TF-IDF
The bag-of-words approach represents text as an unordered collection, or "bag", of individual words or tokens. It creates a numerical representation of a document or corpus by tallying the occurrences of each word; the original word order is discarded entirely, and the focus is solely on word frequency.
TF-IDF (term frequency-inverse document frequency) extends bag-of-words by weighting each term:
TF-IDF(t, d, D) = TF(t, d) × IDF(t, D)
If the TF-IDF value is high, it means the term is both common in the
document and rare across the entire corpus, making it a distinctive feature
of that document.
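The formula can be computed by hand. A minimal sketch using the plain textbook variant IDF = log(N/df) on toy tokenized documents (scikit-learn's TfidfVectorizer applies a smoothed formula and normalization, so its numbers differ):

```python
import math

def tf(term, doc):
    # term frequency: raw count of the term in the document
    return doc.count(term)

def idf(term, docs):
    # inverse document frequency: log(N / df)
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df)

docs = [
    ["the", "runner", "ran"],
    ["i", "enjoy", "running"],
    ["i", "like", "to", "run"],
]

# "runner" appears once in doc 0 and in no other document:
# TF = 1, IDF = log(3/1), so the score is log(3) ≈ 1.0986
score = tf("runner", docs[0]) * idf("runner", docs)
print(round(score, 4))
```

A term like "i", which appears in two of the three documents, would get IDF = log(3/2) ≈ 0.405, illustrating how corpus-wide frequency pulls the weight down.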
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "The dedicated runner, after running for hours, finally ran across the finish line",
    "I enjoy running",
    "I like to run in the morning"
]

vectorizer = TfidfVectorizer()
tfidf_vectors = vectorizer.fit_transform(documents)
feature_names = vectorizer.get_feature_names_out()
print(feature_names)
End note:
In summary, this NLP primer covered ten basic concepts, from tokenization
and word embeddings to part-of-speech tagging and named entity
recognition. This foundational understanding sets the stage for Part II,
where we’ll explore practical NLP applications across diverse domains.