
Unraveling the Power of Natural Language Processing: Tokenization, Stemming, Text Classification, and Named Entity Recognition

 Introduction:

o Natural Language Processing (NLP) refers to the branch of computer science, and more specifically the branch of artificial intelligence (AI), concerned with giving computers the ability to understand text and spoken words in much the same way human beings can.
o NLP combines computational linguistics (rule-based modeling of human language) with statistical, machine learning, and deep learning models. Together, these technologies enable computers to process human language in the form of text or voice data and to ‘understand’ its full meaning, complete with the speaker or writer’s intent and sentiment.
o Natural Language Toolkit (NLTK) is one of the largest Python libraries for performing various Natural Language Processing tasks. From rudimentary tasks such as text pre-processing to tasks like vectorized representation of text, NLTK’s API covers everything.
o This blog explores fundamental concepts in NLP: tokenization, stemming, text classification, and Named Entity Recognition (NER).
 Tokenization:

o Tokenization, in the realm of Natural Language Processing (NLP) and machine learning, refers to the process of converting a sequence of text into smaller parts, known as tokens. These can be as small as characters or as long as words. The primary reason this process matters is that it helps machines understand human language by breaking it down into bite-sized pieces, which are easier to analyze.

o The primary goal of tokenization is to represent text in a manner that’s meaningful for machines without losing its context. By converting text into tokens, algorithms can more easily identify patterns. This pattern recognition is crucial because it makes it possible for machines to understand and respond to human input. For instance, when a machine encounters the word “running”, it doesn’t see it as a singular entity but rather as a combination of tokens that it can analyze and derive meaning from.

o To delve deeper into the mechanics, consider the sentence, “Chatbots are helpful”.
When we tokenize this sentence by words, it transforms into an array of
individual words: ["Chatbots", "are", "helpful"].

o This is a straightforward approach where spaces typically dictate the boundaries of tokens. However, if we were to tokenize by characters, the sentence would fragment into: ["C", "h", "a", "t", "b", "o", "t", "s", " ", "a", "r", "e", " ", "h", "e", "l", "p", "f", "u", "l"].

o Example:

import nltk
from nltk.tokenize import word_tokenize

# Download the tokenizer model on first use (the resource name may vary by NLTK version)
nltk.download('punkt')

# Sample sentence
sentence = "Natural Language Processing is fascinating!"

# Tokenization using nltk's word_tokenize
tokens = word_tokenize(sentence)

# Display the original sentence and the tokens
print("Original Sentence:")
print(sentence)
print("\nTokenized Words:")
print(tokens)
o Types of Tokenization:

Tokenization methods vary based on the granularity of the text breakdown and the specific requirements of the task at hand. These methods can range from dissecting text into individual words to breaking them down into characters or even smaller units. Here’s a closer look at the different types:

 Word tokenization: This method breaks text down into individual words. It’s the most common approach and is particularly effective for languages with clear word boundaries like English.

 Character tokenization: Here, the text is segmented into individual characters. This method is beneficial for languages that lack clear word boundaries or for tasks that require a granular analysis, such as spelling correction.

 Subword tokenization: Striking a balance between word and character tokenization, this method breaks text into units that might be larger than a single character but smaller than a full word. For instance, “Chatbots” could be tokenized into “Chat” and “bots”. This approach is especially useful for languages that form meaning by combining smaller units or when dealing with out-of-vocabulary words in NLP tasks (see the sketch after this list).
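o Example: a minimal sketch in plain Python contrasting the three granularities. The tiny subword vocabulary here is made up for illustration; real systems learn it from data with algorithms such as Byte-Pair Encoding (BPE) or WordPiece.

sentence = "Chatbots are helpful"

# Word tokenization: split on whitespace (real tokenizers also handle punctuation)
word_tokens = sentence.split()

# Character tokenization: every character, including spaces, becomes a token
char_tokens = list(sentence)

# Subword tokenization: toy greedy longest-match against a tiny, hand-picked vocabulary
vocab = {"Chat", "bots", "are", "help", "ful"}

def subword_tokenize(word, vocab):
    tokens, start = [], 0
    while start < len(word):
        # Take the longest vocabulary entry that matches at this position
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                tokens.append(word[start:end])
                start = end
                break
        else:
            # Unknown character: emit it as its own token
            tokens.append(word[start])
            start += 1
    return tokens

subword_tokens = [t for w in word_tokens for t in subword_tokenize(w, vocab)]

print(word_tokens)     # ['Chatbots', 'are', 'helpful']
print(char_tokens)     # ['C', 'h', 'a', 't', ...]
print(subword_tokens)  # ['Chat', 'bots', 'are', 'help', 'ful']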
o Tokenization Use Cases:

Tokenization serves as the backbone for a myriad of applications in the digital realm, enabling machines to process and understand vast amounts of text data. By breaking down text into manageable chunks, tokenization facilitates more efficient and accurate data analysis. Here are some prominent use cases where tokenization plays a pivotal role:

 Search engines: When you type into a search engine like Google,
it employs tokenization to dissect your input. This breakdown
helps the engine sift through billions of documents to present you
with the most relevant results.

 Machine translation: Tools such as Google Translate utilize tokenization to segment sentences in the source language. Once tokenized, these segments can be translated and then reconstructed in the target language, ensuring the translation retains the original context.

 Speech recognition: Voice-activated assistants like Siri or Alexa rely heavily on tokenization. When you pose a question or command, your spoken words are first converted into text. This text is tokenized, allowing the system to process and act upon your request.

 Stemming:

o Stemming is a natural language processing technique that is used to reduce words to their base form, also known as the root form. The process of stemming is used to normalize text and make it easier to process. It is an important step in text pre-processing, and it is commonly used in information retrieval and text mining applications.

o There are several different algorithms for stemming, including the Porter stemmer, the Snowball stemmer, and the Lancaster stemmer. The Porter stemmer is the most widely used algorithm, and it is based on a set of heuristics that are used to remove common suffixes from words. The Snowball stemmer is a more advanced algorithm that is based on the Porter stemmer, but it also supports several other languages in addition to English. The Lancaster stemmer is a more aggressive stemmer, and it is less accurate than the Porter and Snowball stemmers. All three are compared in the sketch below.
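o Example: a minimal sketch comparing the three stemmers named above, all of which NLTK provides; the word list is illustrative.

from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

words = ["running", "flies", "studies", "easily", "agreed"]

porter = PorterStemmer()
snowball = SnowballStemmer("english")
lancaster = LancasterStemmer()

# Print each word's stem under all three algorithms for comparison
for word in words:
    print(f"{word:>10} | Porter: {porter.stem(word):<8}"
          f" Snowball: {snowball.stem(word):<8}"
          f" Lancaster: {lancaster.stem(word)}")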

o Stemming can be useful for several natural language processing tasks such as text
classification, information retrieval, and text summarization. However, stemming
can also have some negative effects such as reducing the readability of the text,
and it may not always produce the correct root form of a word.
o Applications of Stemming:

1. Stemming is used in information retrieval systems like search engines.
2. It is used to determine domain vocabularies in domain analysis.
3. It is used in indexing for displaying search results: stemming maps different forms of a word to a common term, so documents can be matched to common subjects.
4. Sentiment analysis, which examines reviews and comments made by different users about anything, is frequently used for product analysis, such as for online retail stores. Stemming is commonly applied as a text-preparation step before the text is interpreted.
5. Document clustering, a method of group analysis applied to textual materials, has important uses including subject extraction, automatic document structuring, and quick information retrieval.

o Some Stemming algorithms are:

 Porter’s Stemmer algorithm

 It is one of the most popular stemming methods, proposed in 1980. It is based on the idea that the suffixes in the English language are made up of a combination of smaller and simpler suffixes. This stemmer is known for its speed and simplicity. The main applications of the Porter stemmer include data mining and information retrieval. However, its applications are limited to English words. Also, a group of words may be mapped onto the same stem, and the output stem is not necessarily a meaningful word.
 Example: the rule EED → EE means “if the word has at least one vowel and consonant plus EED ending, change the ending to EE”, so ‘agreed’ becomes ‘agree’.
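 Example (sketch): a toy version of that single rule, not the full Porter algorithm; the “at least one vowel and consonant” condition is approximated with a simple pattern check.

import re

def eed_to_ee(word):
    # Toy EED -> EE rule: if the stem before "eed" contains at least one
    # vowel followed by a consonant, rewrite the "eed" ending as "ee"
    if word.endswith("eed"):
        stem = word[:-3]
        if re.search(r"[aeiou][^aeiou]", stem):
            return stem + "ee"
    return word

print(eed_to_ee("agreed"))  # agree
print(eed_to_ee("feed"))    # feed (condition not met, so the rule does not fire)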

 Snowball Stemmer:

 When compared to the Porter stemmer, the Snowball stemmer can map non-English words too. Since it supports other languages, the Snowball stemmer can be called a multilingual stemmer. The Snowball stemmers are also imported from the nltk package. This stemmer is based on a programming language called ‘Snowball’ that processes small strings, and it is one of the most widely used stemmers.

 The Snowball stemmer is more aggressive than the Porter stemmer and is also referred to as the Porter2 stemmer. Because of the improvements added when compared to the Porter stemmer, the Snowball stemmer has greater computational speed.
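 Example: a short sketch of the multilingual support using NLTK's SnowballStemmer; the German words are illustrative.

from nltk.stem import SnowballStemmer

# Languages supported by the Snowball stemmer in NLTK
print(SnowballStemmer.languages)

# Stemming a few German words as a non-English illustration
german = SnowballStemmer("german")
for word in ["katzen", "laufen", "häuser"]:
    print(word, "->", german.stem(word))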

 Text Classification:
o Text classification is a common NLP task used to solve business problems in various fields. The goal of text classification is to categorize or predict a class of unseen text documents, often with the help of supervised machine learning. Similar to a classification algorithm that has been trained on a tabular dataset to predict a class, text classification also uses supervised machine learning. The fact that raw text is involved is the main distinction between the two.

o Text Classification Use-Cases and Applications:

 Spam Classification: There are many practical use cases for text
classification across many industries. For example, a spam filter is a
common application that uses text classification to sort emails into spam
and non-spam categories.

 Classifying news articles and blogs: Another use case is to automatically assign text documents into predetermined categories. A supervised machine learning model is trained on labeled data, which includes both the raw text and the target. Once a model is trained, it is then used in production to obtain a category on new and unseen data.

 Categorize customer support requests: A company might use text classification to automatically categorize customer support requests by topic or to prioritize and route requests to the appropriate department.

 Hate speech detection: With over 1.7 billion daily active users,
Facebook inevitably has content created on the site that is against the
rules. Hate speech is included in this undesirable content. Facebook
tackles this issue by requesting a manual review of postings that an AI text
classifier has identified as hate speech. Postings that were flagged by AI
are examined in the same manner as posts that users have reported. In fact,
in just the first three months of 2020, the platform removed 9.6 million
items of content that had been classified as hate speech.

o Types of Text Classification Systems:

 Rule-based text classification: Rule-based techniques use a set of manually constructed language rules to categorize text into categories or groups. These rules tell the system to classify text into a particular category based on the content of a text by using semantically relevant textual elements. An antecedent or pattern and a projected category make up each rule. For example, imagine you have tons of news articles, and your goal is to assign them to relevant categories such as Sports, Politics, Economy, etc. (a toy version of such rules is sketched below). Rule-based systems can be refined over time and are understandable to humans. However, there are certain drawbacks to this strategy.
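 Example: a toy rule-based classifier for the Sports/Politics/Economy scenario above. Each rule pairs a keyword pattern (the antecedent) with a projected category; the keywords are made up for illustration.

# Each rule: a projected category and the keyword patterns that trigger it
rules = {
    "Sports": ["match", "tournament", "goal"],
    "Politics": ["election", "parliament", "minister"],
    "Economy": ["inflation", "stocks", "market"],
}

def classify(text):
    text = text.lower()
    for category, keywords in rules.items():
        # Assign the first category whose keywords appear in the text
        if any(keyword in text for keyword in keywords):
            return category
    return "Unknown"

print(classify("The minister addressed parliament today"))  # Politics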

 Machine learning-based text classification: Machine learning-based text classification is a supervised machine learning problem. It learns the mapping of input data with the labels. This is similar to non-text classification problems where we train a supervised classification algorithm on a tabular dataset to predict a class, with the exception that in text classification, the input data is raw text instead of numeric features.

 Training phase: A supervised machine learning algorithm is trained on the input-labeled dataset during the training phase. At the end of this process, we get a trained model that we can use to obtain predictions on new and unseen data.

 Prediction phase: Once a machine learning model is trained, it can be used to predict labels on new and unseen data. This is usually done by deploying the best model from an earlier phase as an API on the server. Both phases are sketched below.
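 Example: a minimal sketch of both phases, assuming scikit-learn (the blog does not prescribe a library); the tiny inline spam dataset is made up for illustration.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Training phase: raw text plus labels (spam vs. non-spam)
texts = [
    "Win a free prize now", "Limited offer, click here",
    "Meeting moved to 3pm", "Lunch tomorrow?",
]
labels = ["spam", "spam", "not spam", "not spam"]

# TF-IDF turns raw text into numeric features; Naive Bayes learns the mapping
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)

# Prediction phase: obtain a category for new and unseen text
print(model.predict(["Claim your free offer today"]))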

 Named Entity Recognition (NER):

o Named Entity Recognition is a crucial task in Natural Language Processing that involves identifying and classifying entities within a given text. The goal of NER is to extract structured information from unstructured text data, making it easier to analyze and understand the content. NER plays a key role in various applications, including information retrieval, question-answering systems, and language translation.
o Named Entity Recognition is a sub-task of information extraction in Natural
Language Processing that classifies named entities into predefined categories such
as person names, organizations, locations, medical codes, time expressions,
quantities, monetary values, and more. In the realm of NLP, understanding these
entities is crucial for many applications, as they often contain the most significant
information in a text.

o Named Entity Recognition serves as a bridge between unstructured text and structured data, enabling machines to sift through vast amounts of textual information and extract nuggets of valuable data in categorized forms. By pinpointing specific entities within a sea of words, NER transforms the way we process and utilize textual data.

o How it works: The intricacies of NER can be broken down into several steps (a code sketch follows the list):

1. Tokenization: Before identifying entities, the text is split into tokens, which
can be words, phrases, or even sentences. For instance, “Steve Jobs co-
founded Apple” would be split into tokens like “Steve”, “Jobs”, “co-
founded”, “Apple”.

2. Entity identification: Using various linguistic rules or statistical methods, potential named entities are detected. This involves recognizing patterns, such as capitalization in names or specific formats.

3. Entity classification: Once entities are identified, they are categorized into
predefined classes such as "Person", "Organization", or "Location". This is
often achieved using machine learning models trained on labeled datasets. For
our example, "Steve Jobs" would be classified as a "Person" and "Apple" as
an "Organization".

4. Contextual analysis: NER systems often consider the surrounding context to improve accuracy. For instance, in the sentence "Apple released a new iPhone", the context helps the system recognize "Apple" as an organization rather than a fruit.

5. Post-processing: After initial recognition and classification, post-processing might be applied to refine results. This could involve resolving ambiguities, merging multi-token entities, or using knowledge bases to enhance entity data.
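o Example: a minimal sketch of steps 1-3 using NLTK's built-in part-of-speech tagger and named-entity chunker, applied to the sentence from step 1 (resource names can vary slightly across NLTK versions).

import nltk

# One-time downloads of the models the tokenizer, tagger, and chunker need
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

sentence = "Steve Jobs co-founded Apple"

# Step 1: tokenize; Steps 2-3: tag parts of speech, then chunk and classify entities
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)
tree = nltk.ne_chunk(tagged)

# Print each recognized entity with its predicted class
for subtree in tree:
    if hasattr(subtree, "label"):
        entity = " ".join(token for token, tag in subtree.leaves())
        print(entity, "->", subtree.label())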

o Named Entity Recognition Methods:

Named Entity Recognition (NER) has seen many methods developed over the
years, each tailored to address the unique challenges of extracting and
categorizing named entities from vast textual landscapes.
 Rule-based Methods: Rule-based methods are grounded in manually
crafted rules. They identify and classify named entities based on linguistic
patterns, regular expressions, or dictionaries. While they shine in specific
domains where entities are well-defined, such as extracting standard
medical terms from clinical notes, their scalability is limited. They might
struggle with large or diverse datasets due to the rigidity of predefined
rules.

 Statistical Methods: Transitioning from manual rules, statistical methods employ models like Hidden Markov Models (HMM) or Conditional Random Fields (CRF). They predict named entities based on likelihoods derived from training data. These methods are apt for tasks with ample labeled datasets at their disposal. Their strength lies in generalizing across diverse texts, but they're only as good as the training data they're fed.

 Machine Learning Methods: Machine learning methods take it a step further by using algorithms such as decision trees or support vector machines. They learn from labeled data to predict named entities. Their widespread adoption in modern NER systems is attributed to their prowess in handling vast datasets and intricate patterns. However, they're hungry for substantial labeled data and can be computationally demanding.

 Deep Learning Methods: The latest in the line are deep learning
methods, which harness the power of neural networks. Recurrent Neural
Networks (RNN) and transformers have become the go-to for many due to
their ability to model long-term dependencies in text. They're ideal for
large-scale tasks with abundant training data but come with the caveat of
requiring significant computational might.

 Hybrid Methods: Lastly, there's no one-size-fits-all in NER, leading to the emergence of hybrid methods. These techniques intertwine rule-based, statistical, and machine learning approaches, aiming to capture the best of all worlds. They're especially valuable when extracting entities from diverse sources, offering the flexibility of multiple methods. However, their intertwined nature can make them complex to implement and maintain.
o Named Entity Recognition Use Cases:

NER has found applications across diverse sectors, transforming the way
we extract and utilize information. Here's a glimpse into some of its
pivotal applications:

 News aggregation: NER is instrumental in categorizing news articles by the primary entities mentioned. This categorization aids readers in swiftly locating stories about specific people, places, or organizations, streamlining the news consumption process.

 Customer support: Analyzing customer queries becomes more efficient with NER. Companies can swiftly pinpoint common issues related to specific products or services, ensuring that customer concerns are addressed promptly and effectively.

 Research: For academics and researchers, NER is a boon. It allows them to scan vast volumes of text, identifying mentions of specific entities relevant to their studies. This automated extraction speeds up the research process and ensures comprehensive data analysis.

 Legal document analysis: In the legal sector, sifting through lengthy documents to find relevant entities like names, dates, or locations can be tedious. NER automates this, making legal research and analysis more efficient.
