Introduction:
o To delve deeper into the mechanics, consider the sentence, “Chatbots are helpful”.
When we tokenize this sentence by words, it transforms into an array of
individual words: ["Chatbots", "are", "helpful"].
o Example:
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # tokenizer models (needed on first run only)

# Sample sentence
sentence = "Natural Language Processing is fascinating!"
tokens = word_tokenize(sentence)
print(tokens)  # ['Natural', 'Language', 'Processing', 'is', 'fascinating', '!']
Tokenization methods vary based on the granularity of the text breakdown and
the specific requirements of the task at hand. These methods can range from
dissecting text into individual words to breaking it down into characters or
even smaller units. Here’s a closer look at the different types:
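To make these granularity levels concrete, here is a minimal pure-Python sketch contrasting word-level and character-level tokenization of the same sentence. Splitting on whitespace is a simplified stand-in for library tokenizers, which also handle punctuation:

```python
sentence = "Chatbots are helpful"

# Word-level tokenization: split on whitespace.
word_tokens = sentence.split()
print(word_tokens)  # ['Chatbots', 'are', 'helpful']

# Character-level tokenization: break the text into individual characters.
char_tokens = list(sentence.replace(" ", ""))
print(char_tokens[:8])  # ['C', 'h', 'a', 't', 'b', 'o', 't', 's']
```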
Search engines: When you type into a search engine like Google,
it employs tokenization to dissect your input. This breakdown
helps the engine sift through billions of documents to present you
with the most relevant results.
Stemming:
o There are several different algorithms for stemming, including the Porter
stemmer, the Snowball stemmer, and the Lancaster stemmer. The Porter stemmer is
the most widely used algorithm; it is based on a set of heuristics for removing
common suffixes from words. The Snowball stemmer is a more advanced
algorithm built on the Porter stemmer, and it also supports several other
languages in addition to English. The Lancaster stemmer is a more aggressive
stemmer, and it is generally less accurate than the Porter and Snowball
stemmers.
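All three stemmers discussed above are available in NLTK (assuming the `nltk` package is installed); the snippet below compares their output side by side on a few sample words:

```python
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")  # Snowball also supports other languages
lancaster = LancasterStemmer()

# Compare how aggressively each algorithm strips suffixes.
for word in ["running", "happiness", "relational"]:
    print(word, "->", porter.stem(word), snowball.stem(word), lancaster.stem(word))
```

Running this makes the trade-off visible: the Lancaster stemmer often truncates words further than the other two, illustrating why it is considered more aggressive.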
o Stemming can be useful for several natural language processing tasks such as text
classification, information retrieval, and text summarization. However, stemming
can also have some negative effects such as reducing the readability of the text,
and it may not always produce the correct root form of a word.
o Applications of Stemming:
Lovins Stemmer:
Text Classification:
o Text classification is a common NLP task used to solve business problems in
various fields. The goal of text classification is to categorize or predict the class of
unseen text documents, often with the help of supervised machine learning. Like
a classification algorithm trained on a tabular dataset to predict a class, text
classification relies on supervised machine learning; the main distinction between
the two is that the input is text rather than tabular features.
Spam Classification: There are many practical use cases for text
classification across many industries. For example, a spam filter is a
common application that uses text classification to sort emails into spam
and non-spam categories.
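The spam filter described above can be sketched as a tiny bag-of-words Naive Bayes classifier written from scratch; the toy training examples below are invented for illustration, and a real filter would train on thousands of labeled emails:

```python
from collections import Counter
import math

# Toy labeled training data (invented for illustration).
train = [
    ("win money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting agenda attached", "ham"),
    ("lunch with the team", "ham"),
]

# Count word occurrences per class.
counts = {"spam": Counter(), "ham": Counter()}
for text, label in train:
    counts[label].update(text.split())

vocab = set(w for c in counts.values() for w in c)

def predict(text):
    scores = {}
    for label, c in counts.items():
        total = sum(c.values())
        # Sum log-probabilities with add-one (Laplace) smoothing;
        # log P(label) is equal for both classes here, so it is omitted.
        scores[label] = sum(
            math.log((c[w] + 1) / (total + len(vocab)))
            for w in text.split()
        )
    return max(scores, key=scores.get)

print(predict("claim your free money"))    # spam
print(predict("team meeting about lunch")) # ham
```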
Hate speech detection: With over 1.7 billion daily active users,
Facebook inevitably has content created on the site that is against the
rules. Hate speech is included in this undesirable content. Facebook
tackles this issue by requesting a manual review of postings that an AI text
classifier has identified as hate speech. Postings that were flagged by AI
are examined in the same manner as posts that users have reported. In fact,
in just the first three months of 2020, the platform removed 9.6 million
items of content that had been classified as hate speech.
o How it works: The intricacies of NER can be broken down into several steps:
1. Tokenization: Before identifying entities, the text is split into tokens, which
can be words, phrases, or even sentences. For instance, “Steve Jobs co-
founded Apple” would be split into tokens like “Steve”, “Jobs”, “co-
founded”, “Apple”.
2. Entity identification: The token sequence is then scanned to detect spans that
form potential named entities, using cues such as capitalization and
surrounding context. In our example, “Steve Jobs” and “Apple” would be
flagged as candidate entities.
3. Entity classification: Once entities are identified, they are categorized into
predefined classes such as "Person", "Organization", or "Location". This is
often achieved using machine learning models trained on labeled datasets. For
our example, "Steve Jobs" would be classified as a "Person" and "Apple" as
an "Organization".
Named Entity Recognition (NER) has seen many methods developed over the
years, each tailored to address the unique challenges of extracting and
categorizing named entities from vast textual landscapes.
Rule-based Methods: Rule-based methods are grounded in manually
crafted rules. They identify and classify named entities based on linguistic
patterns, regular expressions, or dictionaries. While they shine in specific
domains where entities are well-defined, such as extracting standard
medical terms from clinical notes, their scalability is limited. They might
struggle with large or diverse datasets due to the rigidity of predefined
rules.
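As a sketch of the rule-based approach, a hand-crafted dictionary (gazetteer) lookup can serve as a minimal entity matcher; the entries below are invented for illustration, and real systems combine many such dictionaries with linguistic patterns:

```python
import re

# Tiny hand-crafted gazetteer: entity string -> class (illustrative only).
gazetteer = {
    "Steve Jobs": "Person",
    "Apple": "Organization",
}

def rule_based_ner(text):
    entities = []
    # Match longer entries first so "Steve Jobs" is preferred over "Steve".
    for name in sorted(gazetteer, key=len, reverse=True):
        for m in re.finditer(re.escape(name), text):
            entities.append((m.group(), gazetteer[name], m.start()))
    # Return entities in order of appearance.
    return sorted(entities, key=lambda e: e[2])

print(rule_based_ner("Steve Jobs co-founded Apple"))
# [('Steve Jobs', 'Person', 0), ('Apple', 'Organization', 22)]
```

The rigidity the text describes is visible here: any entity missing from the dictionary is simply never found, which is why such systems scale poorly to diverse datasets.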
Deep Learning Methods: The latest in the line are deep learning
methods, which harness the power of neural networks. Recurrent Neural
Networks (RNNs) and transformers have become the go-to for many due to
their ability to model long-term dependencies in text. They are ideal for
large-scale tasks with abundant training data, but come with the caveat of
requiring significant computational resources.
NER has found applications across diverse sectors, transforming the way
we extract and utilize information. Here's a glimpse into some of its
pivotal applications: