
Welcome!

Ladies & Gentlemen, to the course of

Natural Language
Processing

Instructor
M. Faheem Khan
Lecture No. 3
Previous Review
01 Language Processing

02 Levels of Text Processing

03 Language Processing Pipeline

04 Stages of Comprehensive NLP System

05 Installation – Python – NLTK

06 Preprocessing Text - Tokenization


Today’s Agenda
01 Text Preprocessing

02 Text Tokenization

03 Text Normalization

04 Stop Word Removal

05 Part of Speech (POS) Tagging

06 Stemming & Lemmatization


Text Preprocessing

Text Preprocessing is traditionally an important step in NLP tasks. It transforms text into a more digestible form so that the data can be used in further NLP processing tasks.
Text Preprocessing
Text Preprocessing involves the following steps:
Text Tokenization
Text Normalization
Stop Word Removal
POS Tagging
Stemming / Lemmatization
Removing HTML Tags
Removing extra spaces
Removing Numbers
Removing Special Characters
Expand Contractions
Conversion of Accented characters
Conversion of upper case into lower case
Conversion of lower case into upper case
Conversion of number words into numeric form
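Several of the cleanup steps listed above (removing HTML tags, removing numbers, collapsing extra spaces, and converting upper case to lower case) can be sketched with Python's built-in re module. The helper name clean_text and the sample string are illustrative, not part of any library:

```python
import re

def clean_text(text):
    """Apply a few of the preprocessing steps listed above."""
    text = re.sub(r"<[^>]+>", " ", text)       # remove HTML tags
    text = re.sub(r"\d+", " ", text)           # remove numbers
    text = text.lower()                        # upper case -> lower case
    text = re.sub(r"\s+", " ", text).strip()   # remove extra spaces
    return text

print(clean_text("<p>NLP   has 2 main goals!</p>"))
# -> nlp has main goals!
```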
Tokenization
The process of breaking down a text into smaller chunks, such as words or sentences, is called Tokenization.

Types of Tokenization
1- Word Tokenization
2- Sentence Tokenization
3- Punctuation Tokenization
4- Regular Expression Tokenization
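Punctuation and regular-expression tokenization (types 3 and 4 above) can be sketched with Python's built-in re module, without any NLTK downloads; the patterns and the sample sentence are illustrative:

```python
import re

text = "Hello, world! NLP is fun."

# Regular-expression tokenization: keep only runs of word characters
word_tokens = re.findall(r"\w+", text)
print(word_tokens)   # ['Hello', 'world', 'NLP', 'is', 'fun']

# Punctuation tokenization: words and punctuation marks become separate tokens
punct_tokens = re.findall(r"\w+|[^\w\s]", text)
print(punct_tokens)  # ['Hello', ',', 'world', '!', 'NLP', 'is', 'fun', '.']
```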
Tokenization
Word Tokenization
Converting a sentence / paragraph into words.

import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # tokenizer models, needed once

text = "NLTK makes tokenization easy. Try it!"  # sample text for illustration
words = word_tokenize(text)
print(words)

Tokenization
Sentence Tokenization
Converting a paragraph into sentences.

import nltk
from nltk.tokenize import sent_tokenize

nltk.download('punkt')  # tokenizer models, needed once

text = "NLTK makes tokenization easy. Try it!"  # sample text for illustration
sent = sent_tokenize(text)
print(sent)
Stop Words Removal
Just as a phone call tends to carry some noise, that is, unwanted information, text from the real world also contains noise, which is termed stopwords. Stopwords vary from language to language, but they can be easily identified. Some of the stopwords in the English language are: is, are, a, the, an, etc.

We can look at words which are considered as Stopwords by NLTK for English language with the
following code snippet:

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

language = "english"
stop_words = set(stopwords.words(language))
print(stop_words)
Stop Words Removal
These stop words should be removed from the text if you
want to perform a precise text analysis for the piece of
text provided. Let’s remove the stop words from our
textual tokens:

filtered_words = []

for word in words:
    if word not in stop_words:
        filtered_words.append(word)

print(filtered_words)
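The same filtering can be written more compactly as a list comprehension. The tiny stop-word set and word list below are hardcoded samples for illustration; the slide uses NLTK's full English stop-word list instead:

```python
# A tiny illustrative stop-word set (NLTK's real English list is much larger)
stop_words = {"is", "are", "a", "an", "the"}

words = ["NLTK", "is", "a", "leading", "platform", "for", "NLP"]

# Keep only the words that are not stopwords
filtered_words = [w for w in words if w.lower() not in stop_words]
print(filtered_words)  # ['NLTK', 'leading', 'platform', 'for', 'NLP']
```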
POS Tagging
Identifying and grouping each word by its role, i.e., whether each word is a noun, a verb, or something else, is termed Part of Speech (POS) tagging. Let's perform POS tagging now:

nltk.download('averaged_perceptron_tagger')  # POS tagger model, needed once

tokens = nltk.word_tokenize(sent[0])
print(nltk.pos_tag(tokens))

Frequency Distribution
We can also calculate the frequency of each word in the text we used. It is very simple to do with NLTK; here is the code:

from nltk.probability import FreqDist

distribution = FreqDist(words)
print(distribution)

Frequency Distribution
Next, we can find the most common words in the text with a simple method that accepts the number of words to show:

# Most common words


distribution.most_common(2)
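NLTK's FreqDist behaves much like Python's built-in collections.Counter, which also provides a most_common method; this stdlib sketch, using a made-up word list, shows the same idea without any NLTK download:

```python
from collections import Counter

words = ["the", "cat", "sat", "on", "the", "mat", "the", "end"]

# Count how often each word occurs
distribution = Counter(words)

# The two most frequent words, with their counts
print(distribution.most_common(2))  # [('the', 3), ('cat', 1)]
```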

Thank you