Natural Language
Processing
Instructor
M. Faheem Khan
Lecture No 3
Review of Previous Lecture
01 Language Processing
02 Text Tokenization
03 Text Normalization
Types of Tokenization
1- Word Tokenization
2- Sentence Tokenization
3- Punctuation Tokenization
4- Regular Expression Tokenization
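To make the four types concrete, here is a minimal sketch using only Python's re module (NLTK provides ready-made tokenizers for each of these; the sample text and patterns below are illustrative assumptions, not NLTK's actual rules):

```python
import re

text = "NLP is fun. Tokenization splits text!"

# 1- Word tokenization: runs of word characters
words = re.findall(r"\w+", text)

# 2- Sentence tokenization: naive split after sentence-ending punctuation
sentences = re.split(r"(?<=[.!?])\s+", text)

# 3- Punctuation tokenization: words and punctuation marks as separate tokens
punct_tokens = re.findall(r"\w+|[^\w\s]", text)

# 4- Regular-expression tokenization: any custom pattern, e.g. capitalized words
caps = re.findall(r"[A-Z][a-z]+", text)

print(words, sentences, punct_tokens, caps, sep="\n")
```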
Tokenization
Word Tokenization
Converting a sentence / paragraph into words.
We can look at the words that NLTK considers stopwords for the English language with the
following code snippet:
import nltk
from nltk.corpus import stopwords
nltk.download("stopwords")  # fetch the stopword lists if not already present
language = "english"
stop_words = set(stopwords.words(language))
print(stop_words)
Stop Words Removal
These stop words should be removed from the text if you
want to perform a precise analysis of the remaining
content. Let's remove the stop words from our textual tokens:
# keep only the tokens that are not in the stopword set
filtered_words = [word for word in tokens if word.lower() not in stop_words]
print(filtered_words)
POS Tagging
To identify the grammatical role of each word, i.e. whether it is a noun, a verb,
or something else. This is termed Part of Speech (POS) tagging. Let's perform POS tagging now:
import nltk
nltk.download("averaged_perceptron_tagger")  # tagger model used by nltk.pos_tag
tokens = nltk.word_tokenize(sentences[0])  # sentences comes from the earlier sentence-tokenization step
print(tokens)
print(nltk.pos_tag(tokens))
distribution = nltk.FreqDist(tokens)  # frequency of each token
print(distribution)
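FreqDist is essentially a counter over tokens. A minimal sketch of the same idea with the standard library's collections.Counter (the token list below is a made-up illustration, not output from the code above):

```python
from collections import Counter

# hypothetical token list, as produced by a word tokenizer
tokens = ["the", "cat", "sat", "on", "the", "mat"]

distribution = Counter(tokens)  # maps each token to its frequency
print(distribution["the"])
print(distribution.most_common(1))
```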