
Shristi Tiwari: BE Comp 141

EXPERIMENT NO 2

Aim: Write a Program to perform Tokenization and Filtration.


Theory:

1. Tokenization:
In Python, tokenization refers to splitting a larger body of text into smaller units such as lines or words, and it works for non-English languages as well.
Natural Language Processing (NLP) is a subfield of computer science, artificial intelligence,
information engineering, and human-computer interaction. The field focuses on programming
computers to process and analyze large amounts of natural language data. This is difficult
because reading and understanding language is far more complex than it seems at first glance.

Tokenization is the process of splitting a string or text into a list of tokens. One can
think of a token as a part of a larger unit: a word is a token in a sentence, and a sentence is
a token in a paragraph.
Various tokenization functions are built into the nltk module and can be used in
programs as shown below.
❖ Sentence Tokenization
In the example below we divide a given text into individual sentences by using the function
sent_tokenize.
import nltk
# nltk.download('punkt')  # Punkt sentence models; needed once before first use
sentence_data = "The First sentence is about Python. The Second: about Django. You can learn Python, Django and Data Analysis here. "
nltk_tokens = nltk.sent_tokenize(sentence_data)
print(nltk_tokens)

Output:
['The First sentence is about Python.', 'The Second: about Django.', 'You can learn Python, Django and Data Analysis here.']

❖ Non-English Tokenization
In the example below we tokenize German text using the pre-trained Punkt model shipped with nltk.
import nltk
german_tokenizer = nltk.data.load('tokenizers/punkt/german.pickle')
# 'Wie geht es Ihnen? Gut, danke.' means 'How are you? Fine, thanks.'
german_tokens = german_tokenizer.tokenize('Wie geht es Ihnen? Gut, danke.')
print(german_tokens)

Output:
['Wie geht es Ihnen?', 'Gut, danke.']

❖ Word Tokenization
We tokenize words using the word_tokenize function available in nltk.
import nltk
word_data = "It originated from the idea that there are readers who prefer learning new skills
from the comforts of their drawing rooms"
nltk_tokens = nltk.word_tokenize(word_data)
print (nltk_tokens)

Output:
['It', 'originated', 'from', 'the', 'idea', 'that', 'there', 'are', 'readers',
'who', 'prefer', 'learning', 'new', 'skills', 'from', 'the',
'comforts', 'of', 'their', 'drawing', 'rooms']

2. Filtration:
Filtering is the process of removing stop words or any of the unnecessary data from the given
sentence.
Many of the words used in the phrase are insignificant and hold no meaning. For example –
English is a subject. Here, ‘English’ and ‘subject’ are the most significant words and ‘is’, ‘a’ are
almost useless. English subject and subject English holds the same meaning even if we remove
the insignificant words – (‘is’, ‘a’). Using the nltk, we can remove the insignificant words by
looking at their part-of-speech tags. For that we have to decide which Part-Of-Speech tags are
significant.
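
The same idea can also be illustrated with nltk's built-in English stop-word list. This is a minimal sketch, assuming the punkt and stopwords corpora have already been downloaded:

import nltk
from nltk.corpus import stopwords
# nltk.download('punkt'); nltk.download('stopwords')  # one-time downloads

words = nltk.word_tokenize("English is a subject")
stop_words = set(stopwords.words('english'))
# Keep only the words that are not in the stop-word list
filtered = [w for w in words if w.lower() not in stop_words]
print(filtered)  # expected: ['English', 'subject']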

Code:
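
The filter_insignificant function used below is not part of nltk itself; a minimal definition is assumed here (based on the common NLTK cookbook recipe), which drops every tagged word whose tag ends with one of the given suffixes:

def filter_insignificant(chunk, tag_suffixes=['DT', 'CC']):
    # Keep only the (word, tag) pairs whose tag does not end
    # with any of the suffixes considered insignificant.
    return [(word, tag) for word, tag in chunk
            if not any(tag.endswith(suffix) for suffix in tag_suffixes)]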

print ("Significant words : \n",


filter_insignificant([('your', 'PRP$'),
('book', 'NN'), ('is', 'VBZ'),
('great', 'JJ')],
tag_suffixes = ['PRP', 'PRP$']))

Output:

Significant words :
 [('book', 'NN'), ('is', 'VBZ'), ('great', 'JJ')]
Conclusion: Hence, we have written and executed a program to perform Tokenization and Filtration.
