
Welcome!

Ladies & Gentlemen, to the course of

Natural Language
Processing

Instructor
M. Faheem Khan
Lecture No. 3
Previous Review
01 Language Processing

02 Levels of Text Processing

03 Language Processing Pipeline

04 Stages of Comprehensive NLP System

05 Installation – Python – NLTK

06 Preprocessing Text - Tokenization


Today’s Agenda
01 Text Preprocessing

02 Text Tokenization

03 Text Normalization

04 Stop Word Removal

05 Part of Speech (POS) Tagging

06 Stemming & Lemmatization


Text Preprocessing

Text Preprocessing is traditionally an important step in NLP tasks. It transforms text into a more digestible form so that the data can be used in further NLP processing tasks.
Text Preprocessing
Text Preprocessing involves the following steps:
Text Tokenization
Text Normalization
Stop Word Removal
POS Tagging
Stemming / Lemmatization
Removing HTML Tags
Removing extra spaces
Removing Numbers
Removing Special Characters
Expand Contractions
Conversion of Accented characters
Conversion of upper case into lower case
Conversion of lower case into upper case
Conversion of number words into numeric form
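Several of the cleanup steps listed above (removing HTML tags, removing numbers, collapsing extra spaces, and converting upper case to lower case) can be sketched with Python's built-in re module. The helper name clean_text and the sample string are illustrative, not part of any library:

```python
import re

def clean_text(text):
    """Apply a few of the preprocessing steps listed above."""
    text = re.sub(r"<[^>]+>", " ", text)       # remove HTML tags
    text = re.sub(r"\d+", " ", text)           # remove numbers
    text = text.lower()                        # upper case -> lower case
    text = re.sub(r"\s+", " ", text).strip()   # remove extra spaces
    return text

print(clean_text("<p>NLP   has 2 main goals!</p>"))
# -> nlp has main goals!
```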
Tokenization
The process of breaking down a text into smaller chunks, such as words or sentences, is called Tokenization.

Types of Tokenization
1- Word Tokenization
2- Sentence Tokenization
3- Punctuation Tokenization
4- Regular Expression Tokenization
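Punctuation and regular-expression tokenization (types 3 and 4 above) can be sketched with Python's built-in re module, without any NLTK downloads; the patterns and the sample sentence are illustrative:

```python
import re

text = "Hello, world! NLP is fun."

# Regular-expression tokenization: keep only runs of word characters
word_tokens = re.findall(r"\w+", text)
print(word_tokens)   # ['Hello', 'world', 'NLP', 'is', 'fun']

# Punctuation tokenization: words and punctuation marks become separate tokens
punct_tokens = re.findall(r"\w+|[^\w\s]", text)
print(punct_tokens)  # ['Hello', ',', 'world', '!', 'NLP', 'is', 'fun', '.']
```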
Tokenization
Word Tokenization
Converting a sentence / paragraph into words.

import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # tokenizer models, needed once

text = "NLTK makes tokenization easy. Try it!"  # sample text for illustration
words = word_tokenize(text)
print(words)

Tokenization
Sentence Tokenization
Converting a paragraph into sentences.

import nltk
from nltk.tokenize import sent_tokenize

nltk.download('punkt')  # tokenizer models, needed once

text = "NLTK makes tokenization easy. Try it!"  # sample text for illustration
sent = sent_tokenize(text)
print(sent)
Stop Words Removal
Just as a phone call tends to carry some noise, that is, unwanted information, text from the real world also contains noise, which is termed stopwords. Stopwords vary from language to language, but they can be easily identified. Some of the stopwords in the English language are: is, are, a, the, an, etc.

We can look at words which are considered as Stopwords by NLTK for English language with the
following code snippet:

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

language = "english"
stop_words = set(stopwords.words(language))
print(stop_words)
Stop Words Removal
These stop words should be removed from the text if you
want to perform a precise text analysis for the piece of
text provided. Let’s remove the stop words from our
textual tokens:

filtered_words = []

for word in words:
    if word not in stop_words:
        filtered_words.append(word)

print(filtered_words)
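The same filtering can be written more compactly as a list comprehension. The tiny stop-word set and word list below are hardcoded samples for illustration; the slide uses NLTK's full English stop-word list instead:

```python
# A tiny illustrative stop-word set (NLTK's real English list is much larger)
stop_words = {"is", "are", "a", "an", "the"}

words = ["NLTK", "is", "a", "leading", "platform", "for", "NLP"]

# Keep only the words that are not stopwords
filtered_words = [w for w in words if w.lower() not in stop_words]
print(filtered_words)  # ['NLTK', 'leading', 'platform', 'for', 'NLP']
```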
POS Tagging
Identifying and grouping each word by its role, i.e., whether each word is a noun, a verb, or something else, is termed Part of Speech (POS) tagging. Let's perform POS tagging now:

nltk.download('averaged_perceptron_tagger')  # POS tagger model, needed once

tokens = nltk.word_tokenize(sent[0])
print(nltk.pos_tag(tokens))

Frequency Distribution
We can also calculate the frequency of each word in the text we used. It is very simple to do with NLTK; here is the code:

from nltk.probability import FreqDist

distribution = FreqDist(words)
print(distribution)

Frequency Distribution
Next, we can find the most common words in the text with a simple method that accepts the number of words to show:

# Most common words


distribution.most_common(2)
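NLTK's FreqDist behaves much like Python's built-in collections.Counter, which also provides a most_common method; this stdlib sketch, using a made-up word list, shows the same idea without any NLTK download:

```python
from collections import Counter

words = ["the", "cat", "sat", "on", "the", "mat", "the", "end"]

# Count how often each word occurs
distribution = Counter(words)

# The two most frequent words, with their counts
print(distribution.most_common(2))  # [('the', 3), ('cat', 1)]
```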

Thank you