Using Zoom: Viewing Mode
Introduction to Text Processing Using the Natural Language Toolkit (NLTK)
Dr. Rianto, S.Kom., M. Eng.
What is Language?
Way of Communication
Speaker Listener
Difference Between Natural Language and Computer Language
Feature Selection
Language Identification
Clustering
Text mining process
● Text preprocessing
○ Syntactic/Semantic text analysis
● Features Generation
○ Bag of words
● Features Selection
○ Simple counting
○ Statistics
● Text/Data Mining
○ Classification - Supervised learning
○ Clustering - Unsupervised learning
● Analyzing results
○ Mapping/Visualization
○ Result interpretation
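The bag-of-words and simple-counting steps can be sketched as follows. This is a minimal illustration using only the Python standard library; the toy corpus and the helper name `bag_of_words` are assumptions made for this example and are not part of the slides.

```python
from collections import Counter

# Hypothetical toy corpus; any list of already-tokenized documents would do.
docs = [
    ["text", "mining", "finds", "patterns", "in", "text"],
    ["clustering", "groups", "similar", "text"],
]

# Fixed, sorted vocabulary so every document maps to a vector of the same length.
vocabulary = sorted({token for doc in docs for token in doc})

def bag_of_words(tokens, vocab):
    """Count how often each vocabulary word occurs in tokens (word order is discarded)."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

vectors = [bag_of_words(doc, vocabulary) for doc in docs]

for doc, vector in zip(docs, vectors):
    print(vector, "<-", " ".join(doc))
```

Each document becomes a fixed-length count vector, which is the form that classification and clustering algorithms usually expect as input.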
Remove HTML
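One possible way to remove HTML is sketched below with the standard library's `html.parser`. The class name `TextExtractor`, the helper `remove_html`, and the sample markup are assumptions made for illustration; other libraries could be used for the same step.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML document, dropping all tags."""

    def __init__(self):
        # convert_charrefs defaults to True, so entities such as &amp; are decoded.
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def remove_html(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    # Collapse the whitespace left behind where tags used to be.
    return " ".join("".join(parser.parts).split())

sample = "<p>Natural <b>Language</b> Processing &amp; text mining</p>"
clean = remove_html(sample)
print(clean)
```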
Lemmatization or Stemming
Case Folding
○ KOMPUTER
○ Komputer
○ KomPuTer
○ komPUTER
○ Komputer
Case Folding
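A minimal sketch of case folding on the spelling variants above, using Python's built-in `str.lower()`. The variable names are assumptions made for this example.

```python
# Spelling variants of the same word, as listed on the slide.
variants = ["KOMPUTER", "Komputer", "KomPuTer", "komPUTER"]

# Case folding maps every variant to one lowercase form.
folded = [word.lower() for word in variants]

# After folding, only a single distinct token remains.
unique = set(folded)
print(unique)
```

For text containing non-ASCII letters, `str.casefold()` is a more aggressive alternative to `str.lower()`.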
Regex
Punctuation
Tokenization
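The three steps above (regex, punctuation removal, tokenization) can be combined in a short sketch. It uses only the standard `re` module and plain whitespace splitting; the sample sentence is an assumption made for illustration, and this is not NLTK's tokenizer.

```python
import re

text = "Hello, world! NLTK makes text processing easier, doesn't it?"

# Replace every character that is neither a word character nor whitespace with a space.
no_punct = re.sub(r"[^\w\s]", " ", text)

# Whitespace tokenization: split() without arguments also collapses repeated spaces.
tokens = no_punct.split()
print(tokens)
```

Note that this simple rule splits "doesn't" into "doesn" and "t". Dedicated tokenizers, such as NLTK's `word_tokenize`, treat contractions differently, so the choice of tokenizer affects the resulting vocabulary.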
Stop Word Removal
● Basic concept
○ filtering out words with very low discrimination values
■ e.g., a, the, this, that, where, when, …
● Advantage
○ reduce the size of the indexing structure considerably
● Disadvantage
○ might reduce recall as well
■ e.g., “to be or not to be”
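The trade-off described above can be sketched as follows. The stop list here is a tiny, hand-picked set assumed for illustration (real stop lists, such as the ones shipped with NLTK, are much longer), and `remove_stop_words` is a hypothetical helper.

```python
# A tiny, hand-picked stop list for illustration only.
STOP_WORDS = {"a", "the", "this", "that", "where", "when", "to", "be", "or", "not"}

def remove_stop_words(tokens):
    """Keep only the tokens that are not in the stop list (comparison ignores case)."""
    return [token for token in tokens if token.lower() not in STOP_WORDS]

# Advantage: fewer tokens to index.
kept = remove_stop_words("this is the text that we want to index".split())

# Disadvantage: a query made only of stop words disappears completely.
lost = remove_stop_words("to be or not to be".split())

print(kept)
print(lost)
```

The second call returns an empty list, which shows why stop word removal can reduce recall: the phrase can no longer be matched at all.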
After Removal of Stop Words
Stemming Examples
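As a rough illustration of the idea behind stemming, the sketch below strips one common English suffix per word. This is a deliberately naive rule set assumed for this example; it is not the Porter algorithm or any stemmer shipped with NLTK.

```python
# Suffixes are checked in this order, longest first.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word):
    """Strip one common suffix, keeping at least three letters of the stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["connected", "connecting", "connects", "connection"]
stems = [naive_stem(word) for word in words]
print(stems)
```

The first three words collapse to "connect", while "connection" is left unchanged because this toy rule set has no rule for "ion". A full stemmer such as Porter's applies many more rules and would also reduce "connection" to "connect".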
Pre-given categories and labeled document examples (categories may form a hierarchy)
Classify new documents
A standard classification (supervised learning) problem
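The setup above (labeled examples in, a category for each new document out) can be sketched with a small multinomial Naive Bayes classifier. This is an assumption made to keep the example within the standard library; the slides name Support Vector Machine, which would normally come from an external library. The training sentences and the function names `fit` and `predict` are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labeled examples: (category, document text).
train = [
    ("sports", "the team won the match"),
    ("sports", "a great goal in the final match"),
    ("business", "the market and stock prices fell"),
    ("business", "the company reported strong profit"),
]

def fit(examples):
    """Count documents per class and word occurrences per class."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, text in examples:
        tokens = text.lower().split()
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab

def predict(text, model):
    """Return the class with the highest log-probability (add-one smoothing)."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    scores = {}
    for label in class_docs:
        total_words = sum(word_counts[label].values())
        score = math.log(class_docs[label] / total_docs)  # class prior
        for token in text.lower().split():
            if token not in vocab:
                continue  # ignore words never seen in training
            likelihood = (word_counts[label][token] + 1) / (total_words + len(vocab))
            score += math.log(likelihood)
        scores[label] = score
    return max(scores, key=scores.get)

model = fit(train)
label_1 = predict("stock market profit", model)
label_2 = predict("the final goal", model)
print(label_1, label_2)
```

With this toy training set, the first query is assigned to "business" and the second to "sports". Real systems need far more training data and proper evaluation on held-out documents.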
[Figure: a categorization system assigns new documents to categories such as Sports, Business, Education, Science, …]
A GRAPHICAL VIEW OF TEXT CLASSIFICATION
[Figure: documents grouped by class: Arch., Graphics, Theory, NLP, AI]
EXAMPLES OF TEXT CLASSIFICATION
● LABELS=BINARY
○ “spam” / “not spam”
● LABELS=TOPICS
○ “finance” / “sports” / “asia”
● LABELS=OPINION
○ “like” / “hate” / “neutral”
● LABELS=AUTHOR
○ “Shakespeare” / “Marlowe” / “Ben Jonson”
○ The Federalist Papers
Support Vector Machine
Any Questions?
Your Feedback Matters!
bit.ly/3hmJ3Nr