We’ll Be Starting Shortly!

To help us run the workshop smoothly, kindly:

- Submit all questions using the Q&A function
- If you have an urgent request, please use the “Raise Hand” function

Introduction to Text Processing Using the Natural Language Toolkit (NLTK)
Dr. Rianto, S.Kom., M.Eng.
What Is Language?

● A way of communication between a speaker and a listener
Difference Between Natural Language and Computer Language

● Natural language vs. computer language
○ Ambiguous vs. unambiguous
○ Context-sensitive vs. context-free
○ Informal vs. formal
○ Descriptive vs. prescriptive
○ Unstructured vs. structured
○ Uncontrolled vs. controlled
Types of languages

● Natural languages
○ Also called:
■ Informal language
■ Unstructured language
■ Non-regular language
● Computer languages
○ Also called:
■ Formal language
■ Structured language
■ Regular language
Aspects of language processing

● Word, lexicon: lexical analysis


○ Morphology, word segmentation
● Syntax
○ Sentence structure, phrase, grammar, …
● Semantics
○ Meaning
○ Execute commands
● Discourse analysis
○ Meaning of a text
○ Relationship between sentences
Computational Linguistics
Computational linguistics is an interdisciplinary field
dealing with the statistical and/or rule-based modeling of
natural language from a computational perspective.
Applications

● Detect new words
● Language learning
● Machine translation
● NL interface
● Information retrieval
● …

Text Analysis

● Summarization
● Classification
● Feature selection
● Language identification
● Clustering
Text mining process

● Text preprocessing
○ Syntactic/semantic text analysis
● Feature generation
○ Bag of words
● Feature selection
○ Simple counting
○ Statistics
● Text/data mining
○ Classification: supervised learning
○ Clustering: unsupervised learning
● Analyzing results
○ Mapping/visualization
○ Result interpretation

This is an iterative and interactive process.
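
The feature generation and simple counting steps can be sketched with the standard library alone. This is a minimal illustration, assuming whitespace tokenization and two made-up example documents; the helper name `bag_of_words` is hypothetical and not part of NLTK.

```python
from collections import Counter

def bag_of_words(text):
    """Map each lowercase, whitespace-separated word to its count."""
    return Counter(text.lower().split())

# Two toy documents (illustrative data, not from the workshop).
docs = ["the cat sat on the mat", "the dog sat"]
bags = [bag_of_words(d) for d in docs]

# Shared vocabulary, sorted so every document uses the same column order.
vocabulary = sorted(set().union(*bags))

# One count vector per document; a missing word counts as 0.
vectors = [[bag[word] for word in vocabulary] for bag in bags]

print(vocabulary)
print(vectors)
```

Word order is discarded on purpose: only the counts survive, which is what makes the representation a "bag".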


Introduction to NLTK

● The Natural Language Toolkit (NLTK) provides:
○ Basic classes for representing data relevant to natural language processing.
○ Standard interfaces for performing tasks, such as tokenization, tagging, and parsing.
○ Standard implementations of each task, which can be combined to solve complex problems.
Preprocessing

● Remove HTML
● Tokenization + remove punctuation
● Remove stop words and other useless words
● Lemmatization or stemming
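
The first step, removing HTML, can be sketched with Python's built-in `html.parser` module. This is one possible approach, not necessarily the one used in the workshop; the class name `TextExtractor` and the function `remove_html` are hypothetical helpers introduced for illustration.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the text between tags, discarding the markup itself."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def remove_html(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    # Collapse runs of whitespace left behind by the removed tags.
    return " ".join("".join(parser.parts).split())

clean = remove_html("<p>NLTK is a <b>toolkit</b> for text processing.</p>")
print(clean)
```

This sketch keeps the contents of every element, including `<script>` and `<style>`, so real pages may need extra filtering.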
Case Folding

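A process applied to a sequence of characters in which every case variant is mapped to a single form. In text preprocessing this is usually lowercase (folding to uppercase works equally well, as long as it is applied consistently), so that all of the following are treated as the same word: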

○ KOMPUTER
○ Komputer
○ KomPuTer
○ komPUTER
○ Komputer
Case Folding
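
A minimal sketch of case folding in plain Python, assuming the common convention of folding to lowercase (`str.upper()` would serve equally well if used consistently). The word list reuses the variants shown on the previous slide.

```python
variants = ["KOMPUTER", "Komputer", "KomPuTer", "komPUTER"]

# Fold every variant to lowercase so they compare equal.
folded = [word.lower() for word in variants]

# After folding, only one distinct form remains.
unique = set(folded)
print(unique)
```

For text that contains non-ASCII letters, `str.casefold()` is a more aggressive alternative to `str.lower()`.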
Regex
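
A minimal sketch of regex-based cleanup with Python's `re` module, assuming the goal is to strip digits and collapse repeated whitespace. The patterns and the sample sentence are illustrative choices, not taken from the workshop.

```python
import re

text = "NLTK 3 was used on 2 corpora   with 31 slides"

# Remove every run of digits.
no_digits = re.sub(r"\d+", "", text)

# Collapse the leftover whitespace into single spaces.
cleaned = re.sub(r"\s+", " ", no_digits).strip()
print(cleaned)
```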
Punctuation
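
A minimal sketch of punctuation removal, assuming only ASCII punctuation (the characters in `string.punctuation`) needs to be stripped. Typographic quotes and other Unicode punctuation would need to be added to the table separately.

```python
import string

text = "Hello, world! Is this NLTK?"

# Build a translation table that deletes every ASCII punctuation character.
table = str.maketrans("", "", string.punctuation)
stripped = text.translate(table)
print(stripped)
```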
Tokenization

The simplest way to represent a text is with a single string.

It is difficult to process text in this format.

Often, it is more convenient to work with a list of tokens.

The task of converting a text from a single string to a list of tokens is known as tokenization.
Tokenizing
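
A minimal sketch of tokenization with NLTK, assuming the `nltk` package is installed. It uses `wordpunct_tokenize`, a regex-based tokenizer that needs no extra data files; NLTK's `word_tokenize` is a common alternative but requires downloading a tokenizer model first.

```python
from nltk.tokenize import wordpunct_tokenize

text = "Often, it is more convenient to work with a list of tokens."

# Split the string into alphanumeric tokens and punctuation tokens.
tokens = wordpunct_tokenize(text)
print(tokens)
```

Punctuation marks come out as separate tokens, so they can be filtered in a later step.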
Elimination of Stopwords

● Basic concept
○ filtering out words with very low discrimination values
■ ex) a, the, this, that, where, when, ….
● Advantage
○ reduce the size of the indexing structure considerably
● Disadvantage
○ might reduce recall as well
■ ex) “to be or not to be”
After Removal of Stop Words

With STOP words removed, the text might look like:
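
As a hedged illustration, the sketch below filters tokens against a small hand-picked stop list. The list `STOP_WORDS` and the function `remove_stop_words` are assumptions made for this example; NLTK also ships stop word lists in `nltk.corpus.stopwords`, which must be downloaded before use.

```python
# A tiny illustrative stop list, not NLTK's own.
STOP_WORDS = {"a", "the", "this", "that", "where", "when",
              "to", "be", "or", "not", "is", "of"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop list."""
    return [t for t in tokens if t.lower() not in stop_words]

kept = remove_stop_words("the size of the indexing structure is reduced".split())
lost = remove_stop_words("to be or not to be".split())

print(kept)
print(lost)
```

The second call shows the recall problem: a phrase made entirely of stop words disappears.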


Stemming

● What is the “stem”?
○ The portion of a word that is left after the removal of its affixes (i.e., prefixes and suffixes)
○ ex) ‘connect’ is the stem for the variants ‘connected’, ‘connecting’, ‘connection’, ‘connections’
● Effect of stemming
○ Reduces variants of the same root to a common concept
○ Reduces the size of the indexing structure
○ There is controversy about the benefits of stemming
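
The ‘connect’ example can be reproduced with NLTK's Porter stemmer. This is a minimal sketch, assuming the `nltk` package is installed; Porter is one of several stemmers NLTK provides and was picked here only for illustration.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
variants = ["connected", "connecting", "connection", "connections"]

# Map each variant to the stem the Porter algorithm produces.
stems = {word: stemmer.stem(word) for word in variants}
print(stems)
```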
Stopword Removal
Stemming Examples

BIG: BIG, BIGGER, BIGGEST


REACH: REACH, REACHES, REACHED, REACHING
WORK: WORK, WORKS, WORKED, WORKING

CHILD: CHILD, CHILDREN


KNIFE: KNIFE, KNIVES

PERRO: PERRO, PERRA (Spanish, male and female dog)
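
The sketch below runs some of these groups through NLTK's Porter stemmer, assuming `nltk` is installed. Suffix-stripping stemmers generally handle regular inflections such as REACHES or WORKED; irregular forms such as CHILDREN or KNIVES usually need a dictionary-based lemmatizer, so their output is printed without any claim about the result. The helper `stem_set` is hypothetical.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem_set(words):
    """Return the set of distinct stems produced for a group of words."""
    return {stemmer.stem(word.lower()) for word in words}

groups = {
    "REACH": ["REACH", "REACHES", "REACHED", "REACHING"],
    "WORK": ["WORK", "WORKS", "WORKED", "WORKING"],
    "CHILD": ["CHILD", "CHILDREN"],
    "KNIFE": ["KNIFE", "KNIVES"],
}

for root, words in groups.items():
    # A single stem per group means the variants were merged.
    print(root, "->", sorted(stem_set(words)))
```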


Text Classification

● Pre-given categories and labeled document examples (categories may form a hierarchy)
● Classify new documents
● A standard classification (supervised learning) problem

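[Figure: labeled example documents in categories such as Sports, Business, and Education are fed to a categorization system, which assigns new documents to categories such as Sports, Business, Education, and Science.]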
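
The supervised setup can be sketched in a few lines: learn from labeled examples, then label a new document. The scoring rule here (sum of per-category word counts) is a deliberately simple stand-in chosen for illustration, not an SVM and not the workshop's method. The training sentences and the names `train` and `classify` are assumptions.

```python
from collections import Counter, defaultdict

def train(labeled_docs):
    """Count how often each word occurs in each category."""
    counts = defaultdict(Counter)
    for text, label in labeled_docs:
        counts[label].update(text.lower().split())
    return counts

def classify(text, counts):
    """Pick the category whose training words overlap most with the text."""
    tokens = text.lower().split()
    scores = {label: sum(c[t] for t in tokens) for label, c in counts.items()}
    return max(scores, key=scores.get)

# Toy labeled examples (illustrative data only).
training = [
    ("team won match", "Sports"),
    ("striker scored a late goal", "Sports"),
    ("shares fell as market closed", "Business"),
    ("company reported higher profit", "Business"),
    ("students sat final exam", "Education"),
    ("school hired a new teacher", "Education"),
]

model = train(training)
print(classify("market profit for a company", model))
print(classify("a teacher and students", model))
```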
A Graphical View of Text Classification

[Figure: documents grouped by topic, with regions labeled Arch., Graphics, Theory, NLP, and AI.]
Examples of Text Classification

● LABELS=BINARY
○ “spam” / “not spam”
● LABELS=TOPICS
○ “finance” / “sports” / “asia”
● LABELS=OPINION
○ “like” / “hate” / “neutral”
● LABELS=AUTHOR
○ “Shakespeare” / “Marlowe” / “Ben Jonson”
○ The Federalist papers
Support Vector Machine

● SVM: A Large-Margin Classifier


○ Linear SVM
○ Kernel Trick
○ Fast implementation: SMO
● SVM for Text Classification
○ Multi-class Classification
○ Multi-label Classification
○ Hierarchical Classification Tool
What is a Good Decision Boundary?

● Consider a two-class, linearly separable classification problem
● Many decision boundaries!
○ The Perceptron algorithm can be used to find such a boundary
● Are all decision boundaries equally good?

[Figure: points from Class 1 and Class 2.]
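
A minimal sketch of the Perceptron algorithm on a two-class, linearly separable toy problem, using made-up 2D points with labels -1 and +1. It finds some separating boundary, not necessarily the one with the largest margin, which is the gap that a large-margin classifier such as an SVM addresses. The function names are hypothetical.

```python
def train_perceptron(points, labels, epochs=100):
    """Learn weights w and bias b so that sign(w . x + b) matches each label."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for (x1, x2), y in zip(points, labels):
            activation = w[0] * x1 + w[1] * x2 + b
            if y * activation <= 0:  # misclassified or on the boundary
                w[0] += y * x1
                w[1] += y * x2
                b += y
                mistakes += 1
        if mistakes == 0:  # a full pass with no errors: converged
            break
    return w, b

def predict(point, w, b):
    return 1 if w[0] * point[0] + w[1] * point[1] + b > 0 else -1

# Toy data: one class near the origin, the other further out.
points = [(1.0, 1.0), (2.0, 1.5), (1.5, 0.5), (4.0, 4.5), (5.0, 4.0), (4.5, 5.5)]
labels = [-1, -1, -1, 1, 1, 1]

w, b = train_perceptron(points, labels)
print(w, b)
```

Presenting the points in a different order can yield a different, equally valid boundary, which is why the question of which boundary is best arises.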
Any Questions?

Your Feedback Matters!

bit.ly/3hmJ3Nr
