Alright — let’s go deep into Tokenization in NLP with all the main types, methods, and examples.
1. What is Tokenization?
Tokenization is the process of splitting text into smaller units called tokens, which can be:
• Words ("Hello", "world")
• Subwords ("play", "##ing")
• Characters ("H", "e", "l", "l", "o")
• Sentences ("I am here.", "You are there.")
It’s usually the first step in NLP tasks — it converts unstructured text into a sequence the model can
process.
2. Types of Tokenization
A. Sentence Tokenization
• Breaks text into sentences.
• Useful for summarization, translation, etc.
• Example:
Input: "I love NLP. It is amazing!"
Output: ["I love NLP.", "It is amazing!"]
B. Word Tokenization
• Splits text into words based on spaces and punctuation.
• Example:
Input: "I love NLP."
Output: ["I", "love", "NLP", "."]
C. Subword Tokenization
• Splits words into smaller units when the whole word isn’t in the vocabulary.
• Helps handle rare/unknown words.
• Example (WordPiece):
"unhappiness" → ["un", "##happiness"]
D. Character Tokenization
• Breaks text into individual characters.
• Example:
"cat" → ["c", "a", "t"]
3. Methods of Tokenization
1. Whitespace Tokenization
• Splits text based on spaces.
• Example:
"I love NLP" → ["I", "love", "NLP"]
• Pros: Simple and fast.
• Cons: Can't handle punctuation well.
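str.split() is all it takes, and it also shows the punctuation problem:
print("I love NLP".split())
# ['I', 'love', 'NLP']
print("I love NLP!".split())
# ['I', 'love', 'NLP!']  <- punctuation stays attached to the word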
2. Rule-based Tokenization
• Uses hand-written rules (regex, punctuation handling).
• Example (Regex):
import re
# \w+ matches runs of word characters; [^\w\s] matches single punctuation marks
re.findall(r"\w+|[^\w\s]", "I love NLP!")
# Output: ["I", "love", "NLP", "!"]
3. WordPiece Tokenization
• Used in BERT.
• Breaks rare words into subwords with ## prefix for continuations.
• Example:
"unhappiness" → ["un", "##happiness"]
"newword" → ["new", "##word"]
4. Byte-Pair Encoding (BPE)
• Merges the most frequent pair of tokens iteratively.
• Used in GPT-2, RoBERTa.
• Example:
Initial: ["l", "o", "w", " ", "e", "r"]
Merge ("l","o") → "lo"
Merge ("lo","w") → "low"
5. SentencePiece
• Treats the raw text as a sequence of characters (spaces included) and learns a subword vocabulary from it, using BPE or Unigram.
• Doesn’t require the text to be pre-split on whitespace, so it also works for languages written without spaces.
• Used in T5, ALBERT.
• Example:
"Hello world"
→ ["▁Hello", "▁world"]
(▁ represents a space.)
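A hedged sketch using the pretrained SentencePiece model shipped with T5 (assumes the transformers and sentencepiece packages are installed and the t5-small files can be downloaded):
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("t5-small")
print(tokenizer.tokenize("Hello world"))
# Expected something like: ['▁Hello', '▁world']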
6. Unigram Language Model Tokenizer
• Starts with a large candidate vocabulary and iteratively prunes the tokens that contribute least to the likelihood of the corpus.
• Used in XLNet and ALBERT (it is one of the algorithms implemented in SentencePiece).
• Example:
"playground" → ["play", "ground"]
If "playground" is rare → ["pla", "yground"]
4. Summary Table
Type          | Level     | Example Input       | Example Output            | Used In
------------- | --------- | ------------------- | ------------------------- | ----------------
Sentence      | Sentence  | "I love NLP. Wow!"  | ["I love NLP.", "Wow!"]   | NLTK
Word          | Word      | "I love NLP"        | ["I", "love", "NLP"]      | spaCy
Subword       | Subword   | "unhappiness"       | ["un", "##happiness"]     | BERT
Character     | Character | "cat"               | ["c", "a", "t"]           | OCR
Whitespace    | Word      | "Hello world"       | ["Hello", "world"]        | Simple baselines
WordPiece     | Subword   | "playing"           | ["play", "##ing"]         | BERT
BPE           | Subword   | "lower"             | ["low", "er"]             | GPT-2
SentencePiece | Subword   | "Hello world"       | ["▁Hello", "▁world"]      | T5
If you want, I can draw a diagram showing exactly how "unhappiness" would be tokenized by
whitespace, WordPiece, BPE, and SentencePiece so you can visually compare them. That’s the easiest
way to remember the differences.