Alright — let’s go deep into Tokenization in NLP with all the main types, methods, and examples.
1. What is Tokenization?
Tokenization is the process of splitting text into smaller units called tokens, which can be:
• Words ("Hello", "world")
• Subwords ("play", "##ing")
• Characters ("H", "e", "l", "l", "o")
• Sentences ("I am here.", "You are there.")
It’s usually the first step in NLP tasks — it converts unstructured text into a sequence the model can
process.
2. Types of Tokenization
A. Sentence Tokenization
• Breaks text into sentences.
• Useful for summarization, translation, etc.
• Example:
Input: "I love NLP. It is amazing!"
Output: ["I love NLP.", "It is amazing!"]
B. Word Tokenization
• Splits text into words based on spaces and punctuation.
• Example:
Input: "I love NLP."
Output: ["I", "love", "NLP", "."]
C. Subword Tokenization
• Splits words into smaller units when the whole word isn’t in the vocabulary.
• Helps handle rare/unknown words.
• Example (WordPiece):
"unhappiness" → ["un", "##happiness"]
D. Character Tokenization
• Breaks text into individual characters.
• Example:
"cat" → ["c", "a", "t"]
3. Methods of Tokenization
1. Whitespace Tokenization
• Splits text based on spaces.
• Example:
"I love NLP" → ["I", "love", "NLP"]
• Pros: Simple and fast.
• Cons: Can't handle punctuation well.
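str.split() is all it takes, and it also shows the punctuation problem:
print("I love NLP".split())
# ['I', 'love', 'NLP']
print("I love NLP!".split())
# ['I', 'love', 'NLP!']  <- punctuation stays attached to the word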
2. Rule-based Tokenization
• Uses hand-written rules (regex, punctuation handling).
• Example (Regex):
import re
# \w+ matches runs of word characters; [^\w\s] matches single punctuation marks
re.findall(r"\w+|[^\w\s]", "I love NLP!")
# Output: ["I", "love", "NLP", "!"]
3. WordPiece Tokenization
• Used in BERT.
• Breaks rare words into subwords with ## prefix for continuations.
• Example:
"unhappiness" → ["un", "##happiness"]
"newword" → ["new", "##word"]
4. Byte-Pair Encoding (BPE)
• Merges the most frequent pair of tokens iteratively.
• Used in GPT-2, RoBERTa.
• Example:
Initial: ["l", "o", "w", " ", "e", "r"]
Merge ("l","o") → "lo"
Merge ("lo","w") → "low"
5. SentencePiece
• Treats the raw text as a sequence of characters (spaces included) and learns a subword vocabulary from it, using BPE or Unigram.
• Doesn’t require the text to be pre-split on whitespace, so it also works for languages written without spaces.
• Used in T5, ALBERT.
• Example:
"Hello world"
→ ["▁Hello", "▁world"]
(▁ represents a space.)
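A hedged sketch using the pretrained SentencePiece model shipped with T5 (assumes the transformers and sentencepiece packages are installed and the t5-small files can be downloaded):
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("t5-small")
print(tokenizer.tokenize("Hello world"))
# Expected something like: ['▁Hello', '▁world']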
6. Unigram Language Model Tokenizer
• Starts with a large candidate vocabulary and iteratively prunes the tokens that contribute least to the likelihood of the corpus.
• Used in XLNet and ALBERT (it is one of the algorithms implemented in SentencePiece).
• Example:
"playground" → ["play", "ground"]
If "playground" is rare → ["pla", "yground"]
4. Summary Table
Type          | Level     | Example Input       | Example Output            | Used In
------------- | --------- | ------------------- | ------------------------- | ----------------
Sentence      | Sentence  | "I love NLP. Wow!"  | ["I love NLP.", "Wow!"]   | NLTK
Word          | Word      | "I love NLP"        | ["I", "love", "NLP"]      | spaCy
Subword       | Subword   | "unhappiness"       | ["un", "##happiness"]     | BERT
Character     | Character | "cat"               | ["c", "a", "t"]           | OCR
Whitespace    | Word      | "Hello world"       | ["Hello", "world"]        | Simple baselines
WordPiece     | Subword   | "playing"           | ["play", "##ing"]         | BERT
BPE           | Subword   | "lower"             | ["low", "er"]             | GPT-2
SentencePiece | Subword   | "Hello world"       | ["▁Hello", "▁world"]      | T5
If you want, I can draw a diagram showing exactly how "unhappiness" would be tokenized by
whitespace, WordPiece, BPE, and SentencePiece so you can visually compare them. That’s the easiest
way to remember the differences.