Tokenization in NLP

Tokenization in NLP is the process of splitting text into smaller units called tokens, such as words, subwords, characters, or sentences, and is essential for processing unstructured text. There are various types of tokenization, including sentence, word, subword, and character tokenization, each with specific methods like whitespace, rule-based, WordPiece, Byte-Pair Encoding (BPE), SentencePiece, and Unigram Language Model. Each method has its advantages and is used in different NLP models, with examples illustrating how specific words or phrases are tokenized.

Alright — let’s go deep into Tokenization in NLP with all the main types, methods, and examples.

1. What is Tokenization?
Tokenization is the process of splitting text into smaller units called tokens, which can be:
• Words ("Hello", "world")
• Subwords ("play", "##ing")
• Characters ("H", "e", "l", "l", "o")
• Sentences ("I am here.", "You are there.")
It’s usually the first step in NLP tasks — it converts unstructured text into a sequence the model can
process.

2. Types of Tokenization
A. Sentence Tokenization
• Breaks text into sentences.
• Useful for summarization, translation, etc.
• Example:
Input: "I love NLP. It is amazing!"
Output: ["I love NLP.", "It is amazing!"]
B. Word Tokenization
• Splits text into words based on spaces and punctuation.
• Example:
Input: "I love NLP."
Output: ["I", "love", "NLP", "."]
C. Subword Tokenization
• Splits words into smaller units when the whole word isn’t in the vocabulary.
• Helps handle rare/unknown words.
• Example (WordPiece):
"unhappiness" → ["un", "##happiness"]
D. Character Tokenization
• Breaks text into individual characters.
• Example:
"cat" → ["c", "a", "t"]

3. Methods of Tokenization
1. Whitespace Tokenization
• Splits text based on spaces.
• Example:
"I love NLP" → ["I", "love", "NLP"]
• Pros: Simple and fast.
• Cons: Can't handle punctuation well.
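
In Python this is str.split(), which also collapses repeated spaces:

print("I love NLP".split())   # ['I', 'love', 'NLP']
print("I love NLP!".split())  # ['I', 'love', 'NLP!'] <- punctuation stays attached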

2. Rule-based Tokenization
• Uses hand-written rules (regex, punctuation handling).
• Example (Regex):
import re
re.findall(r"\w+|[^\w\s]", "I love NLP!")
# Output: ["I", "love", "NLP", "!"]

3. WordPiece Tokenization
• Used in BERT.
• Breaks rare words into subwords with ## prefix for continuations.
• Example:
"unhappiness" → ["un", "##happiness"]
"newword" → ["new", "##word"]

4. Byte-Pair Encoding (BPE)
• Iteratively merges the most frequent adjacent pair of tokens.
• Used in GPT-2, RoBERTa.
• Example:
Initial: ["l", "o", "w", "e", "r"]
Merge ("l","o") → "lo" → ["lo", "w", "e", "r"]
Merge ("lo","w") → "low" → ["low", "e", "r"]

5. SentencePiece
• Treats the text as a sequence of characters (including spaces) and learns a vocabulary.
• Doesn’t require whitespace-separated input, so it also works for languages written without spaces.
• Used in T5, ALBERT.
• Example:
"Hello world"
→ ["▁Hello", "▁world"]
(▁ represents a space.)
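
A sketch of how to see this output, assuming the transformers and sentencepiece packages and the t5-small checkpoint (whose tokenizer is a SentencePiece model):

# Sketch: SentencePiece output via T5's pretrained tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
print(tokenizer.tokenize("Hello world"))  # e.g. ['▁Hello', '▁world']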

6. Unigram Language Model Tokenizer
• Starts with a large candidate vocabulary and iteratively prunes the tokens whose removal hurts the corpus likelihood the least.
• Used in XLNet and ALBERT; it is the default algorithm in the SentencePiece library.
• Example (illustrative):
"playground" → ["play", "ground"]
If those pieces are rare in the corpus, another split may score higher, e.g. ["pla", "yground"]

4. Summary Table

Type          | Level     | Example Input       | Example Output           | Used In
--------------|-----------|---------------------|--------------------------|-----------------
Sentence      | Sentence  | "I love NLP. Wow!"  | ["I love NLP.", "Wow!"]  | NLTK
Word          | Word      | "I love NLP"        | ["I", "love", "NLP"]     | spaCy
Subword       | Subword   | "unhappiness"       | ["un", "##happiness"]    | BERT
Character     | Character | "cat"               | ["c", "a", "t"]          | OCR
Whitespace    | Word      | "Hello world"       | ["Hello", "world"]       | Simple baselines
WordPiece     | Subword   | "playing"           | ["play", "##ing"]        | BERT
BPE           | Subword   | "lower"             | ["low", "er"]            | GPT-2
SentencePiece | Subword   | "Hello world"       | ["▁Hello", "▁world"]     | T5

If you want, I can draw a diagram showing exactly how "unhappiness" would be tokenized by
whitespace, WordPiece, BPE, and SentencePiece so you can visually compare them. That’s the easiest
way to remember the differences.
