Tokenization Algorithms Deep Dive
Comprehensive explanation of BPE, WordPiece, and Unigram algorithms.
Byte-Pair Encoding (BPE)
Algorithm overview
BPE iteratively merges the most frequent pair of tokens in a corpus.
Training process:
- Initialize vocabulary with all characters
- Count frequency of all adjacent token pairs
- Merge most frequent pair into new token
- Add new token to vocabulary
- Update corpus with new token
- Repeat until vocabulary size reached
Step-by-step example
Corpus:
low: 5
lower: 2
newest: 6
widest: 3
Iteration 1:
Count pairs:
'e' + 's': 9 (newest: 6, widest: 3) ← most frequent
'l' + 'o': 7
'o' + 'w': 7
...
Merge: 'e' + 's' → 'es'
Updated corpus:
low: 5
lower: 2
newest: 6 → n e w es t
widest: 3 → w i d es t
Vocabulary: [a-z] + ['es']
Iteration 2:
Count pairs:
'es' + 't': 9 ← most frequent
'l' + 'o': 7
...
Merge: 'es' + 't' → 'est'
Updated corpus:
low: 5
lower: 2
newest: 6 → n e w est
widest: 3 → w i d est
Vocabulary: [a-z] + ['es', 'est']
Continue until desired vocabulary size...
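The merge loop above is small enough to sketch from scratch. The following is illustrative code only (not any particular library), reproducing the two iterations on the toy corpus:
from collections import Counter

# Toy corpus from the example: tokenized word -> frequency
corpus = {('l', 'o', 'w'): 5, ('l', 'o', 'w', 'e', 'r'): 2,
          ('n', 'e', 'w', 'e', 's', 't'): 6, ('w', 'i', 'd', 'e', 's', 't'): 3}

def count_pairs(corpus):
    # Count adjacent token pairs, weighted by word frequency
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(corpus, pair):
    # Replace every occurrence of the pair with a single merged token
    merged = {}
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

merges = []
for _ in range(2):                    # two iterations, as in the example above
    pairs = count_pairs(corpus)
    best = max(pairs, key=pairs.get)  # ties broken by first occurrence
    merges.append(best)
    corpus = merge_pair(corpus, best)

print(merges)  # [('e', 's'), ('es', 't')]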
Tokenization with trained BPE
Given vocabulary: ['l', 'o', 'w', 'e', 'r', 'n', 's', 't', 'i', 'd', 'es', 'est', 'lo', 'low', 'ne', 'new', 'newest', 'wi', 'wid', 'widest']
Tokenize "lowest":
Step 1: Split into characters
['l', 'o', 'w', 'e', 's', 't']
Step 2: Apply merges in order learned during training
- Merge 'l' + 'o' → 'lo' (if this merge was learned)
- Merge 'lo' + 'w' → 'low' (if learned)
- Merge 'e' + 's' → 'es' (learned)
- Merge 'es' + 't' → 'est' (learned)
Final: ['low', 'est']
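A minimal sketch of this encoding step, assuming the merge rules are kept as an ordered list (the hand-written merges list below matches the example):
def bpe_encode(word, merges):
    # Apply learned merges, in training order, to a new word
    tokens = list(word)                 # start from characters
    for a, b in merges:
        i, out = 0, []
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens

merges = [('e', 's'), ('es', 't'), ('l', 'o'), ('lo', 'w')]  # example merge order
print(bpe_encode("lowest", merges))   # ['low', 'est']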
Implementation
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace
# Initialize
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
# Configure trainer
trainer = BpeTrainer(
    vocab_size=1000,
    min_frequency=2,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
)
# Train
corpus = [
    "This is a sample corpus for BPE training.",
    "BPE learns subword units from the training data.",
    # ... more sentences
]
tokenizer.train_from_iterator(corpus, trainer=trainer)
# Use
output = tokenizer.encode("This is tokenization")
print(output.tokens) # ['This', 'is', 'token', 'ization']
Byte-level BPE (GPT-2 variant)
Problem: Character-level BPE needs a base vocabulary entry for every character it may encounter (Unicode defines well over 100,000), and any character unseen during training becomes [UNK].
Solution: Operate on bytes instead of characters; the base vocabulary is exactly 256 symbols and every string can be represented.
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.decoders import ByteLevel as ByteLevelDecoder
tokenizer = Tokenizer(BPE())
# Byte-level pre-tokenization
tokenizer.pre_tokenizer = ByteLevel()
tokenizer.decoder = ByteLevelDecoder()
# After training, this handles ALL possible characters, including emojis
text = "Hello 🌍 世界"
tokens = tokenizer.encode(text).tokens
Advantages:
- Handles any Unicode character (everything decomposes into the 256 possible bytes)
- No unknown tokens (worst case: bytes)
- Used by GPT-2, GPT-3, BART
Trade-offs:
- Slightly worse compression (bytes vs characters)
- More tokens for non-ASCII text
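The extra cost for non-ASCII text follows directly from UTF-8: byte-level BPE starts from one symbol per byte, and non-ASCII characters span several bytes. A quick check with plain Python, no tokenizer involved:
for s in ["Hello", "🌍", "世界"]:
    print(s, "->", len(s), "chars,", len(s.encode("utf-8")), "bytes")
# Hello -> 5 chars, 5 bytes
# 🌍 -> 1 chars, 4 bytes
# 世界 -> 2 chars, 6 bytes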
BPE variants
SentencePiece BPE:
- Language-independent (no whitespace pre-tokenization required)
- Treats the input as a raw character stream and marks spaces with ▁
- SentencePiece also offers a Unigram mode, which is what T5, ALBERT, and XLNet use (see below)
BPE-dropout:
- Randomly skips some merges when segmenting text
- Yields varied segmentations of the same word (subword regularization)
- Makes the downstream model more robust and less overfit to a single segmentation
WordPiece
Algorithm overview
WordPiece is similar to BPE but uses a different merge selection criterion.
Training process:
- Initialize vocabulary with all characters
- Count frequency of all token pairs
- Score each pair: score = freq(pair) / (freq(first) × freq(second))
- Merge the pair with the highest score
- Repeat until vocabulary size reached
Why different scoring?
BPE: Merges most frequent pairs
- "aa" appears 100 times → high priority
- Even if 'a' appears 1000 times alone
WordPiece: Merges pairs that are strongly associated (appear together more often than their parts appear separately)
- "aa" appears 100 times, 'a' appears 1000 times → low score (100 / (1000 × 1000))
- "th" appears 50 times, 't' appears 60 times, 'h' appears 55 times → high score (50 / (60 × 55))
- Prioritizes pairs that appear together more than expected
Step-by-step example
Corpus:
low: 5
lower: 2
newest: 6
widest: 3
Iteration 1:
Count token frequencies:
'e': 17 (lower: 2, newest: 2 × 6, widest: 3)
'w': 16
's': 9
't': 9
'l': 7
'o': 7
'i': 3
'd': 3
...
Count pairs:
'e' + 's': 9 (newest: 6, widest: 3)
's' + 't': 9 (newest: 6, widest: 3)
'l' + 'o': 7
'i' + 'd': 3
...
Compute scores:
score('e' + 's') = 9 / (17 × 9) ≈ 0.059
score('s' + 't') = 9 / (9 × 9) ≈ 0.111
score('l' + 'o') = 7 / (7 × 7) ≈ 0.143
score('i' + 'd') = 3 / (3 × 3) ≈ 0.333 ← highest score
Choose: 'i' + 'd' → 'id', even though 'e' + 's' and 's' + 't' are the most frequent pairs
Key difference: WordPiece favors pairs whose parts rarely occur outside the pair, not simply the most frequent pairs.
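A small sketch that recomputes these frequencies and scores from the toy corpus (illustrative only, not the actual WordPiece trainer):
from collections import Counter

corpus = {('l', 'o', 'w'): 5, ('l', 'o', 'w', 'e', 'r'): 2,
          ('n', 'e', 'w', 'e', 's', 't'): 6, ('w', 'i', 'd', 'e', 's', 't'): 3}

tokens, pairs = Counter(), Counter()
for word, freq in corpus.items():
    for tok in word:
        tokens[tok] += freq               # individual token frequencies
    for a, b in zip(word, word[1:]):
        pairs[(a, b)] += freq             # adjacent pair frequencies

scores = {p: pairs[p] / (tokens[p[0]] * tokens[p[1]]) for p in pairs}
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1])[:3]:
    print(pair, round(score, 3))
# ('i', 'd') 0.333
# ('l', 'o') 0.143
# ('s', 't') 0.111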
Tokenization with WordPiece
Given vocabulary: ['l', 'o', 'w', 'new', 'low', '##e', '##s', '##t', '##est']
Tokenize "lowest":
Step 1: Find longest matching prefix
'lowest' → 'low' (matches)
Step 2: Find longest match for the remainder (non-initial pieces carry the ## prefix)
'est' → '##est' (matches)
Final: ['low', '##est']
If no match:
Tokenize "unknownword":
'unknownword' → no match
'unknown' → no match
'unkn' → no match
'un' → no match
'u' → no match
→ [UNK]
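A minimal sketch of this greedy longest-match-first lookup, assuming a BERT-style vocabulary with the ## continuation prefix (the max_chars cutoff mirrors BERT's behaviour of mapping very long words to [UNK]):
def wordpiece_encode(word, vocab, unk="[UNK]", max_chars=100):
    # Greedy longest-match-first WordPiece encoding of a single word
    if len(word) > max_chars:
        return [unk]
    tokens, start = [], 0
    while start < len(word):
        end, cur = len(word), None
        # Find the longest vocabulary entry matching word[start:end]
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece      # non-initial pieces carry the ## prefix
            if piece in vocab:
                cur = piece
                break
            end -= 1
        if cur is None:
            return [unk]                  # any unmatched span makes the whole word [UNK]
        tokens.append(cur)
        start = end
    return tokens

vocab = {'l', 'o', 'w', 'new', 'low', '##e', '##s', '##t', '##est'}
print(wordpiece_encode("lowest", vocab))        # ['low', '##est']
print(wordpiece_encode("unknownword", vocab))   # ['[UNK]']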
Implementation
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer
from tokenizers.normalizers import BertNormalizer
from tokenizers.pre_tokenizers import BertPreTokenizer
# Initialize BERT-style tokenizer
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
# Normalization (lowercase, accent stripping)
tokenizer.normalizer = BertNormalizer(lowercase=True)
# Pre-tokenization (whitespace + punctuation)
tokenizer.pre_tokenizer = BertPreTokenizer()
# Configure trainer
trainer = WordPieceTrainer(
    vocab_size=30522,  # BERT vocab size
    min_frequency=2,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    continuing_subword_prefix="##"  # BERT uses ##
)
# Train
tokenizer.train_from_iterator(corpus, trainer=trainer)
# Use
output = tokenizer.encode("Tokenization works great!")
print(output.tokens) # ['token', '##ization', 'works', 'great', '!']
Subword prefix
BERT uses ## prefix:
"unbelievable" → ['un', '##believ', '##able']
Why?
- Indicates token is a continuation
- Allows reconstruction: remove ##, concatenate
- Helps model distinguish word boundaries
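Reconstruction is then a one-line loop; a minimal sketch:
def wordpiece_detokenize(tokens):
    # '##' pieces attach to the previous token; other tokens start a new word
    text = ""
    for tok in tokens:
        text += tok[2:] if tok.startswith("##") else (" " + tok if text else tok)
    return text

print(wordpiece_detokenize(['un', '##believ', '##able']))   # unbelievable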
WordPiece advantages
Semantic merges:
- Prioritizes meaningful combinations
- "qu" has high score (always together)
- "qx" has low score (rare combination)
Better for morphology:
- Captures affixes: un-, -ing, -ed
- Preserves word stems
Trade-offs:
- Slower training than BPE
- More memory (stores vocabulary, not merges)
- Google's original implementation is not open-source (Hugging Face provides a reimplementation)
Unigram
Algorithm overview
Unigram works backward: start with large vocabulary, remove tokens.
Training process:
- Initialize with large vocabulary (all substrings)
- Estimate probability of each token (frequency-based)
- For each token, compute loss increase if removed
- Remove 10-20% of tokens with lowest loss impact
- Re-estimate probabilities
- Repeat until desired vocabulary size
Probabilistic tokenization
Unigram assumption: Each token is independent.
Given vocabulary with probabilities:
P('low') = 0.02
P('l') = 0.01
P('o') = 0.015
P('w') = 0.01
P('est') = 0.03
P('e') = 0.02
P('s') = 0.015
P('t') = 0.015
Tokenize "lowest":
Option 1: ['low', 'est']
P = P('low') × P('est') = 0.02 × 0.03 = 0.0006
Option 2: ['l', 'o', 'w', 'est']
P = 0.01 × 0.015 × 0.01 × 0.03 = 0.000000045
Option 3: ['low', 'e', 's', 't']
P = 0.02 × 0.02 × 0.015 × 0.015 = 0.00000009
Choose option 1 (highest probability)
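The same arithmetic as a quick sketch:
from math import prod

probs = {'low': 0.02, 'l': 0.01, 'o': 0.015, 'w': 0.01,
         'est': 0.03, 'e': 0.02, 's': 0.015, 't': 0.015}

for option in [['low', 'est'], ['l', 'o', 'w', 'est'], ['low', 'e', 's', 't']]:
    print(option, f"{prod(probs[t] for t in option):.2e}")
# ['low', 'est'] 6.00e-04
# ['l', 'o', 'w', 'est'] 4.50e-08
# ['low', 'e', 's', 't'] 9.00e-08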
Viterbi algorithm
Finding best tokenization is expensive (exponential possibilities).
Viterbi algorithm (dynamic programming):
from math import log

def tokenize_viterbi(word, vocab, probs):
    n = len(word)
    # dp[i] = (best_log_prob, best_tokens) for word[:i]
    dp = [(float('-inf'), []) for _ in range(n + 1)]
    dp[0] = (0.0, [])  # empty prefix has log probability 0
    for i in range(1, n + 1):
        best_prob = float('-inf')
        best_tokens = []
        # Try all possible last tokens ending at position i
        for j in range(i):
            token = word[j:i]
            if token in vocab and dp[j][0] > float('-inf'):
                prob = dp[j][0] + log(probs[token])
                if prob > best_prob:
                    best_prob = prob
                    best_tokens = dp[j][1] + [token]
        dp[i] = (best_prob, best_tokens)
    return dp[n][1]
Time complexity: O(n²) candidate substrings, each checked with a hash-based vocabulary lookup, versus O(2^n) possible segmentations for brute force.
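Using the probabilities from the example above, the function recovers the best segmentation:
probs = {'low': 0.02, 'l': 0.01, 'o': 0.015, 'w': 0.01,
         'est': 0.03, 'e': 0.02, 's': 0.015, 't': 0.015}
print(tokenize_viterbi("lowest", set(probs), probs))  # ['low', 'est']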
Implementation
from tokenizers import Tokenizer
from tokenizers.models import Unigram
from tokenizers.trainers import UnigramTrainer
from tokenizers.pre_tokenizers import Metaspace
# Initialize
tokenizer = Tokenizer(Unigram())
# SentencePiece-style pre-tokenization so spaces become ▁ (as in the output below)
tokenizer.pre_tokenizer = Metaspace()
# Configure trainer
trainer = UnigramTrainer(
    vocab_size=8000,
    special_tokens=["<unk>", "<s>", "</s>"],
    unk_token="<unk>",
    max_piece_length=16,    # Max token length
    n_sub_iterations=2,     # EM iterations
    shrinking_factor=0.75   # Remove 25% each iteration
)
# Train
tokenizer.train_from_iterator(corpus, trainer=trainer)
# Use
output = tokenizer.encode("Tokenization with Unigram")
print(output.tokens) # ['▁Token', 'ization', '▁with', '▁Un', 'igram']
Unigram advantages
Probabilistic:
- Multiple valid tokenizations
- Can sample different tokenizations (data augmentation)
Subword regularization:
The encode call in the tokenizers library is deterministic (Viterbi); tokenization sampling is exposed by the SentencePiece library, sketched here with a hypothetical model file:
# Sample different tokenizations of the same word
import sentencepiece as spm
sp = spm.SentencePieceProcessor(model_file="unigram.model")  # hypothetical trained Unigram model
for _ in range(3):
    print(sp.encode("tokenization", out_type=str,
                    enable_sampling=True, alpha=0.1, nbest_size=-1))
# Output varies between calls, for example:
# ['▁token', 'ization']
# ['▁tok', 'en', 'ization']
# ['▁token', 'iz', 'ation']
Language-independent:
- No word boundaries needed
- Works for CJK languages (Chinese, Japanese, Korean)
- Treats input as character stream
Trade-offs:
- Slower training (EM algorithm)
- More hyperparameters
- Larger model (stores probabilities)
Algorithm comparison
Training speed
| Algorithm | Small (10MB) | Medium (100MB) | Large (1GB) |
|---|---|---|---|
| BPE | 10-15 sec | 1-2 min | 10-20 min |
| WordPiece | 15-20 sec | 2-3 min | 15-30 min |
| Unigram | 20-30 sec | 3-5 min | 30-60 min |
Tested on: 16-core CPU, 30k vocab
Tokenization quality
Tested on English Wikipedia:
| Algorithm | Vocab Size | Tokens/Word | Unknown Rate |
|---|---|---|---|
| BPE | 30k | 1.3 | 0.5% |
| WordPiece | 30k | 1.2 | 1.2% |
| Unigram | 8k | 1.5 | 0.3% |
Key observations:
- WordPiece: Slightly better compression
- BPE: Lower unknown rate
- Unigram: Smallest vocab, good coverage
Compression ratio
Characters per token (higher = better compression):
| Language | BPE (30k) | WordPiece (30k) | Unigram (8k) |
|---|---|---|---|
| English | 4.2 | 4.5 | 3.8 |
| Chinese | 2.1 | 2.3 | 2.5 |
| Arabic | 3.5 | 3.8 | 3.2 |
Best for each:
- English: WordPiece
- Chinese: Unigram (language-independent)
- Arabic: WordPiece
Use case recommendations
BPE - Best for:
- English language models
- Code (handles symbols well)
- Fast training needed
- Models: GPT-2, GPT-3, RoBERTa, BART
WordPiece - Best for:
- Masked language modeling (BERT-style)
- Morphologically rich languages
- Semantic understanding tasks
- Models: BERT, DistilBERT, ELECTRA
Unigram - Best for:
- Multilingual models
- Languages without word boundaries (CJK)
- Data augmentation via subword regularization
- Models: T5, ALBERT, XLNet (via SentencePiece)
Advanced topics
Handling rare words
BPE approach:
"antidisestablishmentarianism"
→ ['anti', 'dis', 'establish', 'ment', 'arian', 'ism']
WordPiece approach:
"antidisestablishmentarianism"
→ ['anti', '##dis', '##establish', '##ment', '##arian', '##ism']
Unigram approach:
"antidisestablishmentarianism"
→ ['▁anti', 'dis', 'establish', 'ment', 'arian', 'ism']
Handling numbers
Challenge: Infinite number combinations
BPE solution: Byte-level (handles any digit sequence)
tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = ByteLevel()
# Handles any number
"123456789" → byte-level tokens
WordPiece solution: Digit pre-tokenization
from tokenizers.pre_tokenizers import Digits
# Split digits individually or as groups
tokenizer.pre_tokenizer = Digits(individual_digits=True)
"123" → ['1', '2', '3']
Unigram solution: Learns common number patterns
# Learns patterns during training
"2023" → ['202', '3'] or ['20', '23']
Handling case sensitivity
Lowercase (BERT):
from tokenizers.normalizers import Lowercase
tokenizer.normalizer = Lowercase()
"Hello WORLD" → "hello world" → ['hello', 'world']
Preserve case (GPT-2):
# GPT-2 applies no case normalization; simply leave the normalizer unset
"Hello WORLD" → ['Hello', 'WORLD']
Cased tokens (RoBERTa):
# Learns separate tokens for different cases
Vocabulary: ['Hello', 'hello', 'HELLO', 'world', 'WORLD']
Handling emojis and special characters
Byte-level (GPT-2):
tokenizer.pre_tokenizer = ByteLevel()
"Hello 🌍 👋" → byte-level representation (always works)
Unicode normalization:
from tokenizers.normalizers import NFKC
tokenizer.normalizer = NFKC()
"é" (composed) ↔ "é" (decomposed) → normalized to one form
Troubleshooting
Issue: Poor subword splitting
Symptom:
"running" → ['r', 'u', 'n', 'n', 'i', 'n', 'g'] (too granular)
Solutions:
- Increase vocabulary size
- Train longer (more merge iterations)
- Lower the min_frequency threshold
Issue: Too many unknown tokens
Symptom:
5% of tokens are [UNK]
Solutions:
- Increase vocabulary size
- Use byte-level BPE (no UNK possible)
- Verify training corpus is representative
Issue: Inconsistent tokenization
Symptom:
"running" → ['run', 'ning']
"runner" → ['r', 'u', 'n', 'n', 'e', 'r']
Solutions:
- Check normalization consistency
- Ensure pre-tokenization is deterministic
- If Unigram subword-regularization sampling is enabled, disable it at inference time
Best practices
- Match algorithm to model architecture:
  - BERT-style → WordPiece
  - GPT-style → BPE
  - T5-style → Unigram
- Use byte-level for multilingual:
  - Handles any Unicode
  - No unknown tokens
- Test on representative data:
  - Measure compression ratio
  - Check unknown token rate
  - Inspect sample tokenizations
- Version control tokenizers (see the sketch below):
  - Save with the model
  - Document special tokens
  - Track vocabulary changes
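For the last point, the tokenizers library serializes the full tokenizer (vocabulary, merges or probabilities, normalizer, special tokens) to a single JSON file that can be committed alongside the model:
# Save the trained tokenizer next to the model weights and commit it
tokenizer.save("tokenizer.json")

# Reload it later; the file records vocabulary, merges/probabilities and configuration
from tokenizers import Tokenizer
tokenizer = Tokenizer.from_file("tokenizer.json")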