Tokenization Algorithms Deep Dive
Comprehensive explanation of BPE, WordPiece, and Unigram algorithms.
Byte-Pair Encoding (BPE)
Algorithm overview
BPE iteratively merges the most frequent pair of tokens in a corpus.
Training process:
- Initialize vocabulary with all characters
- Count frequency of all adjacent token pairs
- Merge most frequent pair into new token
- Add new token to vocabulary
- Update corpus with new token
- Repeat until vocabulary size reached
Step-by-step example
Corpus:
low: 5
lower: 2
newest: 6
widest: 3
Iteration 1:
Count pairs:
'e' + 's': 9 (newest: 6, widest: 3) ← most frequent
'l' + 'o': 7
'o' + 'w': 7
...
Merge: 'e' + 's' → 'es'
Updated corpus:
low: 5
lower: 2
newest: 6 → n e w es t
widest: 3 → w i d es t
Vocabulary: [a-z] + ['es']
Iteration 2:
Count pairs:
'es' + 't': 9 ← most frequent
'l' + 'o': 7
...
Merge: 'es' + 't' → 'est'
Updated corpus:
low: 5
lower: 2
newest: 6 → n e w est
widest: 3 → w i d est
Vocabulary: [a-z] + ['es', 'est']
Continue until desired vocabulary size...
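The merge loop above is small enough to sketch from scratch. The following is illustrative code only (not any particular library), reproducing the two iterations on the toy corpus:
from collections import Counter

# Toy corpus from the example: tokenized word -> frequency
corpus = {('l', 'o', 'w'): 5, ('l', 'o', 'w', 'e', 'r'): 2,
          ('n', 'e', 'w', 'e', 's', 't'): 6, ('w', 'i', 'd', 'e', 's', 't'): 3}

def count_pairs(corpus):
    # Count adjacent token pairs, weighted by word frequency
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(corpus, pair):
    # Replace every occurrence of the pair with a single merged token
    merged = {}
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

merges = []
for _ in range(2):                    # two iterations, as in the example above
    pairs = count_pairs(corpus)
    best = max(pairs, key=pairs.get)  # ties broken by first occurrence
    merges.append(best)
    corpus = merge_pair(corpus, best)

print(merges)  # [('e', 's'), ('es', 't')]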
Tokenization with trained BPE
Given vocabulary: ['l', 'o', 'w', 'e', 'r', 'n', 's', 't', 'i', 'd', 'es', 'est', 'lo', 'low', 'ne', 'new', 'newest', 'wi', 'wid', 'widest']
Tokenize "lowest":
Step 1: Split into characters
['l', 'o', 'w', 'e', 's', 't']
Step 2: Apply merges in order learned during training
- Merge 'l' + 'o' → 'lo' (if this merge was learned)
- Merge 'lo' + 'w' → 'low' (if learned)
- Merge 'e' + 's' → 'es' (learned)
- Merge 'es' + 't' → 'est' (learned)
Final: ['low', 'est']
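A minimal sketch of this encoding step, assuming the merge rules are kept as an ordered list (the hand-written merges list below matches the example):
def bpe_encode(word, merges):
    # Apply learned merges, in training order, to a new word
    tokens = list(word)                 # start from characters
    for a, b in merges:
        i, out = 0, []
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens

merges = [('e', 's'), ('es', 't'), ('l', 'o'), ('lo', 'w')]  # example merge order
print(bpe_encode("lowest", merges))   # ['low', 'est']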
Implementation
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace
# Initialize
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
# Configure trainer
trainer = BpeTrainer(
    vocab_size=1000,
    min_frequency=2,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
)
# Train
corpus = [
    "This is a sample corpus for BPE training.",
    "BPE learns subword units from the training data.",
    # ... more sentences
]
tokenizer.train_from_iterator(corpus, trainer=trainer)
# Use
output = tokenizer.encode("This is tokenization")
print(output.tokens) # ['This', 'is', 'token', 'ization']
Byte-level BPE (GPT-2 variant)
Problem: Character-level BPE needs a base vocabulary entry for every character it may encounter (Unicode defines well over 100,000), and any character unseen during training becomes [UNK].
Solution: Operate on bytes instead of characters; the base vocabulary is exactly 256 symbols and every string can be represented.
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.decoders import ByteLevel as ByteLevelDecoder
tokenizer = Tokenizer(BPE())
# Byte-level pre-tokenization
tokenizer.pre_tokenizer = ByteLevel()
tokenizer.decoder = ByteLevelDecoder()
# After training, this handles ALL possible characters, including emojis
text = "Hello 🌍 世界"
tokens = tokenizer.encode(text).tokens
Advantages:
- Handles any Unicode character (everything decomposes into the 256 possible bytes)
- No unknown tokens (worst case: bytes)
- Used by GPT-2, GPT-3, BART
Trade-offs:
- Slightly worse compression (bytes vs characters)
- More tokens for non-ASCII text
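The extra cost for non-ASCII text follows directly from UTF-8: byte-level BPE starts from one symbol per byte, and non-ASCII characters span several bytes. A quick check with plain Python, no tokenizer involved:
for s in ["Hello", "🌍", "世界"]:
    print(s, "->", len(s), "chars,", len(s.encode("utf-8")), "bytes")
# Hello -> 5 chars, 5 bytes
# 🌍 -> 1 chars, 4 bytes
# 世界 -> 2 chars, 6 bytes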
BPE variants
SentencePiece BPE:
- Language-independent (no whitespace pre-tokenization required)
- Treats the input as a raw character stream and marks spaces with ▁
- SentencePiece also offers a Unigram mode, which is what T5, ALBERT, and XLNet use (see below)
BPE-dropout:
- Randomly skips some merges when segmenting text
- Yields varied segmentations of the same word (subword regularization)
- Makes the downstream model more robust and less overfit to a single segmentation
WordPiece
Algorithm overview
WordPiece is similar to BPE but uses a different merge selection criterion.
Training process:
- Initialize vocabulary with all characters
- Count frequency of all token pairs
- Score each pair: score = freq(pair) / (freq(first) × freq(second))
- Merge the pair with the highest score
- Repeat until vocabulary size reached
Why different scoring?
BPE: Merges most frequent pairs
- "aa" appears 100 times → high priority
- Even if 'a' appears 1000 times alone
WordPiece: Merges pairs that are strongly associated (appear together more often than their parts appear separately)
- "aa" appears 100 times, 'a' appears 1000 times → low score (100 / (1000 × 1000))
- "th" appears 50 times, 't' appears 60 times, 'h' appears 55 times → high score (50 / (60 × 55))
- Prioritizes pairs that appear together more than expected
Step-by-step example
Corpus:
low: 5
lower: 2
newest: 6
widest: 3
Iteration 1:
Count token frequencies:
'e': 17 (lower: 2, newest: 2 × 6, widest: 3)
'w': 16
's': 9
't': 9
'l': 7
'o': 7
'i': 3
'd': 3
...
Count pairs:
'e' + 's': 9 (newest: 6, widest: 3)
's' + 't': 9 (newest: 6, widest: 3)
'l' + 'o': 7
'i' + 'd': 3
...
Compute scores:
score('e' + 's') = 9 / (17 × 9) ≈ 0.059
score('s' + 't') = 9 / (9 × 9) ≈ 0.111
score('l' + 'o') = 7 / (7 × 7) ≈ 0.143
score('i' + 'd') = 3 / (3 × 3) ≈ 0.333 ← highest score
Choose: 'i' + 'd' → 'id', even though 'e' + 's' and 's' + 't' are the most frequent pairs
Key difference: WordPiece favors pairs whose parts rarely occur outside the pair, not simply the most frequent pairs.
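A small sketch that recomputes these frequencies and scores from the toy corpus (illustrative only, not the actual WordPiece trainer):
from collections import Counter

corpus = {('l', 'o', 'w'): 5, ('l', 'o', 'w', 'e', 'r'): 2,
          ('n', 'e', 'w', 'e', 's', 't'): 6, ('w', 'i', 'd', 'e', 's', 't'): 3}

tokens, pairs = Counter(), Counter()
for word, freq in corpus.items():
    for tok in word:
        tokens[tok] += freq               # individual token frequencies
    for a, b in zip(word, word[1:]):
        pairs[(a, b)] += freq             # adjacent pair frequencies

scores = {p: pairs[p] / (tokens[p[0]] * tokens[p[1]]) for p in pairs}
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1])[:3]:
    print(pair, round(score, 3))
# ('i', 'd') 0.333
# ('l', 'o') 0.143
# ('s', 't') 0.111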
Tokenization with WordPiece
Given vocabulary: ['l', 'o', 'w', 'new', 'low', '##e', '##s', '##t', '##est']
Tokenize "lowest":
Step 1: Find longest matching prefix
'lowest' → 'low' (matches)
Step 2: Find longest match for the remainder (non-initial pieces carry the ## prefix)
'est' → '##est' (matches)
Final: ['low', '##est']
If no match:
Tokenize "unknownword":
'unknownword' → no match
'unknown' → no match
'unkn' → no match
'un' → no match
'u' → no match
→ [UNK]
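A minimal sketch of this greedy longest-match-first lookup, assuming a BERT-style vocabulary with the ## continuation prefix (the max_chars cutoff mirrors BERT's behaviour of mapping very long words to [UNK]):
def wordpiece_encode(word, vocab, unk="[UNK]", max_chars=100):
    # Greedy longest-match-first WordPiece encoding of a single word
    if len(word) > max_chars:
        return [unk]
    tokens, start = [], 0
    while start < len(word):
        end, cur = len(word), None
        # Find the longest vocabulary entry matching word[start:end]
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece      # non-initial pieces carry the ## prefix
            if piece in vocab:
                cur = piece
                break
            end -= 1
        if cur is None:
            return [unk]                  # any unmatched span makes the whole word [UNK]
        tokens.append(cur)
        start = end
    return tokens

vocab = {'l', 'o', 'w', 'new', 'low', '##e', '##s', '##t', '##est'}
print(wordpiece_encode("lowest", vocab))        # ['low', '##est']
print(wordpiece_encode("unknownword", vocab))   # ['[UNK]']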
Implementation
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer
from tokenizers.normalizers import BertNormalizer
from tokenizers.pre_tokenizers import BertPreTokenizer
# Initialize BERT-style tokenizer
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
# Normalization (lowercase, accent stripping)
tokenizer.normalizer = BertNormalizer(lowercase=True)
# Pre-tokenization (whitespace + punctuation)
tokenizer.pre_tokenizer = BertPreTokenizer()
# Configure trainer
trainer = WordPieceTrainer(
    vocab_size=30522,  # BERT vocab size
    min_frequency=2,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    continuing_subword_prefix="##"  # BERT uses ##
)
# Train
tokenizer.train_from_iterator(corpus, trainer=trainer)
# Use
output = tokenizer.encode("Tokenization works great!")
print(output.tokens) # ['token', '##ization', 'works', 'great', '!']
Subword prefix
BERT uses ## prefix:
"unbelievable" → ['un', '##believ', '##able']
Why?
- Indicates token is a continuation
- Allows reconstruction: remove ##, concatenate
- Helps model distinguish word boundaries
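Reconstruction is then a one-line loop; a minimal sketch:
def wordpiece_detokenize(tokens):
    # '##' pieces attach to the previous token; other tokens start a new word
    text = ""
    for tok in tokens:
        text += tok[2:] if tok.startswith("##") else (" " + tok if text else tok)
    return text

print(wordpiece_detokenize(['un', '##believ', '##able']))   # unbelievable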
WordPiece advantages
Semantic merges:
- Prioritizes meaningful combinations
- "qu" has high score (always together)
- "qx" has low score (rare combination)
Better for morphology:
- Captures affixes: un-, -ing, -ed
- Preserves word stems
Trade-offs:
- Slower training than BPE
- More memory (stores vocabulary, not merges)
- Google's original implementation is not open-source (Hugging Face provides a reimplementation)
Unigram
Algorithm overview
Unigram works backward: start with large vocabulary, remove tokens.
Training process:
- Initialize with large vocabulary (all substrings)
- Estimate probability of each token (frequency-based)
- For each token, compute loss increase if removed
- Remove 10-20% of tokens with lowest loss impact
- Re-estimate probabilities
- Repeat until desired vocabulary size
Probabilistic tokenization
Unigram assumption: Each token is independent.
Given vocabulary with probabilities:
P('low') = 0.02
P('l') = 0.01
P('o') = 0.015
P('w') = 0.01
P('est') = 0.03
P('e') = 0.02
P('s') = 0.015
P('t') = 0.015
Tokenize "lowest":
Option 1: ['low', 'est']
P = P('low') × P('est') = 0.02 × 0.03 = 0.0006
Option 2: ['l', 'o', 'w', 'est']
P = 0.01 × 0.015 × 0.01 × 0.03 = 0.000000045
Option 3: ['low', 'e', 's', 't']
P = 0.02 × 0.02 × 0.015 × 0.015 = 0.00000009
Choose option 1 (highest probability)
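The same arithmetic as a quick sketch:
from math import prod

probs = {'low': 0.02, 'l': 0.01, 'o': 0.015, 'w': 0.01,
         'est': 0.03, 'e': 0.02, 's': 0.015, 't': 0.015}

for option in [['low', 'est'], ['l', 'o', 'w', 'est'], ['low', 'e', 's', 't']]:
    print(option, f"{prod(probs[t] for t in option):.2e}")
# ['low', 'est'] 6.00e-04
# ['l', 'o', 'w', 'est'] 4.50e-08
# ['low', 'e', 's', 't'] 9.00e-08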
Viterbi algorithm
Finding best tokenization is expensive (exponential possibilities).
Viterbi algorithm (dynamic programming):
from math import log

def tokenize_viterbi(word, vocab, probs):
    n = len(word)
    # dp[i] = (best_log_prob, best_tokens) for word[:i]
    dp = [(float('-inf'), []) for _ in range(n + 1)]
    dp[0] = (0.0, [])  # empty prefix has log probability 0
    for i in range(1, n + 1):
        best_prob = float('-inf')
        best_tokens = []
        # Try all possible last tokens ending at position i
        for j in range(i):
            token = word[j:i]
            if token in vocab and dp[j][0] > float('-inf'):
                prob = dp[j][0] + log(probs[token])
                if prob > best_prob:
                    best_prob = prob
                    best_tokens = dp[j][1] + [token]
        dp[i] = (best_prob, best_tokens)
    return dp[n][1]
Time complexity: O(n²) candidate substrings, each checked with a hash-based vocabulary lookup, versus O(2^n) possible segmentations for brute force.
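Using the probabilities from the example above, the function recovers the best segmentation:
probs = {'low': 0.02, 'l': 0.01, 'o': 0.015, 'w': 0.01,
         'est': 0.03, 'e': 0.02, 's': 0.015, 't': 0.015}
print(tokenize_viterbi("lowest", set(probs), probs))  # ['low', 'est']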
Implementation
from tokenizers import Tokenizer
from tokenizers.models import Unigram
from tokenizers.trainers import UnigramTrainer
from tokenizers.pre_tokenizers import Metaspace
# Initialize
tokenizer = Tokenizer(Unigram())
# SentencePiece-style pre-tokenization so spaces become ▁ (as in the output below)
tokenizer.pre_tokenizer = Metaspace()
# Configure trainer
trainer = UnigramTrainer(
    vocab_size=8000,
    special_tokens=["<unk>", "<s>", "</s>"],
    unk_token="<unk>",
    max_piece_length=16,    # Max token length
    n_sub_iterations=2,     # EM iterations
    shrinking_factor=0.75   # Remove 25% each iteration
)
# Train
tokenizer.train_from_iterator(corpus, trainer=trainer)
# Use
output = tokenizer.encode("Tokenization with Unigram")
print(output.tokens) # ['▁Token', 'ization', '▁with', '▁Un', 'igram']
Unigram advantages
Probabilistic:
- Multiple valid tokenizations
- Can sample different tokenizations (data augmentation)
Subword regularization:
The encode call in the tokenizers library is deterministic (Viterbi); tokenization sampling is exposed by the SentencePiece library, sketched here with a hypothetical model file:
# Sample different tokenizations of the same word
import sentencepiece as spm
sp = spm.SentencePieceProcessor(model_file="unigram.model")  # hypothetical trained Unigram model
for _ in range(3):
    print(sp.encode("tokenization", out_type=str,
                    enable_sampling=True, alpha=0.1, nbest_size=-1))
# Output varies between calls, for example:
# ['▁token', 'ization']
# ['▁tok', 'en', 'ization']
# ['▁token', 'iz', 'ation']
Language-independent:
- No word boundaries needed
- Works for CJK languages (Chinese, Japanese, Korean)
- Treats input as character stream
Trade-offs:
- Slower training (EM algorithm)
- More hyperparameters
- Larger model (stores probabilities)
Algorithm comparison
Training speed
| Algorithm | Small (10MB) | Medium (100MB) | Large (1GB) |
|---|---|---|---|
| BPE | 10-15 sec | 1-2 min | 10-20 min |
| WordPiece | 15-20 sec | 2-3 min | 15-30 min |
| Unigram | 20-30 sec | 3-5 min | 30-60 min |
Tested on: 16-core CPU, 30k vocab
Tokenization quality
Tested on English Wikipedia:
| Algorithm | Vocab Size | Tokens/Word | Unknown Rate |
|---|---|---|---|
| BPE | 30k | 1.3 | 0.5% |
| WordPiece | 30k | 1.2 | 1.2% |
| Unigram | 8k | 1.5 | 0.3% |
Key observations:
- WordPiece: Slightly better compression
- BPE: Lower unknown rate
- Unigram: Smallest vocab, good coverage
Compression ratio
Characters per token (higher = better compression):
| Language | BPE (30k) | WordPiece (30k) | Unigram (8k) |
|---|---|---|---|
| English | 4.2 | 4.5 | 3.8 |
| Chinese | 2.1 | 2.3 | 2.5 |
| Arabic | 3.5 | 3.8 | 3.2 |
Best for each:
- English: WordPiece
- Chinese: Unigram (language-independent)
- Arabic: WordPiece
Use case recommendations
BPE - Best for:
- English language models
- Code (handles symbols well)
- Fast training needed
- Models: GPT-2, GPT-3, RoBERTa, BART
WordPiece - Best for:
- Masked language modeling (BERT-style)
- Morphologically rich languages
- Semantic understanding tasks
- Models: BERT, DistilBERT, ELECTRA
Unigram - Best for:
- Multilingual models
- Languages without word boundaries (CJK)
- Data augmentation via subword regularization
- Models: T5, ALBERT, XLNet (via SentencePiece)
Advanced topics
Handling rare words
BPE approach:
"antidisestablishmentarianism"
→ ['anti', 'dis', 'establish', 'ment', 'arian', 'ism']
WordPiece approach:
"antidisestablishmentarianism"
→ ['anti', '##dis', '##establish', '##ment', '##arian', '##ism']
Unigram approach:
"antidisestablishmentarianism"
→ ['▁anti', 'dis', 'establish', 'ment', 'arian', 'ism']
Handling numbers
Challenge: Infinite number combinations
BPE solution: Byte-level (handles any digit sequence)
tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = ByteLevel()
# Handles any number
"123456789" → byte-level tokens
WordPiece solution: Digit pre-tokenization
from tokenizers.pre_tokenizers import Digits
# Split digits individually or as groups
tokenizer.pre_tokenizer = Digits(individual_digits=True)
"123" → ['1', '2', '3']
Unigram solution: Learns common number patterns
# Learns patterns during training
"2023" → ['202', '3'] or ['20', '23']
Handling case sensitivity
Lowercase (BERT):
from tokenizers.normalizers import Lowercase
tokenizer.normalizer = Lowercase()
"Hello WORLD" → "hello world" → ['hello', 'world']
Preserve case (GPT-2):
# GPT-2 applies no case normalization; simply leave the normalizer unset
"Hello WORLD" → ['Hello', 'WORLD']
Cased tokens (RoBERTa):
# Learns separate tokens for different cases
Vocabulary: ['Hello', 'hello', 'HELLO', 'world', 'WORLD']
Handling emojis and special characters
Byte-level (GPT-2):
tokenizer.pre_tokenizer = ByteLevel()
"Hello 🌍 👋" → byte-level representation (always works)
Unicode normalization:
from tokenizers.normalizers import NFKC
tokenizer.normalizer = NFKC()
"é" (composed) ↔ "é" (decomposed) → normalized to one form
Troubleshooting
Issue: Poor subword splitting
Symptom:
"running" → ['r', 'u', 'n', 'n', 'i', 'n', 'g'] (too granular)
Solutions:
- Increase vocabulary size
- Train longer (more merge iterations)
- Lower the min_frequency threshold
Issue: Too many unknown tokens
Symptom:
5% of tokens are [UNK]
Solutions:
- Increase vocabulary size
- Use byte-level BPE (no UNK possible)
- Verify training corpus is representative
Issue: Inconsistent tokenization
Symptom:
"running" → ['run', 'ning']
"runner" → ['r', 'u', 'n', 'n', 'e', 'r']
Solutions:
- Check normalization consistency
- Ensure pre-tokenization is deterministic
- If Unigram subword-regularization sampling is enabled, disable it at inference time
Best practices
- Match algorithm to model architecture:
  - BERT-style → WordPiece
  - GPT-style → BPE
  - T5-style → Unigram
- Use byte-level for multilingual:
  - Handles any Unicode
  - No unknown tokens
- Test on representative data:
  - Measure compression ratio
  - Check unknown token rate
  - Inspect sample tokenizations
- Version control tokenizers (see the sketch below):
  - Save with the model
  - Document special tokens
  - Track vocabulary changes
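For the last point, the tokenizers library serializes the full tokenizer (vocabulary, merges or probabilities, normalizer, special tokens) to a single JSON file that can be committed alongside the model:
# Save the trained tokenizer next to the model weights and commit it
tokenizer.save("tokenizer.json")

# Reload it later; the file records vocabulary, merges/probabilities and configuration
from tokenizers import Tokenizer
tokenizer = Tokenizer.from_file("tokenizer.json")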