Tokenisation

LLMs read tokens, not text: the BPE algorithm, tiktoken and Anthropic tokenisers, the non-English cost penalty, and context window budgeting at production scale.

LLMs don't read text. They read tokens. Tokenisation is the step that converts text into integer IDs the model can process. Understanding it matters for: cost estimation, context window budgeting, prompt design, and explaining model failures.


What Is a Token?

A token is roughly 3-4 characters of English text. Rules of thumb:

  • 1 token ≈ 4 characters ≈ 0.75 words
  • 100 tokens ≈ 75 words ≈ a short paragraph
  • 1,000 tokens ≈ 750 words ≈ a page of text
  • 1M tokens ≈ 750K words ≈ a large novel

These are averages. Code tokenises worse than prose (identifiers, operators). Non-Latin scripts tokenise much worse. A single CJK or Arabic character may be 1-4 tokens.
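
A quick way to see how rough these averages are in practice, using tiktoken (introduced below); the exact counts vary by encoding and content:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "prose": "The quick brown fox jumps over the lazy dog.",
    "code": "def f(x): return {k: v for k, v in x.items() if v > 0}",
}
for name, text in samples.items():
    actual = len(enc.encode(text))
    estimate = len(text) / 4          # the rule-of-thumb estimate
    print(f"{name}: {actual} tokens, chars/4 estimate {estimate:.0f}")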


Byte-Pair Encoding (BPE)

The dominant tokenisation algorithm. BPE builds a vocabulary by iteratively merging the most frequent byte pairs in a corpus.

Algorithm:

  1. Start with individual bytes (or characters) as the vocabulary
  2. Count all adjacent pairs in the training corpus
  3. Merge the most frequent pair into a new token
  4. Repeat until vocabulary size is reached (typically 32K-200K tokens)

Worked example:

Training corpus: "low low low lowest newest"

Initial: ['l','o','w',' ','l','o','w',' ','l','o','w',' ','l','o','w','e','s','t',' ','n','e','w','e','s','t']

Step 1: 'lo' is most frequent (4×) → merge → ['lo','w',' ','lo','w',' ','lo','w',' ','lo','w','e','s','t',' ','n','e','w','e','s','t']
Step 2: 'low' is most frequent (4×) → merge → ['low',' ','low',' ','low',' ','low','e','s','t',' ','n','e','w','e','s','t']
...

Common words become single tokens. Rare words split into subword pieces. Unknown words never cause failures; they simply decompose into smaller pieces, down to individual bytes.
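
A minimal sketch of the merge loop, assuming characters as the starting vocabulary (real tokenisers start from bytes and use heavily optimised implementations):

from collections import Counter

def bpe_train(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    # Each word starts as a sequence of single characters
    words = [list(w) for w in corpus]
    merges = []
    for _ in range(num_merges):
        # Count all adjacent pairs across the corpus
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair with a merged token
        new_words = []
        for w in words:
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    out.append(w[i] + w[i + 1])
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_words.append(out)
        words = new_words
    return merges

print(bpe_train(["low", "low", "low", "lowest", "newest"], 2))
# e.g. [('l', 'o'), ('lo', 'w')]: the merges from the example above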


tiktoken

OpenAI's tokeniser library. Fast, Rust-backed, available in Python.

import tiktoken

# cl100k_base: GPT-4, GPT-3.5-turbo
enc = tiktoken.get_encoding("cl100k_base")

# o200k_base: GPT-4o, GPT-4o-mini, o1
# enc = tiktoken.get_encoding("o200k_base")

text = "Hello, how many tokens is this?"
tokens = enc.encode(text)
print(tokens)          # [9906, 11, 1268, 1690, 11460, 374, 420, 30]
print(len(tokens))     # 8

decoded = enc.decode(tokens)
print(decoded)         # "Hello, how many tokens is this?"

# Token count for a given model (looks up that model's encoding)
def count_tokens(text: str, model: str = "gpt-4o") -> int:
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))

Anthropic Tokeniser

Claude uses a custom BPE tokeniser. Count tokens via the API:

import anthropic

client = anthropic.Anthropic()
response = client.messages.count_tokens(
    model="claude-sonnet-4-6",
    system="You are a helpful assistant.",
    messages=[{"role": "user", "content": "How many tokens is this message?"}],
)
print(response.input_tokens)  # exact count

Offline packages (e.g. anthropic-tokenizer) exist but may lag the vocabulary of current Claude models, so treat them as estimates; the API count above is authoritative.


Why Tokenisation Explains Model Behaviour

Arithmetic failures

"1 + 1 = " → tokens: ['1', ' +', ' 1', ' =', ' ']
"999 + 1 = " → tokens: ['999', ' +', ' 1', ' =', ' ']  — '999' is one token
"9999999 + 1 = " → could split as ['9', '999', '999', ' +', ...] — inconsistent representation

Multi-digit arithmetic is hard because numbers tokenise inconsistently. The model learns patterns on whatever token boundaries happen to exist.

Reversal tasks

"Reverse the string hello" is easy. "Reverse helloworld" may fail. helloworld might be a single token, and the model can't see its internal characters.

Non-English text costs more

enc = tiktoken.get_encoding("cl100k_base")
english = "The quick brown fox"
chinese = "敏捷的棕色狐狸"    # same meaning

print(len(enc.encode(english)))   # 4 tokens
print(len(enc.encode(chinese)))   # 12 tokens — 3x more expensive

Special Tokens

Every model adds control tokens to the vocabulary:

  Token                        Purpose
  <|endoftext|>                GPT: end of sequence
  <|im_start|> / <|im_end|>    ChatML format: message boundaries
  <s> / </s>                   Llama: start/end of sequence
  [INST] / [/INST]             Mistral instruction format
  <|begin_of_text|>            Llama 3 BOS token

These tokens are invisible in normal chat but critical in fine-tuning and when building raw prompts.
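
tiktoken refuses special tokens in ordinary text unless you explicitly allow them, which is a useful guard against control tokens pasted into user input. A small sketch:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "some user text<|endoftext|>more text"

# By default, encode() raises if the text contains a special token
try:
    enc.encode(text)
except ValueError:
    print("special token rejected")

# Opt in explicitly to encode it as the control token
tokens = enc.encode(text, allowed_special={"<|endoftext|>"})
print(tokens)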


Context Window Budgeting

For a 200K-token context window:

  • System prompt: 500-2,000 tokens
  • Conversation history: scales with turns
  • Retrieved docs (RAG): 2,000-20,000 tokens
  • Tools/function schemas: 200-500 tokens per tool
  • Output: 1,000-4,000 tokens reserved

A budgeting sketch (count_tokens is the helper from the tiktoken section; truncate_to_tokens is assumed to be a helper that cuts text to a token budget):

def build_prompt_within_budget(
    system: str,
    history: list[dict],
    context: list[str],
    max_tokens: int = 150_000,  # leave 50K headroom in a 200K model
) -> list[dict]:
    system_tokens = count_tokens(system)
    available = max_tokens - system_tokens - 2_000  # reserve for output

    # Fit retrieved context first (truncate if needed)
    context_text = "\n\n".join(context)
    if count_tokens(context_text) > available // 2:
        context_text = truncate_to_tokens(context_text, available // 2)
    available -= count_tokens(context_text)

    # Then fit history: keep the most recent turns that still fit
    kept: list[dict] = []
    for turn in reversed(history):
        cost = count_tokens(turn["content"])
        if cost > available:
            break
        kept.insert(0, turn)
        available -= cost

    return [{"role": "user", "content": context_text}, *kept]

Token Costs at Scale

1 billion API calls per day × 500 input tokens average = 500B tokens/day. At $3/M tokens (Sonnet 4.6): $1.5M/day.

Prompt caching (Anthropic) reduces repeated prefixes to 0.1x cost. A 10,000-token system prompt cached across 1M calls saves ~$27K at Sonnet pricing.
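
The arithmetic behind those figures, written out (the prices are the assumptions stated above, not live pricing):

INPUT_PRICE_PER_MTOK = 3.00     # $ per 1M input tokens (assumed Sonnet pricing)
CACHE_READ_MULTIPLIER = 0.1     # cached prefix reads at roughly 0.1x input price

# Fleet-scale daily input cost
calls_per_day = 1_000_000_000
avg_input_tokens = 500
daily_tokens = calls_per_day * avg_input_tokens
print(daily_tokens / 1e6 * INPUT_PRICE_PER_MTOK)   # 1500000.0, i.e. $1.5M/day

# Savings from caching a 10,000-token system prompt across 1M calls
prompt_tokens = 10_000
calls = 1_000_000
uncached = prompt_tokens * calls / 1e6 * INPUT_PRICE_PER_MTOK
cached = uncached * CACHE_READ_MULTIPLIER
print(uncached - cached)                           # 27000.0, i.e. ~$27K saved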


Key Facts

  • 1 token ≈ 4 characters ≈ 0.75 words for English prose
  • CJK and Arabic scripts: 1 character = 1-4 tokens — 3x more expensive than equivalent English
  • BPE vocabulary sizes: typically 32K-200K tokens built from frequency merges
  • tiktoken encodings: cl100k_base (GPT-4, GPT-3.5-turbo), o200k_base (GPT-4o, GPT-4o-mini, o1)
  • Anthropic token counting: client.messages.count_tokens() gives exact count before the API call
  • 10,000-token system prompt cached across 1M calls saves ~$27K at Sonnet 4.6 pricing ($3/M)
  • Multi-digit arithmetic fails partly because numbers tokenise inconsistently across models
  • Context window budgeting: tools add 200-500 tokens each; reserve 1,000-4,000 for output

Common Failure Cases

Context window budget calculation uses character count instead of token count, causing silent truncation

  • Why: a 200K-token limit is not a 200K-character limit; English prose averages 4 characters per token, but code and structured text are denser, so a character-based guard underestimates token usage and the request is silently truncated or rejected with a 413 error.
  • Detect: API calls succeed for English prose but fail with prompt_too_long for code-heavy requests of the same character length; checking client.messages.count_tokens() before submission reveals the actual count is well above the character-based estimate.
  • Fix: always count tokens using client.messages.count_tokens() (Anthropic) or tiktoken.encode() (OpenAI) before every API call in production; never use len(text) / 4 as the production guard.

Non-English user query costs 3-5x more tokens than expected, blowing per-request cost budgets

  • Why: CJK, Arabic, and other non-Latin scripts tokenise at 1-4 tokens per character with BPE vocabularies trained predominantly on English; a 200-word Japanese message may cost as many tokens as a 600-word English message.
  • Detect: per-request token cost spikes for non-English users; count_tokens() output is 3-5x the word count for the same semantic content in other languages.
  • Fix: measure actual token counts for your target languages during system design; apply per-language cost multipliers in your budget calculations; consider whether a model with a multilingual-optimised vocabulary is more cost-efficient for the use case.

Prompt caching breaks because a model version upgrade changes the tokeniser, invalidating all cached prefixes

  • Why: different model versions may use different BPE vocabularies; the same text can tokenise to different token IDs across model versions, so cached prefixes are incompatible across the version boundary.
  • Detect: cache hit rate drops to 0% immediately after a model version change; checking count_tokens() before and after the upgrade shows different counts for identical text.
  • Fix: treat model upgrades as cache-invalidating events; warm the cache explicitly after upgrading by making one call with each major cached prefix before routing live traffic.

Tool/function schemas consume far more tokens than expected when many tools are bound to an agent

  • Why: each tool's JSON schema (name, description, parameter names, types, and descriptions) is injected into the context on every call; 10 tools with detailed schemas can add 3,000-5,000 tokens per request.
  • Detect: count_tokens() output is significantly higher than expected for a short user message; diffing the token count with and without tool bindings shows the tool schemas are the dominant cost.
  • Fix: only bind the tools the agent needs for the current task step (dynamic tool loading); write concise tool descriptions (1-2 sentences maximum); remove optional parameter descriptions that don't materially help the model.
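
A way to measure that overhead directly, assuming the token-counting endpoint accepts the same tools parameter as messages.create (verify against your SDK version; get_weather is a hypothetical example tool):

import anthropic

client = anthropic.Anthropic()

tools = [{
    "name": "get_weather",
    "description": "Get current weather for a city.",
    "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

messages = [{"role": "user", "content": "What's the weather in Paris?"}]

without_tools = client.messages.count_tokens(
    model="claude-sonnet-4-6", messages=messages
)
with_tools = client.messages.count_tokens(
    model="claude-sonnet-4-6", messages=messages, tools=tools
)
# The difference is what the tool schemas alone cost on every call
print(with_tools.input_tokens - without_tools.input_tokens)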

Connections

Open Questions

  • Does Anthropic's tokeniser handle code identifiers better or worse than tiktoken cl100k_base?
  • How do vocabulary size differences (32K vs 200K) affect multilingual model quality vs token efficiency?
  • Will character-level or byte-level models eventually replace BPE as the dominant approach?