The Hidden Cost of Tokenization
TL;DR: This is a short summary of our position paper The Hidden Cost of Tokenization: Why (most) Non-English Speakers Pay More for Less by Jennifer Haase and Sebastian Pokutta. The basic point is simple: tokenization is not a neutral preprocessing step. It determines how much users pay, how much context they get, how much compute is burned, and potentially how well a model can reason in a language. The same semantic content can require 1.3x, 5x, or even more than 10x as many tokens depending on the language-tokenizer pairing.
The invisible tax
Most users never see the tokenizer. They see an input box, a model name, maybe a price per million tokens, and then a response. Somewhere between the text and the model, however, the text is chopped into model-readable units. Those units are what commercial APIs count and charge for. They are also what determines how much of a document fits into a context window and how much work the transformer has to do.
This would be mostly harmless if tokenizers were language-neutral. They are not. A tokenizer trained mostly on English-like text learns English-like chunks: common words, common morphemes, common byte patterns, common whitespace structure. When this tokenizer sees a language with a different script, different morphology, or simply much less representation in the training corpus, the same meaning can fragment into many more tokens.
In the paper we make the following point: the same semantic content can require several times as many tokens in one language as in another, purely as a consequence of how the tokenizer was built.
This sounds like a technical detail until you remember that the entire business model of many LLM services is token-metered. If English needs 100 tokens and Arabic, Bengali, Burmese, Amharic, or Dzongkha need substantially more tokens for the same content, then non-English users are paying a hidden language tax. Not because their tasks are harder, but because the infrastructure has encoded a preference. However, it is not only about cost but also about capability.
What tokenization actually does
Modern LLMs do not process text as characters or words. They process integer IDs. A tokenizer maps a string to those IDs. Most common systems use variants of byte pair encoding (BPE), unigram tokenization, or related subword schemes. In a BPE-like tokenizer, frequent character sequences are merged into larger units. Very frequent chunks become single tokens; rare chunks remain fragmented.
This is a reasonable compression trick, but it has an uncomfortable consequence: the tokenizer’s training distribution becomes part of the model’s access layer. If the tokenizer sees English constantly, “tokenization” may be a single token or a small number of tokens. If it sees a morphologically rich or underrepresented language rarely, it may fall back to smaller pieces, sometimes even byte-like fragments.
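To make this concrete, here is a minimal sketch using the open-source tiktoken library and its o200k_base encoding (the public encoding behind OpenAI's o200k tokenizer); the example phrases are illustrative and the exact splits depend on the tokenizer version:

```python
import tiktoken

# o200k_base is the public encoding behind OpenAI's o200k tokenizer.
enc = tiktoken.get_encoding("o200k_base")

# The same idea ("data processing") in different scripts; purely illustrative.
for text in ["data processing", "Datenverarbeitung", "обработка данных"]:
    ids = enc.encode(text)
    # Decode each token ID back to its raw bytes to see the fragments.
    pieces = [enc.decode_single_token_bytes(t) for t in ids]
    print(f"{text!r}: {len(ids)} tokens -> {pieces}")
```

Counting tokens for the same content across languages is exactly the kind of comparison that makes the disparity visible.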
The extent of this “fragmentation” impacts four things at the same time:
- Cost. API pricing is usually per input and output token.
- Latency. More tokens mean more transformer steps over a longer sequence.
- Effective context. A 128k-token window is not the same semantic window in every language.
- Representational quality. Poor boundaries can cut across meaningful units, forcing the model to reconstruct meaning from pieces that were not designed to stand alone.
The first three points are already enough to make tokenization a fairness issue. The fourth is more subtle, but perhaps the most interesting of the four.
Same meaning, different budget
To make the impact tangible, the (deliberately simple) widget lets you compare different scenarios. Pick a reported language-tokenizer situation and compare it with an English baseline. The multipliers are rounded examples from the paper and the literature discussed there [HP26]; they should be read as order-of-magnitude signals, while the exact overhead depends heavily on the scenario, the use case, and the exact tokenizer. Identical semantic content is assumed; the numbers were benchmarked/calibrated on multilingual corpora.
What this reveals is that while many models advertise “multilingual support”, that phrase, without qualification, does not tell us how well a target language is supported. A model may technically accept a language and still serve it poorly. If a language needs five times as many tokens, it receives one fifth of the effective context window, costs five times as much to serve, and requires roughly five times as much token-side inference work for the same amount of content. Support is not the same as parity.
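The arithmetic behind this claim is simple enough to sketch directly; the multipliers below are the rounded, order-of-magnitude examples discussed above, not measurements:

```python
def effective_budget(context_window: int, token_multiplier: float) -> float:
    """How many 'English-equivalent' tokens of content fit into the window."""
    return context_window / token_multiplier

def relative_cost(token_multiplier: float) -> float:
    """Cost per unit of content relative to the English baseline,
    assuming flat per-token pricing."""
    return token_multiplier

window = 128_000  # advertised context window, in tokens
for label, multiplier in [("baseline (English)", 1.0),
                          ("moderate overhead", 1.3),
                          ("heavy overhead", 5.0),
                          ("extreme overhead", 10.0)]:
    print(f"{label:20s} ~{effective_budget(window, multiplier):>9,.0f} "
          f"English-equivalent tokens, {relative_cost(multiplier):.1f}x cost")
```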
Not just a billing problem
The cost issue is easy to understand and therefore easy to trivialize: yes, some users pay more, but perhaps this is just how compression works? There is some truth in that. Languages are different. Scripts differ. Morphology differs. Some languages are more compact in characters, some less. A fixed vocabulary is a scarce resource. A tokenizer cannot allocate the same vocabulary slots to English, German compounds, Chinese characters, Arabic morphology, Indic scripts, and every low-resource language simultaneously. But precisely because the vocabulary is a limited resource, design choices matter.
In our paper we compare three tokenizer families: OpenAI’s GPT o200k tokenizer, Alibaba’s Qwen tokenizer, and EuroLLM’s tokenizer [HP26]. The patterns are quite revealing: The English-adjacent tokenizer is excellent for English and much worse for many other languages. Qwen shows that targeted optimization can make Chinese highly efficient while remaining competitive for English. EuroLLM is a cautionary case: broad multilingual ambition does not automatically imply efficient multilingual representation; it basically underperforms in all its languages simultaneously when it comes to tokenization efficiency.
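A comparison along these lines can be reproduced with publicly released tokenizers. The sketch below assumes the tiktoken and Hugging Face transformers libraries; the specific model identifiers and the single example sentence are illustrative assumptions, and a proper audit would run over a parallel corpus instead:

```python
import tiktoken
from transformers import AutoTokenizer

# The same (translated) sentence under different tokenizers; illustrative only.
samples = {
    "en": "The weather will be sunny tomorrow.",
    "de": "Das Wetter wird morgen sonnig sein.",
    "zh": "明天天气晴朗。",
}

o200k = tiktoken.get_encoding("o200k_base")
# The Hugging Face IDs below are assumptions; swap in the checkpoints you compare.
qwen = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")
eurollm = AutoTokenizer.from_pretrained("utter-project/EuroLLM-9B")

count = {
    "o200k":   lambda s: len(o200k.encode(s)),
    "Qwen":    lambda s: len(qwen.encode(s, add_special_tokens=False)),
    "EuroLLM": lambda s: len(eurollm.encode(s, add_special_tokens=False)),
}

for lang, sentence in samples.items():
    print(lang, {name: fn(sentence) for name, fn in count.items()})
```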
I would like to stress that this is not about one tokenizer being morally good and another being morally bad. Rather, the point is that tokenization outcomes are design outcomes. Chinese under a Chinese-aware tokenizer and Chinese under an English-adjacent tokenizer are not the same experience. That means the disparity is not a law of nature. It is a result of vocabulary allocation, data mixture, encoding choices, and what the system was optimized to do.
Fragmentation as cognitive friction
So far we have only looked at the question of costs and efficiency. However, there is a second component that we suspect is even more important, although it is currently the more speculative part of the paper. We call this cognitive friction. This is not meant as a mystical claim that LLMs think like humans. It is an analogy: if a representation cuts through the meaningful units of a language, the model has to reconstruct meaning through a worse interface, and that impairs performance.
For English, a tokenizer may learn convenient chunks such as “reasoning”, “tokenization”, “pre”, “-ing”, or common whitespace-prefixed words. For Turkish, Finnish, Arabic, Ukrainian, or many Indic languages, meaningful morphology can be distributed differently. For Chinese and Japanese, word boundaries do not behave like whitespace-separated English words. And with byte-level fallback, non-Latin scripts can end up represented as byte fragments that are very far from the linguistic units a speaker would recognize.
Interlude. Coincidentally, after we wrote our paper, Anthropic changed the tokenizer for Claude Opus 4.7 compared to 4.6; [Anthropic’s Claude Opus 4.7 announcement], [efficiency comparison]. It became much less efficient (the same input requires roughly 1.3x as many tokens), but the change was meant to improve instruction following. Such a move is somewhat unusual, as it “breaks” compatibility across model stacks, etc. Now, opinions on Opus 4.7 are somewhat mixed, with many reporting it as a regression from 4.6. This may or may not be related to the tokenizer change, but it is certainly a thought worth entertaining.
The widget below schematically demonstrates this; note that it is not an actual tokenizer emulator. It is meant to make the representational issue visible: the same string can be broken into fragments that either align reasonably with semantic units or cut across them awkwardly.
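The same idea in a toy sketch; the two segmentations below are hand-constructed for illustration and are not the output of any real tokenizer:

```python
# Two hypothetical segmentations of the same word: one roughly aligned with
# morphology, one cutting across it. Neither comes from a real tokenizer.
word = "unbelievable"
segmentations = {
    "aligned":    ["un", "believ", "able"],
    "misaligned": ["unb", "elie", "vab", "le"],
}

for name, pieces in segmentations.items():
    assert "".join(pieces) == word  # both spell the same string
    print(f"{name:10s} ({len(pieces)} pieces): {' | '.join(pieces)}")
```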
Why does this matter for reasoning? One very mundane reason is sequence length. Longer sequences are harder and more expensive to process, attention needs to span more tokens, and long-context models are not uniformly reliable across positions; the “lost in the middle” phenomenon is one example [LLHPBPL24]. But there is also a representational reason: if the model consistently sees a language through awkward subword fragments, then the basic units of prediction and attention are misaligned with the units that carry meaning.
There is already quite a bit of evidence in this direction. Prior work documents cost disparities across languages [AKGKMT23], tokenizer parity failures [BLPT23], performance differences from tokenizer choice alone [AFTRL24], and even arithmetic failures caused by number-tokenization choices [SS24]. To be clear, we do not claim that tokenization is the only thing that matters; however, these observations fit nicely into one argument: tokenization sits early enough in the stack that its consequences compound.
The double-jeopardy pattern
A poorly served language does not merely cost more. It can also receive less utility for the cost: this is known as double jeopardy [SVKD24].
Suppose an English user and a Bengali user both have access to an 8,000-token context window. If the Bengali text needs roughly six times as many tokens for comparable content under a given tokenizer, then the Bengali user has about one sixth of the usable semantic budget. Fewer examples fit. Less document context fits. Longer reasoning traces hit the limit sooner. If latency matters, the Bengali user also waits longer. If the provider pays for compute, the provider has an incentive to serve that user less aggressively or charge more.
This is where the fairness issue becomes embedded into infrastructure: not a single subpar output, but rather a quiet degradation of service quality mediated through pricing, context length, latency, and model internals.
The obvious counterarguments
There are two reasonable objections.
First, languages really are different. A sentence is not a unit of equal information across languages, and a word is even worse. Some translations are longer than others. Some scripts encode differently in Unicode. Some languages have rich morphology. So a perfect one-tokenization-fits-all parity metric is too naive.
I agree. But that is an argument for measuring carefully, not for ignoring the gap. Parallel corpora are not perfect, but they are good enough to reveal large disparities. A 10x or 15x premium is not explained away by translation style.
Second, tokenization is compression under a fixed vocabulary budget. If you give more vocabulary to one language, you take it away from another. Again, true. But the relevant question is not whether tradeoffs exist. The question is whether today’s tradeoffs are defensible, transparent, and aligned with the users who bear their costs or receive subpar model performance.
There are already promising directions: parity-aware BPE [FMPNABS25], adaptive tokenization [CGWXZZZ24], language-specific tokenizers, tokenizer replacement or extension, and language-specific foundation models. None of these is free. All of them force us to be explicit about the allocation problem.
What should change
At minimum, model providers should report tokenizer efficiency, not just benchmark scores. A model card that says “supports 100+ languages” should also tell us the token multiplier by language, by domain, and preferably by script and dialectal variation. This is simple benchmarking across parallel corpora. Otherwise “support” hides a large cost-quality gradient.
For research papers, I would like to see “tokenizer fertility” and effective-context ratios become standard diagnostics for multilingual experiments. Accuracy without token count is incomplete. If two models get the same score, but one needs four times as many tokens in Arabic, they are not equivalent systems from a user perspective.
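A sketch of what such a diagnostic could look like, given any tokenizer and a line-aligned parallel corpus. Everything here is illustrative rather than the paper's exact methodology, and tokens per whitespace-separated word is only a crude fertility proxy (it breaks down for scripts without word separators):

```python
from typing import Callable, Dict, List

def tokenizer_diagnostics(
    parallel: Dict[str, List[str]],      # language -> line-aligned sentences
    count_tokens: Callable[[str], int],  # e.g. lambda s: len(enc.encode(s))
    baseline: str = "en",
) -> Dict[str, Dict[str, float]]:
    """Per-language token multiplier vs. a baseline language, plus a crude
    fertility estimate (tokens per whitespace-separated word)."""
    totals = {lang: sum(count_tokens(s) for s in sents)
              for lang, sents in parallel.items()}
    words = {lang: sum(len(s.split()) for s in sents)
             for lang, sents in parallel.items()}
    base = totals[baseline]
    return {
        lang: {
            "multiplier_vs_baseline": totals[lang] / base,
            "tokens_per_word": totals[lang] / max(words[lang], 1),
        }
        for lang in parallel
    }
```

The multiplier column is exactly the kind of number a model card could report per language and domain.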
For deployments and use, the practical advice is simple:
- Audit real token counts before launching a multilingual product.
- Compare providers by language, not only by headline model quality.
- Treat high-volume non-English use cases as tokenizer-sensitive infrastructure decisions.
- Budget by task and language, not just by global average tokens.
- Consider language-specific or language-adaptive models when the multiplier is large.
For the community more broadly: stop treating the tokenizer as an implementation footnote. It is part of the model’s interface to language, and therefore part of the product, the economics, and the fairness story.
And the last, most obvious, and most powerful one for the user: use LLMs only in English (or Chinese). This provides the most bang for the buck.
So where do we go from here
The optimistic reading is that this problem is fixable. The Qwen example shows that targeted design can substantially improve efficiency for a major non-English language. Work on parity-aware and adaptive tokenization suggests that we are not at a theoretical wall [FMPNABS25], [CGWXZZZ24]. Better reporting alone would already change incentives, because the language tax would become visible.
The pessimistic reading is that visibility is exactly what the current ecosystem lacks. Users do not choose tokenizers. They choose models and providers. The tokenizer is bundled into the system, and the bill arrives in tokens. That makes this subtlety easy to miss.
References
[HP26] Haase, J. & Pokutta, S. (2026). The Hidden Cost of Tokenization: Why (most) Non-English Speakers Pay More for Less. Zenodo preprint. doi
[AKGKMT23] Ahia, O., Kumar, S., Gonen, H., Kasai, J., Mortensen, D.R., Smith, N.A., & Tsvetkov, Y. (2023). Do All Languages Cost the Same? Tokenization in the Era of Commercial Language Models. EMNLP 2023. paper
[AFTRL24] Ali, M., Fromm, M., Thellmann, K., Rutmann, R., Lübbering, M., Leveling, J., Klug, K., Ebert, J., Doll, N., Schulze Buschhoff, J., et al. (2024). Tokenizer Choice For LLM Training: Negligible or Crucial? Findings of NAACL 2024. paper
[FMPNABS25] Foroutan, N., Meister, C., Paul, D., Niklaus, J., Ahmadi, S., Bosselut, A., & Sennrich, R. (2025). Parity-Aware Byte-Pair Encoding: Improving Cross-lingual Fairness in Tokenization. arXiv preprint. arxiv
[LLHPBPL24] Liu, N.F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2024). Lost in the Middle: How Language Models Use Long Contexts. Transactions of the Association for Computational Linguistics, 12, 157–173. doi
[BLPT23] Bibi, A., La Malfa, E., Petrov, A., & Torr, P. (2023). Language Model Tokenizers Introduce Unfairness Between Languages. NeurIPS 2023. paper
[SS24] Singh, A.K. & Strouse, D.J. (2024). Tokenization Counts: The Impact of Tokenization on Arithmetic in Frontier LLMs. arXiv preprint. arxiv
[SVKD24] Solatorio, A.V., Vicente, G.S., Krambeck, H., & Dupriez, O. (2024). Double Jeopardy and Climate Impact in the Use of Large Language Models: Socio-economic Disparities and Reduced Utility for Non-English Speakers. arXiv preprint. arxiv
[CGWXZZZ24] Chen, H., Guo, T., Wang, Y., Xu, C., Zheng, B., Zheng, M., & Zhu, C. (2024). Enhancing Large Language Models through Adaptive Tokenizers. NeurIPS 2024. paper