The LLM inference tax nobody talks about

Every request recomputes everything. I built an open-source fix for the cases prefix caching misses.

Srihari Unnikrishnan · Independent research · 2026

Every LLM request at your company recomputes everything. System prompt, retrieved documents, few-shot examples — the model reads them fresh, every time. Doesn't matter if those exact tokens showed up in the last million requests. No memory. Full price.

The industry has a partial answer: prefix caching. If requests share a leading prefix token-for-token, vLLM and SGLang will skip that recomputation. It works well when your shared content always sits at the very front. It does nothing when a retrieved document shows up at position 200, or when a system prompt gets shuffled by dynamic context injection, or when two users share common boilerplate that happens not to be the very first tokens.

I got tired of paying the tax where prefix caching doesn't apply. So I built KVBoost. Source: github.com/pythongiant/kvboost, MIT licensed, open for contributors.

The idea

Split the prompt into 128-token chunks. Give each chunk two hash keys — one encoding content plus positional context (for exact reuse), one encoding content alone (for approximate reuse regardless of where it appears in the prompt). When a request arrives, check both stores before touching the GPU for those tokens.
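
A minimal sketch of that keying scheme, in Python. The 128-token chunk size and the two-key idea come from the description above; the hash construction and function names here are illustrative, not KVBoost's actual internals.

```python
# Dual-hash chunk keying (sketch). One key chains in everything before the
# chunk (exact reuse), the other depends on the chunk's content alone
# (position-independent, approximate reuse).
import hashlib

CHUNK_SIZE = 128  # tokens per chunk, as described above

def chunk_tokens(token_ids):
    """Split a token-id list into fixed-size chunks."""
    return [token_ids[i:i + CHUNK_SIZE] for i in range(0, len(token_ids), CHUNK_SIZE)]

def content_hash(chunk):
    """Position-independent key: same content hashes the same anywhere in the prompt."""
    return hashlib.sha256(str(chunk).encode("utf-8")).hexdigest()

def prefix_hash(chunk, preceding_hash):
    """Position-dependent key: chains in the preceding context, so a hit means
    the cached KV tensors are exactly right as-is."""
    return hashlib.sha256((preceding_hash + content_hash(chunk)).encode("utf-8")).hexdigest()

def index_prompt(token_ids):
    """Return (prefix_key, content_key) pairs for every chunk of a prompt."""
    keys, running = [], ""
    for chunk in chunk_tokens(token_ids):
        running = prefix_hash(chunk, running)
        keys.append((running, content_hash(chunk)))
    return keys
```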

[Figure: chunk-level KV cache reuse, prefix caching vs KVBoost. Prefix caching reuses 2/4 chunks after a mid-prompt miss; KVBoost reuses 3/4, including one approximate hit.]
Figure 1. Prefix caching (left) fails when a miss appears mid-prompt — all subsequent chunks are blocked even if their content is cached elsewhere. KVBoost (right) reuses any matching chunk regardless of position, including approximate content matches (C4).

The result: 4.49× mean time-to-first-token speedup over full recomputation, and 16% faster than vLLM prefix caching, on 1,000 samples from a bug-localization benchmark using Qwen2.5-3B. Exact-match accuracy: 99.2% vs 99.1% for both baselines.

4.49× speedup vs. full recompute (mean TTFT)
16% faster than vLLM prefix caching
99.2% exact-match accuracy, vs 99.1% baseline

The hard part

Two problems nearly killed this.

The first is attention seams. When you stitch cached chunks together, boundary tokens attended to a different preceding context when they were originally encoded. Reusing those KV tensors verbatim produces silent attention errors — the kind that don't crash, don't appear in logs, and quietly degrade output quality.

I implemented two repair strategies. The spatial one (SelectiveRecompute) re-encodes the last 16 tokens before each seam. The deviation-guided one (CacheBlendRecompute, adapted from the CacheBlend paper) does a cheap probe pass, measures per-token cosine deviation, and recomputes only the tokens that actually changed — about 15% of the prompt. The deviation-guided approach catches mid-chunk errors the spatial window misses entirely, and it's mandatory for any approximate content match.
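
Here is roughly what the deviation-guided selection step looks like. The ~15% budget and the cosine-deviation criterion come from the description above; the tensor shapes and function names are simplified assumptions, not the repo's code.

```python
# Deviation-guided repair selection (sketch). `probe_keys` come from a cheap
# forward pass over the stitched prompt; `cached_keys` are the reused tensors.
# Both are [num_tokens, head_dim], flattened over layers/heads for simplicity.
import torch

def tokens_to_recompute(cached_keys: torch.Tensor,
                        probe_keys: torch.Tensor,
                        budget: float = 0.15) -> torch.Tensor:
    """Return indices of the tokens whose cached keys deviate most (by cosine
    distance) from what a full-context forward pass would produce."""
    cos = torch.nn.functional.cosine_similarity(cached_keys, probe_keys, dim=-1)
    deviation = 1.0 - cos                    # 0 = identical, larger = more drift
    k = max(1, int(budget * deviation.numel()))
    return torch.topk(deviation, k).indices  # ~15% worst tokens get re-encoded
```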

[Figure: seam error and repair at chunk boundaries. SelectiveRecompute uses a fixed 16-token boundary window; CacheBlendRecompute picks the top ~15% of tokens by cosine deviation. Both correct attention context before the main forward pass.]
Figure 2. The repair window straddles each chunk seam, spanning portions of both cached chunks. KVBoost patches only the tokens that actually deviate from what a full-context forward pass would produce.

The second problem is RoPE position collisions. Key vectors carry their positional encoding baked in. A chunk cached at positions 0–128 is wrong if reused at positions 512–640. The dual-hash scheme handles this: the prefix hash encodes content plus position context (exact reuse, no correction needed), while the content hash encodes content alone (approximate reuse, mandatory deviation repair). Two lookup tiers, two different trust levels.
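
In lookup terms, the two tiers behave something like the sketch below. The store names and the ChunkHit record are invented for illustration; only the trust split (exact hits reused as-is, content-only hits flagged for mandatory repair) is from the design above.

```python
# Two-tier lookup (sketch). Exact hits are trusted verbatim; content-only hits
# landed at a different position, so their RoPE-encoded keys need repair.
from dataclasses import dataclass

@dataclass
class ChunkHit:
    kv: object          # cached (key, value) tensors for this chunk
    needs_repair: bool  # True for position-mismatched, content-only hits

def lookup(prefix_key, content_key, exact_store, content_store):
    if prefix_key in exact_store:        # tier 1: content + position match
        return ChunkHit(exact_store[prefix_key], needs_repair=False)
    if content_key in content_store:     # tier 2: same content, wrong position
        return ChunkHit(content_store[content_key], needs_repair=True)
    return None                          # miss: encode on the GPU, cache both keys
```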

What’s in the repo

Beyond the core caching engine, a few things make this production-usable rather than a research toy.

Asymmetric KIVI-style quantization cuts cache memory by about half (int8) or 4× (int4) with negligible quality impact. It uses per-channel quantization for keys and per-token quantization for values, which matches how outliers actually distribute across those tensors.
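
A toy int8 version of that asymmetric layout, to make the per-channel vs. per-token distinction concrete. KVBoost's actual kernels, group sizes, and int4 packing will differ.

```python
# Asymmetric (zero-point) int8 quantization sketch: per-channel scales for
# keys, per-token scales for values. Tensors are [num_tokens, head_dim].
import torch

def quantize(x: torch.Tensor, dim: int):
    """Asymmetric int8 quantization with min/max computed along `dim`."""
    x_min = x.amin(dim=dim, keepdim=True)
    x_max = x.amax(dim=dim, keepdim=True)
    scale = (x_max - x_min).clamp(min=1e-8) / 255.0
    q = torch.round((x - x_min) / scale).clamp(0, 255).to(torch.uint8)
    return q, scale, x_min

def dequantize(q, scale, x_min):
    return q.to(torch.float32) * scale + x_min

keys = torch.randn(128, 64)    # [tokens, channels]
values = torch.randn(128, 64)
qk = quantize(keys, dim=0)     # per-channel: key outliers cluster in channels
qv = quantize(values, dim=1)   # per-token: value outliers are token-local
```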

Importance-weighted LRU eviction scores chunks by their ℓ2 key norm before deciding what to evict. A system prompt with high attention signal outlives filler tokens under memory pressure, rather than being evicted purely by recency.
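
Sketched as code, the eviction policy might combine recency with the mean ℓ2 key norm like this. The exact weighting is an assumption; only the idea that norm-based importance modulates LRU comes from the description above.

```python
# Importance-weighted LRU eviction (sketch): low-norm, long-idle chunks go first.
import time
import torch

class Chunk:
    def __init__(self, keys: torch.Tensor):
        self.keys = keys
        self.last_used = time.monotonic()
        self.importance = keys.norm(dim=-1).mean().item()  # mean L2 key norm

def pick_victim(chunks: dict) -> str:
    """Return the cache key of the chunk with the lowest importance-per-idle-time score."""
    now = time.monotonic()
    def score(c: Chunk) -> float:
        age = now - c.last_used + 1.0
        return c.importance / age   # filler tokens with low norms lose to system prompts
    return min(chunks, key=lambda k: score(chunks[k]))
```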

An optional disk-tier overflow stores cold chunks as memory-mapped files. Retrieval takes 10–50ms per chunk, compared to 100–500ms for GPU recompute. Not free, but substantially cheaper.
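
A minimal version of that spill-and-restore path using PyTorch's memory-mapped loading. The directory layout and function names are illustrative, not KVBoost's on-disk format.

```python
# Cold-tier spill to disk (sketch). torch.load(..., mmap=True) maps the file
# instead of eagerly copying it into RAM; tensors page in on access.
import os
import torch

DISK_DIR = "kv_disk_tier"   # assumed location, not a KVBoost constant
os.makedirs(DISK_DIR, exist_ok=True)

def spill(chunk_key: str, kv) -> None:
    """Write a chunk's (key, value) tensors to the disk tier."""
    k, v = kv
    torch.save({"k": k.cpu(), "v": v.cpu()}, os.path.join(DISK_DIR, f"{chunk_key}.pt"))

def restore(chunk_key: str, device: str = "cuda"):
    """Memory-map a cold chunk back in and move it to the target device."""
    blob = torch.load(os.path.join(DISK_DIR, f"{chunk_key}.pt"),
                      map_location="cpu", mmap=True)
    return blob["k"].to(device), blob["v"].to(device)
```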

Adaptive chunk boundary splitting nudges split points toward punctuation so seams fall at natural linguistic breaks. Batch generation loads the shared chunk prefix once across a batch and broadcasts it zero-copy via torch.Tensor.expand. The API is two methods: warm(text) pre-loads documents into cache, generate(prompt) uses it.
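
Putting that two-method API together, usage looks roughly like this. The import path and constructor arguments are guesses, the file paths are placeholders; only warm() and generate() are the entry points described above.

```python
# Usage sketch, assuming a KVBoost-style engine class. Check the repo's README
# for the real import path and constructor signature.
from kvboost import KVBoost  # hypothetical import

engine = KVBoost(model="Qwen/Qwen2.5-3B-Instruct")  # assumed constructor argument

SYSTEM_PROMPT = "You are a release engineer assistant."
retrieved_doc = open("docs/runbook.md").read()       # placeholder document

# Pay the encoding cost once, up front, for content many prompts will share.
engine.warm(SYSTEM_PROMPT)
engine.warm(retrieved_doc)

# Later requests reuse the cached chunks wherever they appear in the prompt.
answer = engine.generate(
    f"{SYSTEM_PROMPT}\n\n{retrieved_doc}\n\nQ: Why does the deploy fail on staging?"
)
```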

Compatibility covers Qwen2, LLaMA, Mistral, Mixtral, Gemma, Phi, and anything else using RoPE with HuggingFace’s past_key_values interface. ALiBi and learned absolute embeddings are not supported — they require different position correction mechanisms that aren’t implemented yet.

The numbers, honestly

[Figure: mean TTFT. Full recompute 639 ms; vLLM prefix cache 165 ms; KVBoost 142 ms.]
Figure 3. Mean time-to-first-token across 1,000 samples (Qwen2.5-3B, RTX 4060). KVBoost’s advantage over vLLM is largest at short context lengths (1.53× in the 0–500 token bucket) and narrows as context grows.

The 4.49× speedup is real on this workload. The 16% advantage over vLLM reflects two things: KVBoost avoids PagedAttention’s per-page overhead, and its repair strategy means it can safely reuse chunks that prefix caching would have to leave alone.

The accuracy numbers are clean. Seam repair doesn’t hurt quality on bug localization. What I haven’t tested: long-form generation, code completion, multi-document summarization. Those benchmarks would tell a more complete story. I haven’t run them yet, and the benchmark suite in the repo is there if you want to.

Two known rough edges: single-GPU only (multi-GPU tensor parallelism isn’t implemented), and for very long cached contexts (over 8K tokens) the CacheBlend probe pass itself gets expensive — a threshold-based activation would fix it, but it’s not shipped yet.

Who should care

RAG pipelines are the obvious case. Every retrieved document chunk is a cache candidate. Pre-warm the document store once; pay for encoding amortized across thousands of queries instead of per-query. Multi-turn chat benefits similarly — system prompts and early conversation turns get cached and reused across users sharing the same context.

As context windows grow from 128K to 1M tokens, the cost of recomputing shared content grows quadratically with prompt length, since prefill attention is O(n²). Prefix caching handles that cost only when your architecture happens to front-load all shared content. Most production architectures don't.

The core idea — that where content appears in a prompt shouldn’t determine whether its KV tensors can be reused — sounds obvious in retrospect. Making it work without quality regression means solving the position encoding problem and the attention seam problem together. I think KVBoost is the first open-source implementation that does both in a single usable package.


Come build this

The repo is at github.com/pythongiant/kvboost, MIT licensed. Issues are open. Benchmark suite and experiment scripts are in benchmarks_and_experiments/important/.

Things I'm actively looking for help with: multi-GPU tensor parallelism, a threshold-based activation for the CacheBlend probe at long contexts, and benchmark coverage beyond bug localization (long-form generation, code completion, multi-document summarization).

If the problem interests you — whether that means opening a PR, stress-testing the benchmarks on a different model or task, or just asking hard questions about the architecture — find me on GitHub.