Every request recomputes everything. I built an open-source fix for the cases prefix caching misses.
Every LLM request at your company recomputes everything. System prompt, retrieved documents, few-shot examples — the model reads them fresh, every time. Doesn't matter if those exact tokens showed up in the last million requests. No memory. Full price.
The industry has a partial answer: prefix caching. If requests share a leading prefix token-for-token, vLLM and SGLang will skip that recomputation. It works well when your shared content always sits at the very front. It does nothing when a retrieved document shows up at position 200, or when a system prompt gets shuffled by dynamic context injection, or when two users share common boilerplate that happens not to be the very first tokens.
I got tired of paying the tax where prefix caching doesn't apply. So I built KVBoost. Source: github.com/pythongiant/kvboost, MIT licensed, open for contributors.
The core mechanism: split the prompt into 128-token chunks. Give each chunk two hash keys — one encoding content plus positional context (for exact reuse), one encoding content alone (for approximate reuse regardless of where it appears in the prompt). When a request arrives, check both stores before touching the GPU for those tokens.
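A minimal sketch of the dual-key idea. The function names, the `repr`-based serialization, and the hash chaining are my illustration, not KVBoost's actual implementation; the point is that the prefix key changes whenever anything before the chunk changes, while the content key doesn't:

```python
import hashlib

CHUNK_TOKENS = 128  # chunk size from the post

def chunk(tokens, size=CHUNK_TOKENS):
    """Split a token-id sequence into fixed-size chunks."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def content_hash(chunk_tokens):
    """Position-independent key: depends on the chunk's tokens alone,
    so the same content matches at any offset in any prompt."""
    return hashlib.sha256(repr(chunk_tokens).encode()).hexdigest()

def prefix_hash(chunk_tokens, parent=""):
    """Position-dependent key: chains in the hash of everything before
    this chunk, so it matches only on an identical full prefix."""
    return hashlib.sha256((parent + repr(chunk_tokens)).encode()).hexdigest()
```

Two prompts that share a chunk at different offsets get equal content hashes but different prefix hashes, which is exactly the split between the two reuse tiers.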
The result: 4.49× mean time-to-first-token speedup over full recomputation, and 16% faster than vLLM prefix caching, on 1,000 samples from a bug-localization benchmark using Qwen2.5-3B. Exact-match accuracy: 99.2% vs 99.1% for both baselines.
Two problems nearly killed this.
The first is attention seams. When you stitch cached chunks together, the boundary tokens were originally encoded while attending to a different preceding context. Reusing their KV tensors verbatim produces silent attention errors — the kind that don't crash, don't appear in logs, and quietly degrade output quality.
I implemented two repair strategies. The spatial one (SelectiveRecompute) re-encodes the last 16 tokens before each seam. The deviation-guided one (CacheBlendRecompute, adapted from the CacheBlend paper) does a cheap probe pass, measures per-token cosine deviation, and recomputes only the tokens that actually changed — about 15% of the prompt. The deviation-guided approach catches mid-chunk errors the spatial window misses entirely, and it's mandatory for any approximate content match.
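A toy version of the deviation-guided selection. The 0.98 threshold and the pure-Python cosine over plain lists are illustrative only — the real CacheBlendRecompute operates on model hidden states from the probe pass:

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return dot / (norm_u * norm_v)

def tokens_to_recompute(cached_states, probe_states, threshold=0.98):
    """Return indices of tokens whose probe-pass state deviates from the
    cached state by more than the cosine threshold; only these get a
    full re-encode, the rest reuse their cached KV tensors."""
    return [i for i, (c, p) in enumerate(zip(cached_states, probe_states))
            if cosine(c, p) < threshold]
```

The spatial strategy would instead always pick the last 16 indices before each seam; the deviation-guided one can flag a token anywhere in the chunk.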
The second problem is RoPE position collisions. Key vectors carry their positional encoding baked in. A chunk cached at positions 0–128 is wrong if reused at positions 512–640. The dual-hash scheme handles this: the prefix hash encodes content plus position context (exact reuse, no correction needed), while the content hash encodes content alone (approximate reuse, mandatory deviation repair). Two lookup tiers, two different trust levels.
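The two lookup tiers can be sketched as a single function. The store layout and the tier labels are my assumption, not KVBoost's actual API; what matters is the ordering and the trust level attached to each hit:

```python
def lookup(prefix_key, content_key, exact_store, approx_store):
    """Two-tier cache lookup: an exact (position-aware) hit is reused
    verbatim; a content-only hit must pass through deviation repair;
    otherwise the chunk is recomputed on the GPU."""
    if prefix_key in exact_store:
        return exact_store[prefix_key], "exact"          # reuse as-is
    if content_key in approx_store:
        return approx_store[content_key], "needs_repair"  # mandatory repair
    return None, "miss"                                   # full recompute
```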
Beyond the core caching engine, a few things make this production-usable rather than a research toy.
Asymmetric KIVI-style quantization cuts cache memory by about half (int8) or 4× (int4) with negligible quality impact. It uses per-channel quantization for keys and per-token quantization for values, which matches how outliers actually distribute across those tensors.
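A pure-Python sketch of asymmetric min/max quantization — per-row for values (per-token) and per-column for keys (per-channel). Real KIVI-style kernels work on tensors with group sizes and fused dequantization, none of which is modeled here:

```python
def quantize_rows(mat, bits=8):
    """Asymmetric per-row quantization: each row gets its own scale and
    zero point, matching per-token treatment of value tensors."""
    qmax = (1 << bits) - 1
    out = []
    for row in mat:
        lo, hi = min(row), max(row)
        scale = (hi - lo) / qmax if hi > lo else 1.0
        out.append(([round((x - lo) / scale) for x in row], scale, lo))
    return out

def dequantize_rows(quantized):
    """Invert quantize_rows (up to rounding error)."""
    return [[q * scale + lo for q in row] for row, scale, lo in quantized]

def quantize_cols(mat, bits=8):
    """Per-column quantization for key tensors: transpose, quantize
    row-wise, and store column-major with per-channel scale/zero."""
    return quantize_rows([list(col) for col in zip(*mat)], bits)
```

The asymmetry is the whole point: outliers in keys cluster by channel, outliers in values cluster by token, so each tensor gets the axis that isolates its outliers.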
Importance-weighted LRU eviction scores chunks by their ℓ2 key norm before deciding what to evict. A system prompt with high attention signal outlives filler tokens under memory pressure, rather than being evicted purely by recency.
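To illustrate, here is a toy eviction scorer. The additive recency-plus-norm score is a made-up weighting for demonstration; KVBoost's actual scoring function may combine these signals differently:

```python
import math

def evict_candidate(chunks):
    """chunks maps chunk_id -> (last_used_step, key_vector as floats).
    Evict the chunk with the lowest combined score, so a high-norm
    system prompt outlives low-norm filler of the same age."""
    def score(entry):
        last_used, key = entry
        l2_norm = math.sqrt(sum(x * x for x in key))
        return last_used + l2_norm  # illustrative weighting only
    return min(chunks, key=lambda cid: score(chunks[cid]))
```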
An optional disk-tier overflow stores cold chunks as memory-mapped files. Retrieval takes 10–50ms per chunk, compared to 100–500ms for GPU recompute. Not free, but substantially cheaper.
Adaptive chunk boundary splitting nudges split points toward punctuation so seams fall at natural linguistic breaks. Batch generation loads the shared chunk prefix once across a batch and broadcasts it zero-copy via torch.Tensor.expand. The API is two methods: warm(text) pre-loads documents into cache, generate(prompt) uses it.
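The boundary-nudging idea can be sketched in a few lines. The punctuation set and the backward search window are illustrative choices, not KVBoost's exact heuristics:

```python
def adaptive_boundary(token_texts, target=128, window=8):
    """Pick a split point at or just before `target` that lands right
    after a punctuation token, searching up to `window` tokens backward;
    fall back to the hard cut if no punctuation is nearby."""
    punctuation = {".", ",", ";", ":", "!", "?", "\n"}
    for i in range(target, max(target - window, 0), -1):
        if token_texts[i - 1] and token_texts[i - 1][-1] in punctuation:
            return i
    return target
```

Seams that fall at sentence or clause breaks give the repair pass less cross-boundary attention to fix.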
Compatibility covers Qwen2, LLaMA, Mistral, Mixtral, Gemma, Phi, and anything else using RoPE with HuggingFace’s past_key_values interface. ALiBi and learned absolute embeddings are not supported — they require different position correction mechanisms that aren’t implemented yet.
The 4.49× speedup is real on this workload. The 16% advantage over vLLM reflects two things: KVBoost avoids PagedAttention’s per-page overhead, and its repair strategy means it can safely reuse chunks that prefix caching would have to leave alone.
The accuracy numbers are clean: seam repair doesn't hurt quality on bug localization. What I haven't tested yet: long-form generation, code completion, multi-document summarization. Those would tell a more complete story, and the benchmark suite in the repo is there if you want to run them.
Two known rough edges: single-GPU only (multi-GPU tensor parallelism isn’t implemented), and for very long cached contexts (over 8K tokens) the CacheBlend probe pass itself gets expensive — a threshold-based activation would fix it, but it’s not shipped yet.
RAG pipelines are the obvious case. Every retrieved document chunk is a cache candidate. Pre-warm the document store once; pay for encoding amortized across thousands of queries instead of per-query. Multi-turn chat benefits similarly — system prompts and early conversation turns get cached and reused across users sharing the same context.
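To make the amortization concrete, here is a toy stand-in for the flow. `ToyCache` is not the KVBoost API — only the `warm`/`generate` method names come from the post — but it shows why the expensive encode happens once per document rather than once per query:

```python
class ToyCache:
    """Toy model of the warm-once, query-many pattern: encoding a
    document is paid at warm time; every later query reuses the entry."""

    def __init__(self):
        self.store = {}
        self.encodes = 0  # counts expensive "GPU encode" events

    def warm(self, doc):
        if doc not in self.store:
            self.encodes += 1          # the costly encode happens here
            self.store[doc] = f"kv({doc})"

    def generate(self, doc, query):
        self.warm(doc)                 # cache hit: no re-encode
        return f"answer({query})"
```

With a thousand queries against one pre-warmed document, the encode count stays at one, which is the whole economic argument for RAG pipelines.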
As context windows grow from 128K to 1M tokens, the cost of recomputing shared content scales quadratically. Prefix caching handles that cost only when your architecture happens to front-load all shared content. Most production architectures don’t.
The core idea — that where content appears in a prompt shouldn’t determine whether its KV tensors can be reused — sounds obvious in retrospect. Making it work without quality regression means solving the position encoding problem and the attention seam problem together. I think KVBoost is the first open-source implementation that does both in a single usable package.
The repo is at github.com/pythongiant/kvboost, MIT licensed. Issues are open. Benchmark suite and experiment scripts are in benchmarks_and_experiments/important/.
I'm actively looking for help. If the problem interests you — whether that means opening a PR, stress-testing the benchmarks on a different model or task, or just asking hard questions about the architecture — find me on GitHub.