pip install kvboost
KVBoost
Faster LLM Inference.
Less VRAM. No Model Changes.
Chunk-level KV cache reuse  ·  FlashAttention-2  ·  AWQ layer streaming  ·  CPU paged decoding
The Problem
LLM inference is broken by default.
🧱
VRAM Walls
Modern LLMs like Qwen2.5-32B need 60+ GB of VRAM for the weights alone at 16-bit precision, out of reach for most teams.
🐢
Slow Prefill
Repeated system prompts are recomputed from scratch on every single request — wasting GPU cycles constantly.
🔧
HF Bottlenecks
HuggingFace's default inference loop has no cross-request KV cache reuse, no chunked attention, and no memory-efficient decoding.
The Solution
KVBoost: drop-in, no rewrites.
Python
from kvboost import KVBoost

engine = KVBoost.from_pretrained(
  "Qwen/Qwen2.5-3B"
)

# Warm a shared prefix once
engine.warm("You are a helpful assistant...")

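# All subsequent calls reuse cache
# (placeholder prompt; it starts with the warmed prefix)
prompt = "You are a helpful assistant... <user question>"
result = engine.generate(prompt)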

print(result.kv_reuse_ratio)  # ✓ 80%+
KV Cache Reuse
Chunk-level cache reuse eliminates redundant prefill for shared prompts.
🚀
FlashAttention-2
Memory-efficient attention with 3–5× Time to First Token (TTFT) speedup vs vanilla HuggingFace.
💾
AWQ Layer Streaming
Run 32B+ models on a 12 GB GPU via pinned-host weight streaming (8.76 GB peak VRAM measured).
🗄️
CPU Paged Decoding
Spill KV cache to CPU RAM — handle long contexts without OOM errors.
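Why long contexts run out of memory is easy to see with a rough size estimate. The sketch below is not part of KVBoost; the function name and the model shape (32 layers, 8 KV heads, head dimension 128, 16-bit values) are hypothetical values picked only to make the arithmetic clean.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to hold keys and values for num_tokens across all layers."""
    # Factor of 2: one tensor for keys and one for values in every layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


# Hypothetical model shape, chosen for illustration only.
per_token = kv_cache_bytes(1, num_layers=32, num_kv_heads=8, head_dim=128)
long_ctx = kv_cache_bytes(32_768, num_layers=32, num_kv_heads=8, head_dim=128)

print(f"{per_token / 2**10:.0f} KiB per token")
print(f"{long_ctx / 2**30:.1f} GiB for a 32k-token context")
```

With this shape the cache grows linearly with context length, so on a small GPU a long context competes with the weights for VRAM. That is the pressure CPU paging is meant to relieve.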
Performance
Real numbers. Real hardware.
500-conversation ShareGPT replay · Qwen2.5-3B · RTX 4060 Laptop (8 GB VRAM) · history accumulates turn-by-turn as in a real session
4.59×
TTFT Speedup
at Turn 8
vs no-cache baseline · ~1,100 context tokens · measured, not extrapolated
20 ms
TTFT p50
Near-flat across turns
Per-turn TTFT rises only from 17 ms at turn 1 to 27 ms at turn 8, while the baseline grows from 19 ms to 122 ms
86.1%
Avg KV Reuse
vs vLLM 70.6%
Chunk-level matching recovers reuse that exact-prefix caching misses
99.2%
Task Accuracy
WARM = COLD
500-sample bug-localization eval · no measurable accuracy loss

TTFT by Turn: KVBoost stays near-flat, baseline grows with context length

Turn 1 · KVBoost 17 ms · Baseline 19 ms
Turn 4 · KVBoost 21 ms · Baseline 49 ms
Turn 8 · KVBoost 27 ms · Baseline 122 ms (4.59× slower)
ShareGPT replay · Qwen2.5-3B · 500 conversations · history grows naturally each turn. Baseline = HuggingFace AutoModelForCausalLM with no cache reuse.

KV Reuse vs vLLM Prefix Cache — chunk matching beats exact-prefix

Turn 2 · KVBoost 96.9% · vLLM 76.3%
Turn 4 · KVBoost 99.4% · vLLM 91.9%
Turn 8 · KVBoost 99.6% · vLLM 95.9%
Same 500-conversation replay against vLLM with enable_prefix_caching=True. vLLM's prefix cache only hits when the leading tokens match exactly; KVBoost matches at chunk granularity with a boundary-alignment window, recovering reuse when new assistant tokens shift the prefix.
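The gap between exact-prefix and chunk-level matching can be illustrated with a toy sketch. It is not KVBoost's or vLLM's implementation: all function names are hypothetical, the chunk size is tiny so the example is easy to trace, and the boundary-alignment window is not modeled. Prefix keys chain each chunk's hash to everything before it; chunk keys depend only on the chunk's own tokens.

```python
import hashlib

CHUNK = 4  # tokens per chunk; deliberately tiny for readability


def chunks(tokens, size=CHUNK):
    """Split a token list into full fixed-size chunks (a trailing partial chunk is dropped)."""
    return [tuple(tokens[i:i + size]) for i in range(0, len(tokens) - size + 1, size)]


def digest(data):
    """Deterministic hash of a Python value via its repr."""
    return hashlib.sha256(repr(data).encode()).hexdigest()


def prefix_keys(tokens):
    """Exact-prefix keys: each key covers the chunk and every chunk before it."""
    keys, prev = [], ""
    for chunk in chunks(tokens):
        prev = digest((prev, chunk))
        keys.append(prev)
    return keys


def chunk_keys(tokens):
    """Content-only keys: each key depends on the chunk's own tokens."""
    return [digest(chunk) for chunk in chunks(tokens)]


def reuse_ratio(cached_keys, new_keys):
    """Fraction of the new prompt's chunks that are already cached."""
    cached = set(cached_keys)
    return sum(key in cached for key in new_keys) / len(new_keys)


old = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
# Same prompt, but the second chunk has changed.
new = [1, 2, 3, 4, 50, 60, 70, 80, 9, 10, 11, 12, 13, 14, 15, 16]

print("prefix reuse:", reuse_ratio(prefix_keys(old), prefix_keys(new)))
print("chunk reuse: ", reuse_ratio(chunk_keys(old), chunk_keys(new)))
```

In this example one changed chunk invalidates every later prefix key, while content-only keys still hit on the unchanged chunks. Reusing K/V that was computed under a different preceding context is not free, though; that is presumably where the seam repair listed in the roadmap comes in.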
How It Works
Four layers of optimization.
01
Hash Chunks
Incoming prompt is split into chunks. Each chunk is hashed to look up prior cached K/V pairs.
02
Reuse Cache
Matching chunks skip prefill recomputation: their cached K/V pairs are loaded directly, and only novel tokens are forwarded through the transformer.
03
Flash Attention
New tokens run FlashAttention-2: tiled CUDA kernels whose memory use is O(N) in sequence length instead of the O(N²) of standard attention. No custom model code needed.
04
Page Offload
Long-context KV blocks are evicted to CPU RAM via async DMA — enabling contexts beyond GPU VRAM.
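The page-offload step above can be sketched as a two-tier store with least-recently-used eviction. This is a minimal illustration under stated assumptions, not KVBoost's code: the class name and methods are hypothetical, plain dictionaries stand in for GPU and pinned-host tensors, and the asynchronous copy is reduced to a synchronous move.

```python
from collections import OrderedDict


class PagedKVStore:
    """Toy two-tier store: a small 'GPU' tier that evicts cold blocks to a 'CPU' tier."""

    def __init__(self, gpu_capacity):
        self.gpu_capacity = gpu_capacity
        self.gpu = OrderedDict()  # block_id -> payload, least recently used first
        self.cpu = {}             # block_id -> payload for evicted blocks

    def put(self, block_id, payload):
        """Insert or refresh a block in the GPU tier, evicting cold blocks if needed."""
        self.gpu[block_id] = payload
        self.gpu.move_to_end(block_id)
        self._evict()

    def get(self, block_id):
        """Return a block, paging it back in from the CPU tier on a miss."""
        if block_id in self.gpu:
            self.gpu.move_to_end(block_id)
            return self.gpu[block_id]
        payload = self.cpu.pop(block_id)  # raises KeyError for unknown blocks
        self.put(block_id, payload)
        return payload

    def _evict(self):
        """Move least recently used blocks to the CPU tier until under capacity."""
        while len(self.gpu) > self.gpu_capacity:
            cold_id, cold_payload = self.gpu.popitem(last=False)
            self.cpu[cold_id] = cold_payload


store = PagedKVStore(gpu_capacity=2)
for i in range(4):
    store.put(i, f"kv-block-{i}")

print("resident:", list(store.gpu), "offloaded:", sorted(store.cpu))
print("fetch 0 :", store.get(0))
print("resident:", list(store.gpu), "offloaded:", sorted(store.cpu))
```

A real implementation would overlap the host-device copies with compute, as the async DMA description implies; the eviction bookkeeping is the part this sketch shows.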
AWQ Streaming + Speculative
Run a 32B model on a 12 GB GPU — faster than llama.cpp.
  Terminal
$ python -m kvboost.streaming.demo_speculative \
  --model Qwen/Qwen2.5-32B-Instruct-AWQ \
  --draft-model Qwen/Qwen2.5-1.5B-Instruct-AWQ \
  --keep-first-k 9 --keep-last-k 9 --gamma 5

--- speculative stats ---
  avg_committed/round:  3.00
  decode_tok/s:       2.79
  target.hit_rate:    1.000

--- vs llama.cpp (same model + draft, -ngl 20) ---
  llama.cpp:    1.9 tok/s
  KVBoost:      2.79 tok/s (1.47×)
1.47×
Decode-throughput vs llama.cpp speculative on the same 32B + 1.5B AWQ pair, same residency budget (18 resident layers), same RTX 3060 12 GB. Marlin INT4 tensor-core GEMM + async layer prefetch are the difference.
8.76 GB
Peak VRAM loading Qwen2.5-32B-AWQ (~19 GB packed). Streamed projections live in pinned host RAM; only the resident layers + active staging slot occupy VRAM at any moment.
2.79 tok/s
Decode-only throughput, greedy-equivalent. avg_committed/round = 3.00: on average, one target forward pass commits three tokens instead of one. Realistic range: ~2–5 tok/s on Ampere+, ~0.5 tok/s on Turing (GEMM-bound).
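The headline figures can be cross-checked with a few lines of arithmetic. The inputs are copied from the benchmark output above; the one assumption is that decode_tok/s counts committed tokens per wall-clock second, so that dividing by avg_committed/round gives speculative rounds per second.

```python
# Figures copied from the benchmark output above.
decode_tok_s = 2.79     # KVBoost decode throughput
llama_cpp_tok_s = 1.9   # llama.cpp throughput on the same model pair
avg_committed = 3.00    # tokens committed per speculative round

speedup = decode_tok_s / llama_cpp_tok_s
# Assumption: throughput counts committed tokens per second of wall-clock time.
rounds_per_s = decode_tok_s / avg_committed
seconds_per_round = 1 / rounds_per_s

print(f"speedup vs llama.cpp: {speedup:.2f}x")
print(f"rounds per second:    {rounds_per_s:.2f}")
print(f"seconds per round:    {seconds_per_round:.2f}")
```

Under that assumption, each round of draft-plus-verify takes a little over a second on this setup, which is consistent with a workload dominated by streaming layer weights over PCIe.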
Use Cases
Who needs KVBoost?
💻
AI Coding Assistants
System prompts are reused across hundreds of requests. Cache the context once and cut time to first token by 3–5× on every response.
📚
RAG Pipelines
Document chunks appear in many queries. Chunk-level reuse makes multi-document QA dramatically faster.
⚙️
Edge / Budget Infra
AWQ streaming lets teams deploy 30B+ models on consumer GPUs — no $10K A100 required.
💬
Multi-Turn Chatbots
Conversation history grows each turn. CPU paged decoding handles long contexts without OOM crashes.
MIT Licensed  ·  Drop-in with HuggingFace Transformers  ·  No fine-tuning, no architecture changes
Technology
Built on solid foundations.
FlashAttention-2
Tiled CUDA kernels for attention with O(N) memory in sequence length
AWQ (Activation-aware Weight Quantization)
Weight-only 4-bit quantization preserving accuracy
HuggingFace Transformers
Drop-in compatibility — no model changes required
CUDA DMA Streams
Async PCIe transfers for layer-by-layer weight streaming
Chunk Hashing
Deterministic token-level hashing for cache lookup
CPU Paged Memory
Page-table KV offload — evict cold blocks to RAM
PyPI Package
pip install kvboost — ready in 2 minutes
MIT License
Fully open source, production-ready for any use
Roadmap
What's next.
Now ✅
✓  Chunk-level KV reuse + CacheBlend seam repair
✓  FlashAttention-2 CUDA kernel (sm_70 → sm_90)
✓  AWQ layer streaming (run 32B on 12 GB)
✓  Speculative decoding (1.47× vs llama.cpp)
✓  CPU paged decoding
✓  OpenAI-compatible HTTP server
Next 🔨
◦  Multi-GPU tensor parallel
◦  LoRA adapter hot-swap
◦  Continuous batching
◦  Agent-trace benchmark suite
Future 🔭
◦  GGUF / GGML support
◦  Triton custom kernels
◦  Distributed KV cache
◦  Cloud-hosted cache tier
Cost Impact
What you stop paying for.
Every request that shares a prefix with a recent request skips that prefill work entirely. Plug in your traffic to see the prefill GPU-seconds — and dollars — that stop being recomputed.
Reuse rate: 86% is the measured average on the ShareGPT replay. Drop to 70% for noisier agent traffic.
GPU price per hour: L4 ≈ $0.50 · A10G ≈ $0.75 · A100 ≈ $1.50 · H100 ≈ $3.50
Prefill rate, measured: 3B ≈ 10k tok/s on a 4060 · 32B AWQ ≈ 900 tok/s on Ampere+.
344M
Prefill tokens / day stopped being recomputed
9.6 h
GPU-hours / day reclaimed
$430
GPU spend / month avoided at the rate above
$5,160
Annualized — before headroom freed for more concurrency
Math:
tokens_saved/day = requests/day × prefix_tokens × reuse_rate
GPU-seconds/day = tokens_saved/day / model_prefill_rate
$/month = GPU-seconds/day × $/hr × 30 / 3600
Prefill rates measured on the ShareGPT replay & AWQ streaming benchmarks above — your numbers will vary with hardware and prompt shape.
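The formulas above can be written out as a small calculator. The function and its parameter names are hypothetical. The page does not state the traffic inputs behind the displayed totals, so the 100,000 requests per day and 4,000-token prefix used here are assumptions, chosen because they reproduce the 344M figure together with the 86% reuse rate, the 10k tok/s prefill rate, and the $1.50/hr A100 price.

```python
def prefill_savings(requests_per_day, prefix_tokens, reuse_rate,
                    prefill_tok_per_s, gpu_dollars_per_hour):
    """Apply the cost formulas: tokens saved, GPU time reclaimed, spend avoided."""
    tokens_saved = requests_per_day * prefix_tokens * reuse_rate
    gpu_seconds = tokens_saved / prefill_tok_per_s          # per day
    dollars_per_month = gpu_seconds * gpu_dollars_per_hour * 30 / 3600
    return {
        "tokens_saved_per_day": tokens_saved,
        "gpu_hours_per_day": gpu_seconds / 3600,
        "dollars_per_month": dollars_per_month,
        "dollars_per_year": dollars_per_month * 12,
    }


result = prefill_savings(
    requests_per_day=100_000,    # assumed traffic, not stated on the page
    prefix_tokens=4_000,         # assumed shared-prefix length
    reuse_rate=0.86,             # measured average on the ShareGPT replay
    prefill_tok_per_s=10_000,    # 3B model on a 4060, from the rates above
    gpu_dollars_per_hour=1.50,   # A100 price from the list above
)

for name, value in result.items():
    print(f"{name}: {value:,.1f}")
```

Swapping in your own request volume, prefix length, and GPU price gives an estimate for your traffic; the result scales linearly with each input except the prefill rate, which divides it.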
Stop recomputing
the same prefix.
If your agent or coding assistant repeats a 2k–20k token context all day, KVBoost warms it once and reuses it — cutting TTFT and GPU waste without model changes.
$ pip install kvboost
MIT License · Built by @pythongiant · Drop-in with HuggingFace Transformers