The LLM inference tax nobody talks about

Every request recomputes everything. I built an open-source fix for the cases prefix caching misses.

Srihari Unnikrishnan · Independent research · 2026

Every LLM request at your company recomputes everything. System prompt, retrieved documents, few-shot examples — the model reads them fresh, every time. Doesn't matter if those exact tokens showed up in the last million requests. No memory. Full price.

The industry has a partial answer: prefix caching. If requests share a leading prefix token-for-token, vLLM and SGLang will skip that recomputation. It works well when your shared content always sits at the very front. It does nothing when a retrieved document shows up at position 200, or when a system prompt gets shuffled by dynamic context injection, or when two users share common boilerplate that happens not to be the very first tokens.

I got tired of paying the tax where prefix caching doesn't apply. So I built KVBoost. Source: github.com/pythongiant/kvboost, MIT licensed, open for contributors.

The idea

Split the prompt into 128-token chunks. Give each chunk two hash keys — one encoding content plus positional context (for exact reuse), one encoding content alone (for approximate reuse regardless of where it appears in the prompt). When a request arrives, check both stores before touching the GPU for those tokens.
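
A minimal sketch of that keying scheme, in Python. The 128-token chunk size and the two-key idea come from the description above; the hash construction and function names here are illustrative, not KVBoost's actual internals.

```python
# Dual-hash chunk keying (sketch). One key chains in everything before the
# chunk (exact reuse), the other depends on the chunk's content alone
# (position-independent, approximate reuse).
import hashlib

CHUNK_SIZE = 128  # tokens per chunk, as described above

def chunk_tokens(token_ids):
    """Split a token-id list into fixed-size chunks."""
    return [token_ids[i:i + CHUNK_SIZE] for i in range(0, len(token_ids), CHUNK_SIZE)]

def content_hash(chunk):
    """Position-independent key: same content hashes the same anywhere in the prompt."""
    return hashlib.sha256(str(chunk).encode("utf-8")).hexdigest()

def prefix_hash(chunk, preceding_hash):
    """Position-dependent key: chains in the preceding context, so a hit means
    the cached KV tensors are exactly right as-is."""
    return hashlib.sha256((preceding_hash + content_hash(chunk)).encode("utf-8")).hexdigest()

def index_prompt(token_ids):
    """Return (prefix_key, content_key) pairs for every chunk of a prompt."""
    keys, running = [], ""
    for chunk in chunk_tokens(token_ids):
        running = prefix_hash(chunk, running)
        keys.append((running, content_hash(chunk)))
    return keys
```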

[Figure: chunk-level KV cache reuse, prefix caching vs KVBoost. Prefix caching reuses 2/4 chunks after a mid-prompt miss; KVBoost reuses 3/4, including one approximate hit.]
Figure 1. Prefix caching (left) fails when a miss appears mid-prompt — all subsequent chunks are blocked even if their content is cached elsewhere. KVBoost (right) reuses any matching chunk regardless of position, including approximate content matches (C4).

The result: 4.49× mean time-to-first-token speedup over full recomputation, and 16% faster than vLLM prefix caching, on 1,000 samples from a bug-localization benchmark using Qwen2.5-3B. Exact-match accuracy: 99.2% vs 99.1% for both baselines.

4.49× speedup vs. full recompute (mean TTFT)
16% faster than vLLM prefix caching
99.2% exact-match accuracy, vs 99.1% baseline

The hard part

Two problems nearly killed this.

The first is attention seams. When you stitch cached chunks together, boundary tokens attended to a different preceding context when they were originally encoded. Reusing those KV tensors verbatim produces silent attention errors — the kind that don't crash, don't appear in logs, and quietly degrade output quality.

I implemented two repair strategies. The spatial one (SelectiveRecompute) re-encodes the last 16 tokens before each seam. The deviation-guided one (CacheBlendRecompute, adapted from the CacheBlend paper) does a cheap probe pass, measures per-token cosine deviation, and recomputes only the tokens that actually changed — about 15% of the prompt. The deviation-guided approach catches mid-chunk errors the spatial window misses entirely, and it's mandatory for any approximate content match.
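
Here is roughly what the deviation-guided selection step looks like. The ~15% budget and the cosine-deviation criterion come from the description above; the tensor shapes and function names are simplified assumptions, not the repo's code.

```python
# Deviation-guided repair selection (sketch). `probe_keys` come from a cheap
# forward pass over the stitched prompt; `cached_keys` are the reused tensors.
# Both are [num_tokens, head_dim], flattened over layers/heads for simplicity.
import torch

def tokens_to_recompute(cached_keys: torch.Tensor,
                        probe_keys: torch.Tensor,
                        budget: float = 0.15) -> torch.Tensor:
    """Return indices of the tokens whose cached keys deviate most (by cosine
    distance) from what a full-context forward pass would produce."""
    cos = torch.nn.functional.cosine_similarity(cached_keys, probe_keys, dim=-1)
    deviation = 1.0 - cos                    # 0 = identical, larger = more drift
    k = max(1, int(budget * deviation.numel()))
    return torch.topk(deviation, k).indices  # ~15% worst tokens get re-encoded
```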

[Figure: seam error and repair at chunk boundaries. SelectiveRecompute uses a fixed 16-token boundary window; CacheBlendRecompute picks the top ~15% of tokens by cosine deviation. Both correct attention context before the main forward pass.]
Figure 2. The repair window straddles each chunk seam, spanning portions of both cached chunks. KVBoost patches only the tokens that actually deviate from what a full-context forward pass would produce.

The second problem is RoPE position collisions. Key vectors carry their positional encoding baked in. A chunk cached at positions 0–128 is wrong if reused at positions 512–640. The dual-hash scheme handles this: the prefix hash encodes content plus position context (exact reuse, no correction needed), while the content hash encodes content alone (approximate reuse, mandatory deviation repair). Two lookup tiers, two different trust levels.
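
In lookup terms, the two tiers behave something like the sketch below. The store names and the ChunkHit record are invented for illustration; only the trust split (exact hits reused as-is, content-only hits flagged for mandatory repair) is from the design above.

```python
# Two-tier lookup (sketch). Exact hits are trusted verbatim; content-only hits
# landed at a different position, so their RoPE-encoded keys need repair.
from dataclasses import dataclass

@dataclass
class ChunkHit:
    kv: object          # cached (key, value) tensors for this chunk
    needs_repair: bool  # True for position-mismatched, content-only hits

def lookup(prefix_key, content_key, exact_store, content_store):
    if prefix_key in exact_store:        # tier 1: content + position match
        return ChunkHit(exact_store[prefix_key], needs_repair=False)
    if content_key in content_store:     # tier 2: same content, wrong position
        return ChunkHit(content_store[content_key], needs_repair=True)
    return None                          # miss: encode on the GPU, cache both keys
```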

What’s in the repo

Beyond the core caching engine, a few things make this production-usable rather than a research toy.

Asymmetric KIVI-style quantization cuts cache memory by about half (int8) or 4× (int4) with negligible quality impact. It uses per-channel quantization for keys and per-token quantization for values, which matches how outliers actually distribute across those tensors.
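
A toy int8 version of that asymmetric layout, to make the per-channel vs. per-token distinction concrete. KVBoost's actual kernels, group sizes, and int4 packing will differ.

```python
# Asymmetric (zero-point) int8 quantization sketch: per-channel scales for
# keys, per-token scales for values. Tensors are [num_tokens, head_dim].
import torch

def quantize(x: torch.Tensor, dim: int):
    """Asymmetric int8 quantization with min/max computed along `dim`."""
    x_min = x.amin(dim=dim, keepdim=True)
    x_max = x.amax(dim=dim, keepdim=True)
    scale = (x_max - x_min).clamp(min=1e-8) / 255.0
    q = torch.round((x - x_min) / scale).clamp(0, 255).to(torch.uint8)
    return q, scale, x_min

def dequantize(q, scale, x_min):
    return q.to(torch.float32) * scale + x_min

keys = torch.randn(128, 64)    # [tokens, channels]
values = torch.randn(128, 64)
qk = quantize(keys, dim=0)     # per-channel: key outliers cluster in channels
qv = quantize(values, dim=1)   # per-token: value outliers are token-local
```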

Importance-weighted LRU eviction scores chunks by their ℓ2 key norm before deciding what to evict. A system prompt with high attention signal outlives filler tokens under memory pressure, rather than being evicted purely by recency.
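
Sketched as code, the eviction policy might combine recency with the mean ℓ2 key norm like this. The exact weighting is an assumption; only the idea that norm-based importance modulates LRU comes from the description above.

```python
# Importance-weighted LRU eviction (sketch): low-norm, long-idle chunks go first.
import time
import torch

class Chunk:
    def __init__(self, keys: torch.Tensor):
        self.keys = keys
        self.last_used = time.monotonic()
        self.importance = keys.norm(dim=-1).mean().item()  # mean L2 key norm

def pick_victim(chunks: dict) -> str:
    """Return the cache key of the chunk with the lowest importance-per-idle-time score."""
    now = time.monotonic()
    def score(c: Chunk) -> float:
        age = now - c.last_used + 1.0
        return c.importance / age   # filler tokens with low norms lose to system prompts
    return min(chunks, key=lambda k: score(chunks[k]))
```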

An optional disk-tier overflow stores cold chunks as memory-mapped files. Retrieval takes 10–50ms per chunk, compared to 100–500ms for GPU recompute. Not free, but substantially cheaper.
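
A minimal version of that spill-and-restore path using PyTorch's memory-mapped loading. The directory layout and function names are illustrative, not KVBoost's on-disk format.

```python
# Cold-tier spill to disk (sketch). torch.load(..., mmap=True) maps the file
# instead of eagerly copying it into RAM; tensors page in on access.
import os
import torch

DISK_DIR = "kv_disk_tier"   # assumed location, not a KVBoost constant
os.makedirs(DISK_DIR, exist_ok=True)

def spill(chunk_key: str, kv) -> None:
    """Write a chunk's (key, value) tensors to the disk tier."""
    k, v = kv
    torch.save({"k": k.cpu(), "v": v.cpu()}, os.path.join(DISK_DIR, f"{chunk_key}.pt"))

def restore(chunk_key: str, device: str = "cuda"):
    """Memory-map a cold chunk back in and move it to the target device."""
    blob = torch.load(os.path.join(DISK_DIR, f"{chunk_key}.pt"),
                      map_location="cpu", mmap=True)
    return blob["k"].to(device), blob["v"].to(device)
```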

Adaptive chunk boundary splitting nudges split points toward punctuation so seams fall at natural linguistic breaks. Batch generation loads the shared chunk prefix once across a batch and broadcasts it zero-copy via torch.Tensor.expand. The API is two methods: warm(text) pre-loads documents into cache, generate(prompt) uses it.
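
Putting that two-method API together, usage looks roughly like this. The import path and constructor arguments are guesses, the file paths are placeholders; only warm() and generate() are the entry points described above.

```python
# Usage sketch, assuming a KVBoost-style engine class. Check the repo's README
# for the real import path and constructor signature.
from kvboost import KVBoost  # hypothetical import

engine = KVBoost(model="Qwen/Qwen2.5-3B-Instruct")  # assumed constructor argument

SYSTEM_PROMPT = "You are a release engineer assistant."
retrieved_doc = open("docs/runbook.md").read()       # placeholder document

# Pay the encoding cost once, up front, for content many prompts will share.
engine.warm(SYSTEM_PROMPT)
engine.warm(retrieved_doc)

# Later requests reuse the cached chunks wherever they appear in the prompt.
answer = engine.generate(
    f"{SYSTEM_PROMPT}\n\n{retrieved_doc}\n\nQ: Why does the deploy fail on staging?"
)
```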

Compatibility covers Qwen2, LLaMA, Mistral, Mixtral, Gemma, Phi, and anything else using RoPE with HuggingFace’s past_key_values interface. ALiBi and learned absolute embeddings are not supported — they require different position correction mechanisms that aren’t implemented yet.

The numbers, honestly

[Figure: mean TTFT. Full recompute 639 ms; vLLM prefix cache 165 ms; KVBoost 142 ms.]
Figure 3. Mean time-to-first-token across 1,000 samples (Qwen2.5-3B, RTX 4060). KVBoost’s advantage over vLLM is largest at short context lengths (1.53× in the 0–500 token bucket) and narrows as context grows.

The 4.49× speedup is real on this workload. The 16% advantage over vLLM reflects two things: KVBoost avoids PagedAttention’s per-page overhead, and its repair strategy means it can safely reuse chunks that prefix caching would have to leave alone.

The accuracy numbers are clean. Seam repair doesn’t hurt quality on bug localization. What I haven’t tested: long-form generation, code completion, multi-document summarization. Those benchmarks would tell a more complete story. I haven’t run them yet, and the benchmark suite in the repo is there if you want to.

Two known rough edges: single-GPU only (multi-GPU tensor parallelism isn’t implemented), and for very long cached contexts (over 8K tokens) the CacheBlend probe pass itself gets expensive — a threshold-based activation would fix it, but it’s not shipped yet.

Who should care

RAG pipelines are the obvious case. Every retrieved document chunk is a cache candidate. Pre-warm the document store once; pay for encoding amortized across thousands of queries instead of per-query. Multi-turn chat benefits similarly — system prompts and early conversation turns get cached and reused across users sharing the same context.

As context windows grow from 128K to 1M tokens, the cost of recomputing shared content grows quadratically with prompt length, since prefill attention is O(n²). Prefix caching handles that cost only when your architecture happens to front-load all shared content. Most production architectures don't.

The core idea — that where content appears in a prompt shouldn’t determine whether its KV tensors can be reused — sounds obvious in retrospect. Making it work without quality regression means solving the position encoding problem and the attention seam problem together. I think KVBoost is the first open-source implementation that does both in a single usable package.


Come build this

The repo is at github.com/pythongiant/kvboost, MIT licensed. Issues are open. Benchmark suite and experiment scripts are in benchmarks_and_experiments/important/.

Things I'm actively looking for help with: multi-GPU tensor parallelism, a threshold-based activation for the CacheBlend probe at long contexts, and benchmark coverage beyond bug localization (long-form generation, code completion, multi-document summarization).

If the problem interests you — whether that means opening a PR, stress-testing the benchmarks on a different model or task, or just asking hard questions about the architecture — find me on GitHub.