pip install kvboost
KVBoost
Faster LLM Inference.
Less VRAM. No Model Changes.
Chunk-level KV cache reuse · FlashAttention-2 · AWQ layer streaming · CPU paged decoding
enable_prefix_caching=True. vLLM requires byte-identical prefixes; KVBoost matches at chunk granularity with a boundary-alignment window, recovering reuse when new assistant tokens shift the prefix.
avg_committed/round = 3.00: speculative decoding collapses N target forwards into one.
Honest: ~2–5 tok/s on Ampere+, ~0.5 tok/s on Turing (GEMM-bound).
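The difference between strict prefix matching and chunk-granularity matching can be illustrated with a small sketch. This is not KVBoost's API or implementation: the function names, the chunk size of 4 tokens, and the idea of trying every offset inside one chunk as the "alignment window" are all assumptions made for illustration. The sketch only counts which tokens could be matched against a cache; it does not address whether the cached keys and values remain valid at the shifted positions.

```python
import hashlib

CHUNK = 4  # hypothetical chunk size in tokens, chosen for illustration


def chunk_hashes(tokens, chunk=CHUNK, offset=0):
    """Hash each full chunk of `tokens`, starting at `offset`."""
    hashes = []
    for i in range(offset, len(tokens) - chunk + 1, chunk):
        piece = tokens[i:i + chunk]
        hashes.append(hashlib.sha256(repr(piece).encode()).hexdigest())
    return hashes


def prefix_reuse(cached, new):
    """Tokens reusable when the new prompt must share an exact prefix."""
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n


def chunk_reuse(cached, new, chunk=CHUNK):
    """Tokens reusable when chunks are matched by content.

    Every offset in [0, chunk) is tried, standing in for an
    alignment window that tolerates a shifted chunk boundary.
    """
    known = set(chunk_hashes(cached, chunk))
    best = 0
    for offset in range(chunk):
        hits = sum(1 for h in chunk_hashes(new, chunk, offset) if h in known)
        best = max(best, hits * chunk)
    return best


# A cached 16-token prompt, then the same tokens shifted by two new ones.
cached = list(range(100, 116))
new = [7, 8] + cached

print(prefix_reuse(cached, new))  # 0: the first token already differs
print(chunk_reuse(cached, new))   # 16: all four chunks match at offset 2
```

In this toy case two inserted tokens defeat prefix matching entirely, while content-based chunk matching recovers all four cached chunks once the boundary is realigned.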