tierKV: A Distributed KV Cache That Makes Evicted Blocks Faster to Restore Than GPU Cache Hits
May 9, 2026


Originally published by Dev.to

The Problem

When your GPU's KV cache fills up, inference engines evict blocks and discard them. The next request with the same prefix re-runs full prefill from scratch — a cost quadratic in sequence length. On a 30,000-token document that means 10+ seconds, every time the same prompt reappears.

tierKV intercepts evicted KV blocks, quantizes them, ships them to a vault on a LAN machine, and restores them on the next cache miss — injecting directly into vLLM's paged KV buffer with no attention recomputation. It integrates via vLLM's KVConnectorBase_V1 plugin API with no source changes required.

Benchmarks (Qwen3.6-35B-A3B, Apple FY2025 10-K, 30,561 tokens)

We ran the Apple FY2025 10-K filing through three scenarios. A full cold prefill with no cache took 10.75 seconds. A GPU cache hit (blocks already in VRAM) dropped that to 1.19 seconds. The cold vault restore came in at 0.52 seconds — 20× faster than a full prefill, and faster than the GPU cache hit.

Vault restore beats GPU cache hit because it bypasses attention computation entirely. GPU hits still run partial attention; vault blocks go straight into the buffer. The gap widens with context length — projected ~35× speedup at 128k tokens since prefill is O(n²) and restore is O(n) + network.
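A back-of-envelope model makes the scaling argument concrete. The 0.21 attention share of prefill time at 30k tokens is an assumption chosen for illustration (prefill is attention plus linear MLP/MoE work, so it is not purely quadratic); restore is modeled as purely linear in token count.

```python
# Why the gap widens with context: prefill has a quadratic attention term
# plus linear (MLP/MoE) work, while restore is linear transfer + decode.
# quad_frac = 0.21 is an assumed attention share at n0, for illustration only.

n0, prefill0, restore0 = 30_561, 10.75, 0.52  # measured seconds at 30,561 tokens
quad_frac = 0.21  # assumed fraction of prefill time spent in attention at n0

def prefill(n):
    s = n / n0
    return prefill0 * ((1 - quad_frac) * s + quad_frac * s**2)

def restore(n):
    return restore0 * (n / n0)  # linear: bytes moved scale with token count

for n in (30_561, 131_072):
    print(f"{n:>7} tokens: {prefill(n) / restore(n):4.1f}x")
```

With these assumptions the model reproduces the measured ~20× at 30k tokens and lands near the projected ~35× at 128k.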

tierKV also supports EXO via a post-install patch. On an 8,000-token prompt: 30.83s cold → 4.11s restored (7.5×).

Architecture

Three tiers:

[Hot]  GPU KV cache  — VRAM, in-engine prefix cache
[Cold] KV vault      — LAN machine RAM, ~0.5ms away, gRPC
[Cold] SSM vault     — separate LAN machine for SSM/linear-attention layers

Eviction path: GPU block evicted → TurboQuant INT8 encode → fire-and-forget gRPC Store → GPU block freed immediately.

Restore path: Cache miss → BatchPromote RPC (all layers, one round-trip) → parallel rayon decode (GIL released) → tensors injected into paged KV buffer.
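The two paths above can be sketched in miniature. Everything here is illustrative — the function names, the dict standing in for the LAN vault, and the thread pool standing in for async gRPC are all assumptions, not tierKV's actual API (the real system uses gRPC and a Rust decode core):

```python
# Illustrative sketch of the eviction and restore paths.
# All names are hypothetical; the vault dict stands in for the remote store.
from concurrent.futures import ThreadPoolExecutor

vault = {}                        # stands in for the LAN vault's RAM store
_store_pool = ThreadPoolExecutor(max_workers=4)

def quantize(block):              # stands in for TurboQuant INT8 encode
    return bytes(block)

def dequantize(payload):          # stands in for the parallel rayon decode
    return list(payload)

def on_evict(block_id, block):
    """Eviction path: encode, then fire-and-forget Store; the caller frees
    the GPU block immediately without waiting for the RPC to complete."""
    payload = quantize(block)
    _store_pool.submit(vault.__setitem__, block_id, payload)

def on_cache_miss(block_ids):
    """Restore path: one batched fetch for all requested blocks, decoded
    in parallel, ready for injection into the paged KV buffer."""
    payloads = [vault[b] for b in block_ids]   # one BatchPromote round-trip
    return list(_store_pool.map(dequantize, payloads))
```

The design point the sketch captures: stores are off the critical path (eviction never blocks the engine), while restores batch everything into a single round-trip so latency is paid once, not per layer.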

TurboQuant is a per-group INT8 quantizer written in Rust. Groups are aligned to attention head boundaries (group size = head dim, e.g. 256 for Qwen3.6-35B-A3B), so outlier heads can't corrupt neighboring groups. Result: 3.9× compression at ≥52 dB SNR.
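A minimal NumPy sketch of the per-group scheme (the real TurboQuant is Rust, and this omits its packing and the scale-storage overhead behind the 3.9× figure). The point is the grouping: each head-aligned group gets its own scale, so one outlier head cannot inflate a neighbor's quantization step. Note that random Gaussian data yields a lower SNR than the structured KV tensors the ≥52 dB figure refers to.

```python
import numpy as np

def quantize_groups(x, group_size=256):
    """Symmetric per-group INT8: one scale per group, aligned to head dim,
    so an outlier in one attention head cannot corrupt a neighbor's group."""
    g = x.reshape(-1, group_size)
    scales = np.abs(g).max(axis=1, keepdims=True) / 127.0
    scales = np.where(scales == 0, 1.0, scales)   # avoid divide-by-zero on all-zero groups
    q = np.clip(np.round(g / scales), -127, 127).astype(np.int8)
    return q, scales.astype(np.float32)

def dequantize_groups(q, scales, shape):
    return (q.astype(np.float32) * scales).reshape(shape)

rng = np.random.default_rng(0)
kv = rng.standard_normal((64, 256)).astype(np.float32)  # 64 groups, head_dim=256
q, s = quantize_groups(kv.ravel())
out = dequantize_groups(q, s, kv.shape)
snr_db = 10 * np.log10((kv**2).mean() / ((kv - out)**2).mean())
print(f"SNR: {snr_db:.1f} dB")
```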

Hybrid models like Qwen3.6-35B-A3B (10 full-attention + 30 linear-attention layers) route the two layer types to separate vaults automatically — no manual config per model.
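The routing itself amounts to a per-layer-type lookup against the cluster config. A hypothetical sketch (the vault addresses come from the tierkv.toml shown below; the layer-type strings are assumptions):

```python
# Hypothetical sketch of automatic layer routing: pick a vault address
# from the cluster config based on each layer's attention type.
KV_VAULT  = "192.168.1.10:50051"   # full-attention KV blocks
SSM_VAULT = "192.168.1.11:50051"   # linear-attention / SSM states

def vault_for(layer_type: str) -> str:
    return SSM_VAULT if layer_type in ("linear_attention", "ssm") else KV_VAULT

# A hybrid model with 10 full-attention and 30 linear-attention layers:
layers = ["full_attention"] * 10 + ["linear_attention"] * 30
routes = [vault_for(t) for t in layers]
print(routes.count(KV_VAULT), "layers -> KV vault,",
      routes.count(SSM_VAULT), "layers -> SSM vault")
```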

Setup

Step 1 — Install on all machines:

pip install tierkv

No Cargo, no cmake. The Rust core is bundled in the wheel.

Step 2 — Configure each machine (tierkv.toml):

Inference node:

[cluster]
role = "inference"
kv_cold = "192.168.1.10:50051"
ssm_cold = "192.168.1.11:50051"

[turbo_quant]
enabled = true
kv_dim = 256  # match your model's attention head dimension

KV vault machine:

[cluster]
role = "kv_cold"

[vault]
max_bytes = 24_000_000_000  # 24 GB

Step 3 — Start vault servers on cold machines:

tierkv vault

Step 4 — Verify connectivity:

tierkv status

Step 5 — Launch vLLM:

vllm serve Qwen/Qwen3-30B-A3B \
  --kv-transfer-config '{
    "kv_connector": "TierKVConnector",
    "kv_connector_module_path": "tierkv.connectors.vllm.connector",
    "kv_role": "kv_both",
    "kv_connector_extra_config": {"config_path": "/path/to/tierkv.toml"}
  }' \
  --enable-prefix-caching \
  --no-disable-hybrid-kv-cache-manager \
  --block-size 16

That's it — no vLLM source changes, no rebuilding. tierKV intercepts eviction and restore automatically.

EXO users: tierkv install --exo-path /path/to/exo patches EXO in place. Then launch EXO as normal.

Our Test Cluster

  • Inference node: NVIDIA DGX Spark (GB10, 96 GB unified memory) — runs vLLM or EXO
  • KV cold vault: Apple Mac Pro (M2 Pro, 32 GB RAM) — 24 GB reserved for KV blocks
  • SSM cold vault: Apple MacBook Air (M2, 16 GB RAM) — 12 GB reserved for SSM states
  • Network: 5 GbE LAN, ~0.5ms RTT

Deliberately modest hardware. The vault nodes are otherwise idle machines — no GPU required.

When tierKV Helps

  • Repeated long-context prompts (RAG over fixed docs, chat history, system prompts)
  • Multi-user serving with shared prefixes — first request warms the vault, all others benefit
  • Hybrid MoE + SSM models where both layer types need separate cold storage
  • Tight VRAM budget relative to context length

When It Doesn't Help

  • Single-shot prompts that never repeat
  • High-latency networks (WiFi, WAN) — assumes sub-5ms LAN RTT
  • Tensor-parallel multi-GPU inference — not yet supported
  • Very short prompts on hybrid models (below HMA block size threshold)
  • Applications requiring bit-for-bit identical output (use turbo_quant = false)

