# Roadmap

## Completed (v0.1.0)
| Feature | Description |
|---|---|
| Direct model forward | Single shared forward loop, no graph execution overhead |
| cuBLAS / oneDNN linear | Auto-tuned GPU GEMM, runtime ISA dispatch on CPU |
| Paged KV cache | Block pool with ref counting, LRU eviction |
| Paged attention kernels | Decode (single + batched) and prefill |
| Continuous batching | Decode-first scheduling with chunked prefill |
| Prefix caching | Cross-request block sharing via content hashing |
| HTTP API + Web UI | OpenAI-compatible API with SSE streaming |
| Docker deployment | Multi-stage build, portable binary, closed-source distribution |
| CLI version flag | --version with git hash and build date injection |
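The paged KV cache and prefix caching rows above combine three mechanisms: a fixed pool of blocks, per-block reference counting so concurrent requests can share a block, and LRU eviction of blocks no request holds. A minimal sketch of that interaction (all names here are illustrative, not the engine's actual API; blocks are keyed by a content hash of the token prefix, as the prefix-caching row describes):

```python
import hashlib
from collections import OrderedDict

class BlockPool:
    """Toy paged-KV block pool: ref-counted blocks keyed by a content
    hash of the token prefix they hold, with LRU eviction of blocks
    that no request currently references."""

    def __init__(self, num_blocks: int):
        self.free = num_blocks       # never-allocated blocks
        self.refcount = {}           # block hash -> active references
        self.lru = OrderedDict()     # unreferenced cached blocks, oldest first

    @staticmethod
    def content_hash(prefix_tokens: tuple) -> str:
        # Identical token prefixes hash identically -> cross-request sharing.
        return hashlib.sha256(repr(prefix_tokens).encode()).hexdigest()

    def acquire(self, prefix_tokens: tuple) -> str:
        h = self.content_hash(prefix_tokens)
        if h in self.refcount:               # prefix-cache hit: share block
            self.refcount[h] += 1
            return h
        if h in self.lru:                    # revive an evictable cached block
            del self.lru[h]
            self.refcount[h] = 1
            return h
        if self.free == 0:                   # pool full: evict LRU block
            if not self.lru:
                raise MemoryError("no evictable blocks")
            self.lru.popitem(last=False)
            self.free += 1
        self.free -= 1
        self.refcount[h] = 1
        return h

    def release(self, h: str):
        self.refcount[h] -= 1
        if self.refcount[h] == 0:            # keep cached, mark evictable
            del self.refcount[h]
            self.lru[h] = True
```

Two requests with the same prefix acquire the same block and bump its refcount; once both release it, the block stays resident as an eviction candidate rather than being freed, which is what makes later prefix hits cheap.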
## Upcoming

### Priority 1: FlashInfer Integration
Replace the custom paged attention kernels with FlashInfer's optimized kernels. Expected speedups: 3-5x for decode, 2-4x for prefill.
Approach: ahead-of-time (AOT) kernel generation. The Python codegen runs offline, and the compiled .cu files link as regular CUDA code, so no Python is required at runtime.
### Priority 2: CUDA Graph
Capture the decode forward pass as a CUDA graph and replay it with a single launch, saving roughly 2 ms of kernel-launch overhead per step. Phase 1 (pre-allocated decode buffers) is complete.
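The capture/replay idea can be shown without a GPU. In this pure-Python sketch (conceptual only, not the CUDA API), launches made during capture are recorded instead of dispatched, and a single replay call then issues the whole recorded sequence, which is the step that removes per-kernel launch overhead on a real device. It also shows why Phase 1 matters: the recorded work must target fixed, pre-allocated buffers, since replay reruns the exact same operations on the exact same memory.

```python
class CaptureStream:
    """Conceptual stand-in for a CUDA stream in capture mode: kernel
    launches are recorded, not executed, while capturing."""

    def __init__(self):
        self.capturing = False
        self.recorded = []

    def begin_capture(self):
        self.capturing = True
        self.recorded = []

    def end_capture(self):
        self.capturing = False
        return Graph(self.recorded)

    def launch(self, kernel, *args):
        if self.capturing:
            self.recorded.append((kernel, args))  # record, don't run
        else:
            kernel(*args)

class Graph:
    """Stand-in for a captured CUDA graph."""

    def __init__(self, nodes):
        self.nodes = nodes

    def replay(self):
        # One replay call issues the whole decode step; on a real GPU this
        # replaces N kernel-launch round trips with a single graph launch.
        for kernel, args in self.nodes:
            kernel(*args)
```

Decode would capture its forward pass once, then call `replay()` for every subsequent token, reusing the same pre-allocated buffers each step.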
### Priority 3: INT8 Quantization
2x model memory reduction via per-channel symmetric quantization. Enables larger batch sizes or bigger models on the same hardware.
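Per-channel symmetric quantization assigns each output channel its own scale, mapping that channel's largest absolute weight to 127. A minimal NumPy sketch (function names are illustrative):

```python
import numpy as np

def quantize_per_channel(w: np.ndarray):
    """Symmetric per-output-channel INT8 quantization of a weight
    matrix w of shape [out_features, in_features]."""
    # One scale per output channel: map the channel's max |w| to 127.
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0.0, 1.0, scale)   # guard all-zero rows
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale
```

Symmetric (zero-point-free) quantization keeps the INT8 GEMM simple, and per-channel scales keep the rounding error of each row bounded by half that row's quantization step. Relative to fp16 weights this halves model memory, which is the 2x figure above.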
### Priority 4: INT4 Quantization (GPTQ/AWQ)
4x model memory reduction. Run 30B+ parameter models on 24GB GPUs.
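At 4 bits, two weights pack into each byte, so 30B parameters need roughly 15 GB for weights, leaving headroom on a 24 GB GPU. The packing itself can be sketched as below (illustrative only; GPTQ/AWQ kernels use their own group-wise layouts and scales on top of this):

```python
import numpy as np

def pack_int4(q: np.ndarray) -> np.ndarray:
    """Pack signed 4-bit values (range [-8, 7]) two per byte:
    even indices in the low nibble, odd indices in the high nibble."""
    assert q.size % 2 == 0
    u = (q.astype(np.int8) & 0x0F).astype(np.uint8)  # two's-complement nibbles
    return (u[0::2] | (u[1::2] << 4)).astype(np.uint8)

def unpack_int4(p: np.ndarray) -> np.ndarray:
    """Inverse of pack_int4, sign-extending each 4-bit nibble."""
    lo = ((p & 0x0F)).astype(np.int8)
    hi = ((p >> 4) & 0x0F).astype(np.int8)
    lo = np.where(lo > 7, lo - 16, lo)   # sign-extend two's complement
    hi = np.where(hi > 7, hi - 16, hi)
    out = np.empty(p.size * 2, dtype=np.int8)
    out[0::2], out[1::2] = lo, hi
    return out
```

The real win is in the dequantizing GEMM kernels, but the byte math above is where the 4x memory figure comes from: half a byte per weight versus two bytes for fp16.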
### Priority 5: Heterogeneous CPU/GPU Inference
Mixed device execution with per-layer device placement, pinned host memory, and async prefetch with compute overlap.
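Per-layer placement can start from a simple greedy plan: fill the GPU budget with a contiguous run of front layers and spill the rest to pinned host memory, so prefetch of the spilled layers can overlap with GPU compute on the earlier ones. A sketch under those assumptions (names and policy are illustrative, not the engine's planner):

```python
def plan_placement(layer_bytes, gpu_budget_bytes):
    """Greedy per-layer device placement: keep a contiguous prefix of
    layers on the GPU until the memory budget is spent, then place the
    remaining layers on the CPU (served via pinned memory + async
    prefetch, so their transfer overlaps earlier layers' compute)."""
    placement, used = [], 0
    for size in layer_bytes:
        if used + size <= gpu_budget_bytes:
            placement.append("gpu")
            used += size
        else:
            placement.append("cpu")
    return placement
```

A contiguous GPU prefix is the simplest schedule that guarantees overlap: while layer *i* runs on the GPU, the transfer engine can already be prefetching the first CPU-resident layer.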
### Priority 6: MoE Expert Offloading
Run large Mixture-of-Experts models (e.g., Qwen-30B-A3B) on consumer GPUs by dynamically loading active experts.
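Dynamic expert loading is essentially a device-side cache over the expert weights: only the experts the router selects for the current tokens need to be resident, and a model like Qwen-30B-A3B activates a small fraction of its experts per token. A minimal LRU sketch (`load_fn` stands in for the real host-to-device weight transfer; all names are illustrative):

```python
from collections import OrderedDict

class ExpertCache:
    """Keep at most `capacity` experts resident on the GPU; load
    misses via load_fn and evict the least recently used expert."""

    def __init__(self, capacity: int, load_fn):
        self.capacity = capacity
        self.load_fn = load_fn
        self.resident = OrderedDict()   # expert id -> weights

    def get(self, expert_id):
        if expert_id in self.resident:
            self.resident.move_to_end(expert_id)   # mark recently used
            return self.resident[expert_id]
        if len(self.resident) >= self.capacity:
            self.resident.popitem(last=False)      # evict LRU expert
        w = self.load_fn(expert_id)                # host -> device copy
        self.resident[expert_id] = w
        return w
```

Because expert selection is correlated across consecutive tokens, even a small resident set hits often, so a consumer GPU only pays transfer cost for the occasional cold expert.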