ZedInfer

A C++17 LLM inference engine with CPU (x86) and NVIDIA GPU (CUDA) backends. No Python in the serving path.


Core Features

Direct Model Forward

No graph compilation, no IR lowering. A single shared forward loop executes transformer layers directly, parameterized by model config flags. This eliminates graph-level overhead and keeps the execution path transparent and debuggable.
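The flag-driven loop can be pictured with a minimal sketch (names here are hypothetical, not ZedInfer's actual API): every model runs the same loop, and per-model behavior comes only from config flags checked inline, never from a compiled graph.

```cpp
#include <cstddef>

// Hypothetical flags steering the shared forward loop.
struct ForwardFlags {
    std::size_t num_layers;
    bool qk_norm;  // some architectures add a per-head Q/K norm
};

// Counts the ops the shared loop would execute for a given config;
// a stand-in for the real loop, which runs the kernels directly.
int forward_op_count(const ForwardFlags& f) {
    int ops = 0;
    for (std::size_t i = 0; i < f.num_layers; ++i) {
        ops += 1;                 // self-attention
        if (f.qk_norm) ops += 1;  // optional Q/K normalization
        ops += 1;                 // MLP
    }
    return ops;
}
```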

Paged KV Cache

GPU memory is pre-allocated as a fixed-size block pool. Each request's KV cache is backed by non-contiguous blocks, enabling efficient memory utilization across varying sequence lengths. The block pool supports reference counting for sharing, LRU eviction under memory pressure, and O(1) statistics for fast admission decisions.
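A minimal sketch of such a pool, assuming a free-list plus per-block reference counts (the real pool also maintains LRU ordering for eviction and lives in GPU memory):

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Fixed-size KV block pool with reference counting (sketch).
class BlockPool {
public:
    explicit BlockPool(std::size_t num_blocks) : refs_(num_blocks, 0) {
        for (std::size_t i = 0; i < num_blocks; ++i) free_.push_back(i);
    }
    // O(1) statistic used for admission decisions.
    std::size_t free_blocks() const { return free_.size(); }

    std::optional<std::size_t> allocate() {
        if (free_.empty()) return std::nullopt;  // caller must evict or wait
        std::size_t id = free_.back();
        free_.pop_back();
        refs_[id] = 1;
        return id;
    }
    // Prefix sharing: a second request pins the same block.
    void add_ref(std::size_t id) { ++refs_[id]; }
    void release(std::size_t id) {
        if (--refs_[id] == 0) free_.push_back(id);  // back to the pool
    }

private:
    std::vector<int> refs_;
    std::vector<std::size_t> free_;
};
```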

Prefix Caching

When multiple requests share the same token prefix (e.g., system prompt or shared conversation history), their KV cache blocks are shared via content hashing rather than recomputed. This significantly reduces prefill latency for multi-turn conversations.
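One way content hashing can work (a sketch, not necessarily ZedInfer's exact scheme): each block's key chains the previous block's hash with the tokens it holds, so identical prefixes resolve to identical keys no matter which request produced them, and a hash-to-block-id index turns a prefix hit into a refcount bump instead of a recompute.

```cpp
#include <cstdint>
#include <vector>

using Hash = std::uint64_t;

// Chained polynomial hash over one block's tokens (illustrative
// constants). Equal prefixes yield equal keys; any earlier
// divergence propagates into every later block's key.
Hash chain_hash(Hash prev, const std::vector<int>& block_tokens) {
    Hash h = prev;
    for (int t : block_tokens)
        h = h * 1000003ULL + static_cast<Hash>(t) + 1;
    return h;
}
```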

Continuous Batching

Decode-first scheduling with chunked prefill. New requests are admitted while existing requests continue generating, maximizing GPU utilization. The scheduler performs block-based admission control to prevent out-of-memory conditions.
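The decode-first budget split can be sketched as follows (function and field names are hypothetical): every decoding request gets its one token first, and only the leftover token budget goes to a chunk of the next waiting prefill, so long prompts never starve in-flight generation.

```cpp
#include <algorithm>

// Tokens scheduled in one step (sketch).
struct StepPlan {
    int decode_tokens;   // one per decoding request
    int prefill_chunk;   // slice of the next waiting prompt
};

StepPlan plan_step(int num_decoding, int prefill_remaining, int token_budget) {
    int decode = std::min(num_decoding, token_budget);
    int chunk  = std::min(prefill_remaining, token_budget - decode);
    return {decode, chunk};
}
```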

Optimized Operator Backends

  • GPU: cuBLAS/cuBLASLt for linear layers (auto-tuned), custom paged attention kernels with online softmax and shared memory optimization
  • CPU: oneDNN for linear (runtime ISA dispatch -- automatically selects AVX2/AVX-512/AMX), AVX-vectorized elementwise ops with OpenMP parallelism
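The online softmax mentioned above is a standard one-pass trick; here is a scalar sketch of the accumulator such kernels maintain (the real version runs per-thread in CUDA over paged KV blocks): a running max `m` and a rescaled sum `l` are updated per score, so all logits never need to be materialized at once.

```cpp
#include <algorithm>
#include <cmath>

// One-pass numerically stable softmax accumulator (sketch).
struct OnlineSoftmax {
    double m = -INFINITY;  // running max of scores seen so far
    double l = 0.0;        // sum of exp(score - m), rescaled on each new max

    void push(double x) {
        double m_new = std::max(m, x);
        l = l * std::exp(m - m_new) + std::exp(x - m_new);
        m = m_new;
    }
    // Final attention weight for a score, once all scores are pushed.
    double weight(double x) const { return std::exp(x - m) / l; }
};
```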

OpenAI-Compatible HTTP API

Drop-in replacement for OpenAI's /v1/chat/completions endpoint with SSE streaming support. Includes a built-in Web UI for interactive chat. Stateful sessions enable multi-turn conversations with KV cache persistence across requests.

DecodeScratch Pre-allocation

Pre-allocated fixed-address GPU buffers for single-token decode eliminate ~500 memory allocation round-trips per decode step, reducing per-token overhead to near zero.
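The idea can be sketched as a bump-carved arena (names hypothetical; the real buffers are carved from GPU memory via CUDA allocation, with a plain byte array standing in here): every per-step buffer is carved once at a fixed offset during warm-up, so the decode step itself performs zero allocations.

```cpp
#include <cstddef>
#include <vector>

// Decode scratch arena (sketch): one allocation, fixed offsets.
class DecodeScratch {
public:
    explicit DecodeScratch(std::size_t bytes) : arena_(bytes), used_(0) {}

    // Carve a buffer at warm-up; its address never changes afterwards,
    // so kernels can bind it once instead of re-allocating per token.
    void* carve(std::size_t bytes) {
        void* p = arena_.data() + used_;
        used_ += bytes;
        return p;
    }
    std::size_t used() const { return used_; }

private:
    std::vector<unsigned char> arena_;
    std::size_t used_;
};
```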


Supported Models

Model                          Architecture  Parameters
DeepSeek-R1-Distill-Qwen-1.5B  Qwen2         1.5B
Qwen2.5-Math-1.5B-Instruct     Qwen2         1.5B
DeepSeek-R1-0528-Qwen3-8B      Qwen3         8B
Qwen3-8B                       Qwen3         8B

Adding a new Qwen-family model requires only defining a ModelForwardConfig -- zero forward logic changes needed.
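A plausible shape for such a config entry (field names and values are illustrative, not ZedInfer's actual struct): adding a model means writing one of these, and the shared forward loop reads the flags at runtime.

```cpp
#include <cstddef>
#include <string>

// Illustrative stand-in for ModelForwardConfig; real fields differ.
struct ModelForwardConfig {
    std::string architecture;
    std::size_t num_layers;
    std::size_t hidden_size;
    bool qk_norm;  // Qwen3 normalizes Q/K per head; Qwen2 does not
};

// Registering a new Qwen-family model is just a new config entry
// (layer/hidden values here are illustrative):
ModelForwardConfig qwen2_1_5b() { return {"Qwen2", 28, 1536, false}; }
ModelForwardConfig qwen3_8b()   { return {"Qwen3", 36, 4096, true};  }
```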

Data Types

BF16, FP16, FP32

Quick Start

docker run --gpus all -p 8080:8080 \
    -v /path/to/models:/models \
    tianyuxbear/zedinfer:latest /models/DeepSeek-R1-Distill-Qwen-1.5B --nvidia

Then open http://localhost:8080 for the Web UI, or call the API:

curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages":[{"role":"user","content":"Hello"}]}'

See the Quick Start Guide for details.