Performance Overview¶
Benchmark results on an NVIDIA RTX 4090 (24 GB VRAM), BF16 precision.
DeepSeek-R1-Distill-Qwen-1.5B¶
Prefill (tok/s)¶
| Prompt Length | ZedInfer | llama.cpp | vLLM | SGLang |
|---|---|---|---|---|
| 128 | 4,661 | 15,076 | 13,332 | 5,967 |
| 256 | 5,329 | 21,949 | 21,961 | 12,178 |
| 512 | 5,566 | 24,647 | 29,476 | 23,853 |
| 1024 | 5,612 | 24,259 | 36,611 | 34,162 |
Decode (tok/s)¶
| Prompt Length | ZedInfer | ZedInfer (batch=4) | llama.cpp | vLLM | SGLang |
|---|---|---|---|---|---|
| 128 | 189.3 | 513.7 | 245.4 | 209.8 | 247.4 |
| 256 | 170.2 | 453.8 | 245.6 | 209.8 | 246.4 |
| 512 | 141.2 | 370.2 | 241.5 | 209.2 | 245.8 |
| 1024 | 105.3 | 268.6 | 241.0 | 208.3 | 244.4 |
DeepSeek-R1-0528-Qwen3-8B¶
Prefill (tok/s)¶
| Prompt Length | ZedInfer | llama.cpp | vLLM | SGLang |
|---|---|---|---|---|
| 128 | 2,980 | 4,442 | 4,666 | 3,001 |
| 256 | 3,374 | 6,601 | 7,467 | 5,246 |
| 512 | 2,914 | 7,430 | 8,563 | 6,860 |
| 1024 | 2,261 | 7,040 | 9,284 | 8,213 |
Decode (tok/s)¶
| Prompt Length | ZedInfer | ZedInfer (batch=4) | llama.cpp | vLLM | SGLang |
|---|---|---|---|---|---|
| 128 | 54.0 | 183.3 | 58.3 | 57.0 | 59.5 |
| 256 | 50.4 | 171.5 | 58.3 | 56.7 | 59.2 |
| 512 | 43.3 | 149.5 | 57.7 | 56.5 | 58.9 |
| 1024 | 35.3 | 125.2 | 57.8 | 56.1 | 58.3 |
Key Observations¶
Prefill gap: ZedInfer's prefill throughput is 2-6x behind vLLM/SGLang, and the gap widens with sequence length. Root cause: the current paged prefill kernel has no IO-aware tiling -- K/V data is loaded from global memory independently per query position with no shared memory reuse.
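As a rough illustration of why the missing reuse hurts more at longer prompts, the following back-of-envelope model (an assumption for illustration, not profiled data; the tile size and BF16 element size are placeholders) counts global-memory K/V traffic with and without tiling:

```python
# Back-of-envelope model (illustrative assumption, not profiled data):
# global-memory K/V read volume for one attention head's prefill pass
# over n tokens with head_dim d, BF16 elements (2 bytes).

def kv_bytes_naive(n, d, dtype_bytes=2):
    # No shared-memory reuse: each of the n query positions independently
    # streams all n K and V vectors from global memory -> O(n^2) traffic.
    return n * n * 2 * d * dtype_bytes

def kv_bytes_tiled(n, d, tile_q=64, dtype_bytes=2):
    # IO-aware tiling: a K/V tile staged in shared memory is reused by all
    # tile_q queries in the block, dividing K/V traffic by the tile size.
    return kv_bytes_naive(n, d, dtype_bytes) // tile_q

if __name__ == "__main__":
    for n in (128, 256, 512, 1024):
        print(f"n={n}: naive {kv_bytes_naive(n, 128)/1e6:.1f} MB, "
              f"tiled {kv_bytes_tiled(n, 128)/1e6:.2f} MB")
```

The quadratic growth of the untiled traffic is consistent with the gap widening as prompt length increases.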
Decode degradation: Other engines maintain near-constant decode throughput across all sequence lengths. ZedInfer's decode throughput drops as sequence length grows (189 → 105 tok/s for 1.5B at 128 → 1024). Root cause: the decode kernel uses Grid=(nhead,) -- only 12 blocks for 1.5B, utilizing 9% of RTX 4090's 128 SMs. Longer sequences mean more serial work per block.
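The occupancy arithmetic behind the 9% figure is simply the ratio of launched blocks to available SMs (using the 12-head count and 128-SM figure quoted above):

```python
# Occupancy arithmetic for the Grid=(nhead,) decode launch: one block per
# attention head means at most num_heads SMs are ever active, independent
# of sequence length. 128 is the RTX 4090 SM count from the text above.
NUM_SMS = 128

def grid_utilization(num_heads, sms=NUM_SMS):
    return num_heads / sms

print(f"{grid_utilization(12):.1%}")  # 12 blocks on 128 SMs -> 9.4%
```

Because the block count is fixed, all additional K/V positions at longer sequence lengths become serial work inside those 12 blocks, which is why decode throughput falls instead of staying flat.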
Batched decode: The batch=4 numbers show that batching effectively multiplies throughput (2.5-2.7x vs single), confirming the continuous batching scheduler works correctly. However, per-request throughput still degrades with sequence length.
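The scaling factors can be recomputed directly from the 1.5B decode table above:

```python
# Batch=4 scaling for the 1.5B model, taken from the decode table above.
single = {128: 189.3, 256: 170.2, 512: 141.2, 1024: 105.3}
batch4 = {128: 513.7, 256: 453.8, 512: 370.2, 1024: 268.6}

speedups = {n: batch4[n] / single[n] for n in single}
for n, s in speedups.items():
    print(f"prompt {n}: {s:.2f}x")
```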
FlashInfer integration is the primary optimization target¶
Replacing the custom paged attention kernels with FlashInfer is expected to deliver 3-5x decode speedup and 2-4x prefill speedup through IO-aware tiling (prefill) and split-K flash-decoding (decode).
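The key idea in split-K flash-decoding is that each thread block reduces one chunk of the KV sequence independently, and the partial results merge exactly via log-sum-exp rescaling. A minimal NumPy sketch of that merge (illustrating the published flash-decoding technique, not ZedInfer or FlashInfer internals):

```python
import numpy as np

def attend(q, K, V):
    # Reference single-pass softmax attention for one query vector.
    s = K @ q
    w = np.exp(s - s.max())
    return (w @ V) / w.sum()

def attend_splitk(q, K, V, splits=4):
    # Flash-decoding style split-K: each split reduces its KV chunk
    # independently, keeping (partial output, chunk max, chunk sum);
    # partials then merge exactly via log-sum-exp rescaling.
    parts = []
    for Kc, Vc in zip(np.array_split(K, splits), np.array_split(V, splits)):
        s = Kc @ q
        m = s.max()
        w = np.exp(s - m)
        parts.append((w @ Vc, m, w.sum()))
    m_all = max(m for _, m, _ in parts)
    num = sum(o * np.exp(m - m_all) for o, m, _ in parts)
    den = sum(z * np.exp(m - m_all) for _, m, z in parts)
    return num / den

rng = np.random.default_rng(0)
q = rng.normal(size=64)
K, V = rng.normal(size=(1024, 64)), rng.normal(size=(1024, 64))
assert np.allclose(attend(q, K, V), attend_splitk(q, K, V))
```

Because the splits are independent, the kernel can launch `nhead × splits` blocks instead of `nhead`, directly addressing the low SM utilization described above.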
Optimizations Applied (v0.1.0)¶
| Optimization | Impact |
|---|---|
| DecodeScratch pre-allocation | Eliminates ~500 Tensor allocations per decode step |
| Paged decode shared memory precompute | ~16% decode improvement (breaks dependent-load chain) |
| ArgmaxSampler pinned buffer | Eliminates a 570 µs cudaMallocHost per call |
| Prefix caching | Skips redundant prefill for shared token prefixes |
| BlockPool O(1) counters | Fast admission control for continuous batching |
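ZedInfer's actual prefix-cache data structures are not shown in this document; as a hypothetical sketch, block-granular prefix caching can key each KV block by the hash of all tokens up to and including that block, so a new request skips prefill for its longest cached prefix (`BLOCK`, `cache`, and both function names are illustrative):

```python
# Hypothetical sketch of block-granular prefix caching (names and block
# size are illustrative, not ZedInfer's actual implementation).
BLOCK = 4   # tokens per KV block
cache = {}  # prefix hash -> physical block id

def insert(tokens):
    # Register every full block of this sequence, keyed by its full prefix.
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        cache.setdefault(hash(tuple(tokens[: i + BLOCK])), len(cache))

def cached_prefix_len(tokens):
    # Walk block by block; stop at the first block whose prefix hash misses.
    n = 0
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        if hash(tuple(tokens[: i + BLOCK])) not in cache:
            break
        n = i + BLOCK
    return n

insert([1, 2, 3, 4, 5, 6, 7, 8])
print(cached_prefix_len([1, 2, 3, 4, 9, 9, 9, 9]))  # first block reused -> 4
```

Hashing the full prefix rather than the block's own tokens is what makes a hit safe: a matching key guarantees the entire history, and hence the cached KV contents, are identical.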
Reproducing Benchmarks¶
```sh
# Single-request latency
docker run --gpus all -v /path/to/models:/models \
  --entrypoint /app/bench \
  tianyuxbear/zedinfer:latest /models/DeepSeek-R1-Distill-Qwen-1.5B --nvidia -p 128 -d 128 -r 3

# Batched throughput (4 concurrent requests)
docker run --gpus all -v /path/to/models:/models \
  --entrypoint /app/batch_bench \
  tianyuxbear/zedinfer:latest /models/DeepSeek-R1-Distill-Qwen-1.5B --nvidia -b 4 -p 128 -d 128
```
Comparison Engine Versions¶
| Engine | Version |
|---|---|
| ZedInfer | v0.1.0 |
| llama.cpp | e4832e3 |
| vLLM | 0.13.0 |
| SGLang | 0.5.7 |