
CLI Tools

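ZedInfer ships five binaries in the Docker image: serve, bench, batch_bench, chat, and ping. serve is the default entrypoint; to run any other tool, override the entrypoint with --entrypoint /app/<tool>.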
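
The entrypoint override can be wrapped in a small shell function so each tool is invoked the same way. The following is a minimal sketch, not part of ZedInfer: the function name zedinfer_tool and the variables ZEDINFER_MODELS, ZEDINFER_IMAGE, and DRY_RUN are hypothetical, and the /app/<tool> paths are taken from the examples on this page.

```shell
# Hypothetical wrapper around the docker invocations shown on this page.
# ZEDINFER_MODELS, ZEDINFER_IMAGE and DRY_RUN are illustrative names,
# not options that ZedInfer itself reads.
zedinfer_tool() {
    if [ "$#" -lt 1 ]; then
        echo "usage: zedinfer_tool <tool> [args...]" >&2
        return 2
    fi
    tool="$1"
    shift

    case "$tool" in
        serve)                  extra="-p 8080:8080" ;;                # default entrypoint
        chat)                   extra="-it --entrypoint /app/chat" ;;  # needs a terminal
        bench|batch_bench|ping) extra="--entrypoint /app/$tool" ;;
        *) echo "unknown tool: $tool" >&2; return 2 ;;
    esac

    # $extra is left unquoted on purpose so it splits into separate flags.
    set -- run --gpus all -v "${ZEDINFER_MODELS:-/path/to/models}:/models" \
        $extra "${ZEDINFER_IMAGE:-zedinfer:latest}" "$@"

    if [ -n "${DRY_RUN:-}" ]; then
        echo "docker $*"    # print the command instead of running it
    else
        docker "$@"
    fi
}
```

For example, DRY_RUN=1 zedinfer_tool bench /models/Qwen3-8B --nvidia prints the docker command without starting a container, which is a cheap way to check the flags before a long run.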


serve (default)

HTTP server with Web UI. This is the default entrypoint.

docker run --gpus all -p 8080:8080 --name zedinfer \
    -v /path/to/models:/models \
    zedinfer:latest /models/Qwen3-8B --nvidia

| Argument | Default | Description |
| --- | --- | --- |
| model_path | -- | Model directory (required) |
| --nvidia | false | Use GPU backend |
| --host | 0.0.0.0 | Bind address (inside Docker) |
| --port | 8080 | Listen port |
| --max-batch-tokens | 2048 | Max tokens per batch |
| --max-batch-requests | 64 | Max concurrent requests |
| --gpu-memory-utilization | 0.9 | GPU memory fraction for KV cache |
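
Because --port sets the listen port inside the container, the container side of Docker's -p HOST:CONTAINER mapping has to match it. The sketch below prints a serve command for a chosen port. The function name serve_cmd is hypothetical, and keeping the host and container ports identical is a simplifying assumption.

```shell
# Hypothetical helper: print a serve command for a non-default port.
# It only prints the command; it does not start a container.
serve_cmd() {
    port="${1:-8080}"

    # Reject anything that is not a plain number.
    case "$port" in
        ''|*[!0-9]*) echo "invalid port: $port" >&2; return 2 ;;
    esac

    # Host and container ports are kept identical here for simplicity;
    # -p HOST:CONTAINER lets them differ.
    echo "docker run --gpus all -p $port:$port -v /path/to/models:/models zedinfer:latest /models/Qwen3-8B --nvidia --port $port"
}
```

Calling serve_cmd 9000 prints a command that publishes port 9000 and passes --port 9000 to serve.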

bench

Single-request latency profiler. Measures prefill and decode separately.

docker run --gpus all -v /path/to/models:/models \
    --entrypoint /app/bench \
    zedinfer:latest /models/Qwen3-8B --nvidia -p 128 -d 128 -r 3

| Argument | Default | Description |
| --- | --- | --- |
| -p, --prefill-len | 128 | Prefill token count |
| -d, --decode-len | 128 | Decode token count |
| -r, --rounds | 3 | Benchmark rounds |
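
To see how latency changes with prompt length, bench can be run once per prefill length. This is a minimal sketch under stated assumptions: bench_sweep, maybe_run, and the DRY_RUN variable are hypothetical names, and the decode length and round count are fixed at the defaults from the table.

```shell
# maybe_run is a hypothetical helper: print the command when DRY_RUN is set,
# otherwise execute it.
maybe_run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Run bench once per prefill length given after the model path.
bench_sweep() {
    if [ "$#" -lt 1 ]; then
        echo "usage: bench_sweep <model_path> <prefill_len>..." >&2
        return 2
    fi
    model="$1"
    shift
    for plen in "$@"; do
        maybe_run docker run --rm --gpus all -v /path/to/models:/models \
            --entrypoint /app/bench \
            zedinfer:latest "$model" --nvidia -p "$plen" -d 128 -r 3
    done
}
```

For example, bench_sweep /models/Qwen3-8B 128 512 2048 runs three separate containers, so each measurement includes a fresh model load.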

batch_bench

Batched throughput profiler. Measures total throughput with concurrent requests.

docker run --gpus all -v /path/to/models:/models \
    --entrypoint /app/batch_bench \
    zedinfer:latest /models/Qwen3-8B --nvidia -b 32 -p 128 -d 128

| Argument | Default | Description |
| --- | --- | --- |
| -b, --batch-size | 4 | Concurrent requests |
| -p, --prefill-len | 128 | Prefill tokens per request |
| -d, --decode-len | 128 | Decode tokens per request |
| -r, --rounds | 1 | Benchmark rounds |
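
When comparing runs, it can help to know how many tokens a given flag combination asks the engine to process per round: batch size times the sum of prefill and decode lengths. The helper below is a hypothetical sketch of that arithmetic only; how batch_bench itself counts tokens or reports throughput is not specified here.

```shell
# Hypothetical helper: total tokens one round asks the engine to process.
# Arguments mirror -b, -p and -d; the defaults follow the batch_bench table.
workload_tokens() {
    b="${1:-4}"
    p="${2:-128}"
    d="${3:-128}"
    echo $(( b * (p + d) ))
}
```

For the example command above (-b 32 -p 128 -d 128), workload_tokens 32 128 128 prints 8192: 4096 prefill tokens plus 4096 decode tokens.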

chat

Interactive terminal chat with conversation history.

docker run --gpus all -it -v /path/to/models:/models \
    --entrypoint /app/chat \
    zedinfer:latest /models/Qwen3-8B --nvidia

| Argument | Default | Description |
| --- | --- | --- |
| --max-tokens | 16384 | Max tokens per response |

In-session commands: exit, quit, or q ends the session; reset or clear resets the conversation.

Note

chat requires -it (interactive + TTY) so that it can read terminal input.
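
Scripts that launch chat can check for a terminal first, since Docker typically refuses to allocate a TTY when stdin is not one. The guard below is a minimal sketch; the function name chat_or_fail is hypothetical.

```shell
# Hypothetical guard for scripts that launch chat.
chat_or_fail() {
    # [ -t 0 ] is true only when stdin is attached to a terminal.
    if [ ! -t 0 ]; then
        echo "chat needs a terminal on stdin; run it from an interactive shell" >&2
        return 1
    fi
    docker run --gpus all -it -v /path/to/models:/models \
        --entrypoint /app/chat \
        zedinfer:latest "$@"
}
```

From an interactive shell, chat_or_fail /models/Qwen3-8B --nvidia behaves like the command above; from a pipe or a cron job it fails early with a clear message.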


ping

Quick single-inference test. Useful for verifying model loading and basic generation.

docker run --gpus all -v /path/to/models:/models \
    --entrypoint /app/ping \
    zedinfer:latest /models/Qwen3-8B --nvidia --prompt "Who are you?"

| Argument | Default | Description |
| --- | --- | --- |
| --prompt | "Who are you?" | Prompt text |
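
ping can serve as a smoke test in a deployment script. The sketch below rests on an assumption this page does not confirm: that ping exits with a non-zero status when model loading or generation fails. The function name smoke_test and the DOCKER override variable are hypothetical.

```shell
# Hypothetical smoke test built on ping.
# Assumption: ping exits non-zero when loading or generation fails.
# DOCKER is an illustrative override so the docker binary can be swapped out.
smoke_test() {
    if "${DOCKER:-docker}" run --rm --gpus all -v /path/to/models:/models \
        --entrypoint /app/ping \
        zedinfer:latest "$@" >/dev/null 2>&1; then
        echo "ping ok"
    else
        echo "ping failed" >&2
        return 1
    fi
}
```

A script might call smoke_test /models/Qwen3-8B --nvidia before starting serve and stop if it returns non-zero.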

Version

All binaries support --version:

docker run --rm zedinfer:latest --version
# zedinfer 0.1.0 (build 6f1385b, 2026-03-26)