# CLI Tools
ZedInfer ships five binaries in the Docker image. Override the entrypoint to run tools other than `serve`.
## serve (default)

HTTP server with Web UI. This is the default entrypoint.

```shell
docker run --gpus all -p 8080:8080 --name zedinfer \
  -v /path/to/models:/models \
  zedinfer:latest /models/Qwen3-8B --nvidia
```
| Argument | Default | Description |
|---|---|---|
| `model_path` | -- | Model directory (required) |
| `--nvidia` | `false` | Use GPU backend |
| `--host` | `0.0.0.0` | Bind address (inside Docker) |
| `--port` | `8080` | Listen port |
| `--max-batch-tokens` | `2048` | Max tokens per batch |
| `--max-batch-requests` | `64` | Max concurrent requests |
| `--gpu-memory-utilization` | `0.9` | GPU memory fraction for KV cache |
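Once the server is up, it can be exercised from any HTTP client. A minimal Python sketch follows; note that the `/generate` route and the payload fields are assumptions for illustration only — this page does not document the server's actual API, so check the Web UI before relying on them:

```python
import json
import urllib.request

# Hypothetical client for a running zedinfer serve instance.
# The /generate route and payload fields below are assumptions;
# consult the Web UI or server docs for the real API.

def build_request(prompt, host="localhost", port=8080, max_tokens=64):
    """Construct a JSON completion request (endpoint path is assumed)."""
    payload = json.dumps({"prompt": prompt, "max_tokens": max_tokens}).encode()
    return urllib.request.Request(
        f"http://{host}:{port}/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )

def generate(prompt, **kwargs):
    """Send the request and decode the JSON response."""
    with urllib.request.urlopen(build_request(prompt, **kwargs)) as resp:
        return json.loads(resp.read())
```

When the container is started with `-p 8080:8080` as above, the default `host`/`port` arguments point at the published port on the Docker host.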
## bench

Single-request latency profiler. Measures prefill and decode separately.

```shell
docker run --gpus all -v /path/to/models:/models \
  --entrypoint /app/bench \
  zedinfer:latest /models/Qwen3-8B --nvidia -p 128 -d 128 -r 3
```
| Argument | Default | Description |
|---|---|---|
| `-p, --prefill-len` | `128` | Prefill token count |
| `-d, --decode-len` | `128` | Decode token count |
| `-r, --rounds` | `3` | Benchmark rounds |
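The two phases are reported separately because prefill ingests the whole prompt at once while decode emits one token at a time, so their per-second rates differ by orders of magnitude. An arithmetic illustration of the per-phase metric (the timings below are hypothetical, not tool output):

```python
def phase_throughput(tokens, elapsed_s):
    # Tokens per second for a single phase (prefill or decode).
    return tokens / elapsed_s

# Hypothetical timings for -p 128 -d 128:
prefill_tps = phase_throughput(128, 0.05)  # prompt ingested in 50 ms
decode_tps = phase_throughput(128, 2.0)    # 128 tokens generated in 2 s
```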
## batch_bench

Batched throughput profiler. Measures total throughput with concurrent requests.

```shell
docker run --gpus all -v /path/to/models:/models \
  --entrypoint /app/batch_bench \
  zedinfer:latest /models/Qwen3-8B --nvidia -b 32 -p 128 -d 128
```
| Argument | Default | Description |
|---|---|---|
| `-b, --batch-size` | `4` | Concurrent requests |
| `-p, --prefill-len` | `128` | Prefill tokens per request |
| `-d, --decode-len` | `128` | Decode tokens per request |
| `-r, --rounds` | `1` | Benchmark rounds |
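The flags combine multiplicatively: each round generates `batch_size × decode_len` tokens in total, which is the figure aggregate throughput is computed from. A quick sanity-check helper (illustrative only, not part of the tool):

```python
def total_decode_tokens(batch_size, decode_len, rounds=1):
    # Tokens generated across all concurrent requests in a run.
    return batch_size * decode_len * rounds

def aggregate_throughput(batch_size, decode_len, elapsed_s):
    # Decode tokens per second across the whole batch.
    return batch_size * decode_len / elapsed_s
```

With the example invocation above (`-b 32 -d 128`), a run whose decode phase takes 4 s would amount to 32 × 128 / 4 = 1024 tok/s.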
## chat

Interactive terminal chat with conversation history.

```shell
docker run --gpus all -it -v /path/to/models:/models \
  --entrypoint /app/chat \
  zedinfer:latest /models/Qwen3-8B --nvidia
```
| Argument | Default | Description |
|---|---|---|
| `--max-tokens` | `16384` | Max tokens per response |
In-session commands: `exit`/`quit`/`q` to exit, `reset`/`clear` to reset the conversation.
**Note:** Requires `-it` (interactive + TTY) for terminal input.
## ping

Quick single-inference test. Useful for verifying model loading and basic generation.

```shell
docker run --gpus all -v /path/to/models:/models \
  --entrypoint /app/ping \
  zedinfer:latest /models/Qwen3-8B --nvidia --prompt "Who are you?"
```
| Argument | Default | Description |
|---|---|---|
| `--prompt` | `"Who are you?"` | Prompt text |
## Version

All binaries support `--version`:
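For example, with the default entrypoint and with an overridden one (assuming the flag short-circuits before model loading; the exact version-string format is not shown here):

```shell
docker run zedinfer:latest --version
docker run --entrypoint /app/bench zedinfer:latest --version
```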