# CLI Tools
ZedInfer ships five binaries in the Docker image. Override the entrypoint to run tools other than `serve`.
## serve (default)

HTTP server with Web UI. This is the default entrypoint.

```shell
docker run --gpus all -p 8080:8080 --name zedinfer \
  -v /path/to/models:/models \
  zedinfer:latest /models/Qwen3-8B --nvidia
```
| Argument | Default | Description |
|---|---|---|
| `model_path` | -- | Model directory (required) |
| `--nvidia` | `false` | Use GPU backend |
| `--host` | `0.0.0.0` | Bind address (inside Docker) |
| `--port` | `8080` | Listen port |
| `--max-batch-tokens` | `2048` | Max tokens per batch |
| `--max-batch-requests` | `64` | Max concurrent requests |
| `--gpu-memory-utilization` | `0.9` | GPU memory fraction for KV cache |
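Once the server is up, it can be exercised from any HTTP client. A minimal Python sketch follows; note that the `/generate` route and the payload fields are assumptions for illustration only — this page does not document the server's actual API, so check the Web UI before relying on them:

```python
import json
import urllib.request

# Hypothetical client for a running zedinfer serve instance.
# The /generate route and payload fields below are assumptions;
# consult the Web UI or server docs for the real API.

def build_request(prompt, host="localhost", port=8080, max_tokens=64):
    """Construct a JSON completion request (endpoint path is assumed)."""
    payload = json.dumps({"prompt": prompt, "max_tokens": max_tokens}).encode()
    return urllib.request.Request(
        f"http://{host}:{port}/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )

def generate(prompt, **kwargs):
    """Send the request and decode the JSON response."""
    with urllib.request.urlopen(build_request(prompt, **kwargs)) as resp:
        return json.loads(resp.read())
```

When the container is started with `-p 8080:8080` as above, the default `host`/`port` arguments point at the published port on the Docker host.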
## bench

Single-request latency profiler. Measures prefill and decode separately.

```shell
docker run --gpus all -v /path/to/models:/models \
  --entrypoint /app/bench \
  zedinfer:latest /models/Qwen3-8B --nvidia -p 128 -d 128 -r 3
```
| Argument | Default | Description |
|---|---|---|
| `-p, --prefill-len` | `128` | Prefill token count |
| `-d, --decode-len` | `128` | Decode token count |
| `-r, --rounds` | `3` | Benchmark rounds |
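The two phases are reported separately because prefill ingests the whole prompt at once while decode emits one token at a time, so their per-second rates differ by orders of magnitude. An arithmetic illustration of the per-phase metric (the timings below are hypothetical, not tool output):

```python
def phase_throughput(tokens, elapsed_s):
    # Tokens per second for a single phase (prefill or decode).
    return tokens / elapsed_s

# Hypothetical timings for -p 128 -d 128:
prefill_tps = phase_throughput(128, 0.05)  # prompt ingested in 50 ms
decode_tps = phase_throughput(128, 2.0)    # 128 tokens generated in 2 s
```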
## batch_bench

Batched throughput profiler. Measures total throughput with concurrent requests.

```shell
docker run --gpus all -v /path/to/models:/models \
  --entrypoint /app/batch_bench \
  zedinfer:latest /models/Qwen3-8B --nvidia -b 32 -p 128 -d 128
```
| Argument | Default | Description |
|---|---|---|
| `-b, --batch-size` | `4` | Concurrent requests |
| `-p, --prefill-len` | `128` | Prefill tokens per request |
| `-d, --decode-len` | `128` | Decode tokens per request |
| `-r, --rounds` | `1` | Benchmark rounds |
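The flags combine multiplicatively: each round generates `batch_size × decode_len` tokens in total, which is the figure aggregate throughput is computed from. A quick sanity-check helper (illustrative only, not part of the tool):

```python
def total_decode_tokens(batch_size, decode_len, rounds=1):
    # Tokens generated across all concurrent requests in a run.
    return batch_size * decode_len * rounds

def aggregate_throughput(batch_size, decode_len, elapsed_s):
    # Decode tokens per second across the whole batch.
    return batch_size * decode_len / elapsed_s
```

With the example invocation above (`-b 32 -d 128`), a run whose decode phase takes 4 s would amount to 32 × 128 / 4 = 1024 tok/s.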
## chat

Interactive terminal chat with conversation history.

```shell
docker run --gpus all -it -v /path/to/models:/models \
  --entrypoint /app/chat \
  zedinfer:latest /models/Qwen3-8B --nvidia
```
| Argument | Default | Description |
|---|---|---|
| `--max-tokens` | `16384` | Max tokens per response |
In-session commands: `exit`/`quit`/`q` to exit, `reset`/`clear` to reset the conversation.
**Note:** Requires `-it` (interactive + TTY) for terminal input.
## ping

Quick single-inference test. Useful for verifying model loading and basic generation.

```shell
docker run --gpus all -v /path/to/models:/models \
  --entrypoint /app/ping \
  zedinfer:latest /models/Qwen3-8B --nvidia --prompt "Who are you?"
```
| Argument | Default | Description |
|---|---|---|
| `--prompt` | `"Who are you?"` | Prompt text |
## Version

All binaries support `--version`:
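For example, with the default entrypoint and with an overridden one (assuming the flag short-circuits before model loading; the exact version-string format is not shown here):

```shell
docker run zedinfer:latest --version
docker run --entrypoint /app/bench zedinfer:latest --version
```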