Operator Benchmarks¶
Per-operator performance data. Updated as optimizations land.
**Coming soon:** Detailed per-operator benchmarks (latency, throughput, dtype comparison) for both CPU and GPU backends will be published here.
Planned Content¶
- Linear: cuBLAS vs oneDNN, BF16/FP16/FP32, various matrix sizes
- Attention: decode (single sequence vs batched), prefill scaling with sequence length
- RMSNorm: CPU (AVX2/AVX-512) vs GPU, varying hidden dimensions
- RoPE: CPU vs GPU, throughput vs sequence length
- SwiGLU / Add: Elementwise operation throughput
- Embedding / Argmax: Lookup and reduction performance
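Until the full tables land, latency and throughput for an elementwise operator can be measured with a simple wall-clock harness. The sketch below is illustrative only and is not the project's benchmark framework; the `swiglu` and `bench` helpers are hypothetical names, and NumPy stands in for the real CPU/GPU kernels.

```python
import time
import numpy as np

def swiglu(gate: np.ndarray, up: np.ndarray) -> np.ndarray:
    """SwiGLU: SiLU(gate) * up, a common elementwise transformer op."""
    return gate / (1.0 + np.exp(-gate)) * up

def bench(fn, *args, warmup: int = 3, iters: int = 20) -> dict:
    """Median wall-clock latency and derived element throughput for one call."""
    for _ in range(warmup):          # warm caches / allocator before timing
        fn(*args)
    times = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn(*args)
        times.append(time.perf_counter() - t0)
    lat = sorted(times)[len(times) // 2]   # median is robust to outliers
    elems = args[0].size
    return {"latency_s": lat, "gelem_per_s": elems / lat / 1e9}

if __name__ == "__main__":
    hidden = 4096
    gate = np.random.randn(1, hidden).astype(np.float32)
    up = np.random.randn(1, hidden).astype(np.float32)
    stats = bench(swiglu, gate, up)
    print(f"SwiGLU hidden={hidden}: {stats['latency_s'] * 1e6:.1f} us, "
          f"{stats['gelem_per_s']:.3f} Gelem/s")
```

For GPU backends a harness like this must also synchronize the device before reading the clock, otherwise it measures only kernel launch time.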
Test Framework¶
Operator correctness is verified via Python bindings against PyTorch reference implementations. Each operator is tested across:
- Multiple shapes (small to large)
- Multiple data types (FP32, FP16, BF16)
- Both CPU and GPU backends
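The shape/dtype sweep above can be sketched as a small parametrized check. This is a self-contained illustration, not the actual test suite: NumPy stands in for the PyTorch reference so the snippet runs without torch, `check_operator` is a hypothetical helper, and the candidate kernel is checked against the reference itself.

```python
import numpy as np

def rmsnorm_ref(x: np.ndarray, weight: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Reference RMSNorm, accumulated in float32 regardless of input dtype."""
    x32 = x.astype(np.float32)
    rms = np.sqrt(np.mean(x32 * x32, axis=-1, keepdims=True) + eps)
    return (x32 / rms * weight.astype(np.float32)).astype(x.dtype)

def check_operator(candidate, reference, shapes, dtypes, tol) -> None:
    """Compare a candidate kernel to the reference across shapes and dtypes."""
    rng = np.random.default_rng(0)
    for shape in shapes:
        for dtype in dtypes:
            x = rng.standard_normal(shape).astype(dtype)
            w = rng.standard_normal(shape[-1]).astype(dtype)
            got = candidate(x, w)
            want = reference(x, w)
            # Compare in float32 with a per-dtype tolerance.
            np.testing.assert_allclose(
                got.astype(np.float32), want.astype(np.float32),
                rtol=tol[dtype], atol=tol[dtype],
            )

if __name__ == "__main__":
    shapes = [(1, 64), (8, 256), (4, 32, 4096)]   # small to large
    dtypes = [np.float32, np.float16]             # BF16 needs torch/ml_dtypes
    tol = {np.float32: 1e-5, np.float16: 1e-2}
    # Self-check: the reference must agree with itself everywhere.
    check_operator(rmsnorm_ref, rmsnorm_ref, shapes, dtypes, tol)
    print("all shapes/dtypes passed")
```

Lower-precision dtypes get looser tolerances because the candidate kernel may accumulate in a different order or precision than the reference.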