# Benchmarks

Real benchmark results from Inference-X running on commodity hardware. No cherry-picking, no warm cache, no tricks.

## AMD EPYC (AVX2+FMA)

**Server:** AMD EPYC Rome | 17 GB RAM | 6 cores | AVX2+FMA
**Binary:** 305 KB | Compiled with `-O3 -march=native`
**Date:** February 2026

| Model | Parameters | Quantization | Architecture | tok/s | Prefill |
|-------|-----------|-------------|--------------|-------|---------|
| SmolLM2 | 135M | Q8_0 | LLAMA | **130.23** | 87 ms |
| Llama 3.2 | 3B | Q4_K_M | LLAMA | **3.82** | 3.8 s |
| Qwen 2.5 | 3B | Q4_K_M | QWEN2 | **3.85** | 16.5 s |
| Qwen 2.5 | 7B | Q4_K_M | QWEN2 | **1.82** | 39.5 s |
| Mistral 7B v0.3 | 7B | Q4_K_M | LLAMA | **2.06** | 39.2 s |
| Llama 3.1 | 8B | Q4_K_M | LLAMA | **1.75** | 43.0 s |
| DeepSeek R1 Qwen | 7B | Q4_K_M | QWEN2 | **1.80** | 38.2 s |
| Gemma 2 | 9B | Q4_K_M | GEMMA2 | **1.28** | 55.5 s |
| DeepSeek R1 Qwen | 14B | Q4_K_M | QWEN2 | **0.97** | 74.1 s |

**9/10 models passing.** All benchmarks from cold start. No caching. CPU-only.

### What this means

These are CPU-only numbers on a €20/month server. No GPU. The same binary, unchanged, scales from 135M to 14B parameters. The engine doesn't special-case any model; it reads what the GGUF file describes.

### Chat template auto-detection

Every model above was run with zero manual configuration. The engine reads the GGUF metadata and selects the correct chat template automatically:

| Template | Models |
|----------|--------|
| ChatML | SmolLM2, Qwen 2.5 (all), DeepSeek R1 |
| Llama 3 | Llama 3.2, Llama 3.1 |
| Mistral | Mistral 7B |
| Gemma | Gemma 2 |

## Running your own benchmarks

```bash
# Quick test
./examples/bench.sh /path/to/model.gguf

# Or manually
make
./inference-x /path/to/model.gguf -p "The capital of France is" -n 64
```

We welcome benchmark contributions from different hardware. Submit your results via pull request.
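To make the auto-detection concrete: the mapping in the template table above can be expressed as a lookup on GGUF metadata keys such as `general.architecture` and `general.name` (both are standard GGUF keys). This is a minimal Python sketch of that mapping, not Inference-X's actual code; the function name and the fallback heuristics are illustrative assumptions.

```python
# Sketch only: mirrors the template table in this document.
# `general.architecture` and `general.name` are standard GGUF metadata
# keys; the selection rules below are illustrative, not the engine's code.

def select_template(metadata: dict) -> str:
    """Pick a chat template family from GGUF-style metadata."""
    arch = metadata.get("general.architecture", "")
    name = metadata.get("general.name", "").lower()
    if arch == "gemma2":
        return "gemma"
    if arch == "qwen2":
        return "chatml"      # Qwen 2.5 and the DeepSeek R1 Qwen distills
    if arch == "llama":
        if "mistral" in name:
            return "mistral"
        if "smollm" in name:
            return "chatml"  # SmolLM2 ships as LLAMA arch but uses ChatML
        return "llama3"      # Llama 3.1 / 3.2
    return "chatml"          # assumed default for unknown architectures

print(select_template({"general.architecture": "qwen2"}))  # chatml
```

Keeping the decision keyed on metadata rather than on file names is what lets one binary serve every row of the benchmark table without per-model configuration.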
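If you want to reproduce the two metrics reported in the table (prefill time and decode tok/s) with your own harness, the measurement itself is simple: time the prompt-processing pass once, then time N sequential decode steps and divide. The sketch below shows the arithmetic with stub callables; it is not the internals of `bench.sh`, and the function names are hypothetical.

```python
import time

def bench(prefill_fn, decode_fn, n_tokens: int):
    """Time one prefill pass, then n_tokens decode steps.

    Returns (prefill_seconds, decode_tokens_per_second)."""
    t0 = time.perf_counter()
    prefill_fn()                     # process the whole prompt once
    prefill_s = time.perf_counter() - t0

    t1 = time.perf_counter()
    for _ in range(n_tokens):
        decode_fn()                  # generate one token per step
    decode_s = time.perf_counter() - t1
    return prefill_s, n_tokens / decode_s

# Stub usage with dummy model steps standing in for real inference:
prefill_s, tps = bench(lambda: time.sleep(0.01), lambda: time.sleep(0.001), 64)
print(f"prefill {prefill_s * 1000:.0f} ms | {tps:.2f} tok/s")
```

Note that prefill and decode are timed separately: prefill is a single batched pass over the prompt, while decode throughput is dominated by sequential per-token steps, which is why the table reports them as distinct numbers.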