Better output from the same model. Fused computation, adaptive precision, surgical expert loading. 305 KB, 19 backends, zero dependencies. https://inference-x.com
# Benchmarks
Real benchmark results from Inference-X running on commodity hardware. No cherry-picking, no warm cache, no tricks.
## AMD EPYC (AVX2+FMA)
Server: AMD EPYC Rome | 17 GB RAM | 6 cores | AVX2+FMA
Binary: 305 KB | Compiled with -O3 -march=native
Date: February 2026
| Model | Parameters | Quantization | Architecture | tok/s | Prefill |
|---|---|---|---|---|---|
| SmolLM2 | 135M | Q8_0 | LLAMA | 130.23 | 87 ms |
| Llama 3.2 | 3B | Q4_K_M | LLAMA | 3.82 | 3.8 s |
| Qwen 2.5 | 3B | Q4_K_M | QWEN2 | 3.85 | 16.5 s |
| Qwen 2.5 | 7B | Q4_K_M | QWEN2 | 1.82 | 39.5 s |
| Mistral 7B v0.3 | 7B | Q4_K_M | LLAMA | 2.06 | 39.2 s |
| Llama 3.1 | 8B | Q4_K_M | LLAMA | 1.75 | 43.0 s |
| DeepSeek R1 Qwen | 7B | Q4_K_M | QWEN2 | 1.80 | 38.2 s |
| Gemma 2 | 9B | Q4_K_M | GEMMA2 | 1.28 | 55.5 s |
| DeepSeek R1 Qwen | 14B | Q4_K_M | QWEN2 | 0.97 | 74.1 s |
9/10 models passing. All benchmarks from cold start. No caching. CPU-only.
## What this means
These are CPU-only numbers on a €20/month server. No GPU. The same binary, unchanged, scales from 135M to 14B parameters. The protocol doesn't care about the model — it reads what the model describes.
## Chat template auto-detection
Every model above was run with zero manual configuration. The engine reads the GGUF metadata and selects the correct chat template automatically:
| Template | Models |
|---|---|
| ChatML | SmolLM2, Qwen 2.5 (all), DeepSeek R1 |
| Llama 3 | Llama 3.2, Llama 3.1 |
| Mistral | Mistral 7B |
| Gemma | Gemma 2 |
## Running your own benchmarks
```sh
# Quick test
./examples/bench.sh /path/to/model.gguf

# Or manually
make
./inference-x /path/to/model.gguf -p "The capital of France is" -n 64
```
We welcome benchmark contributions from different hardware. Submit your results via pull request.