Better output from the same model. Fused computation, adaptive precision, surgical expert loading. 305 KB, 19 backends, zero dependencies. https://inference-x.com
# Benchmarks
Real benchmark results from Inference-X running on commodity hardware. No cherry-picking, no warm cache, no tricks.
## AMD EPYC (AVX2+FMA)
**Server:** AMD EPYC Rome | 17 GB RAM | 6 cores | AVX2+FMA
**Binary:** 305 KB | Compiled with `-O3 -march=native`
**Date:** February 2026
| Model | Parameters | Quantization | Architecture | tok/s | Prefill |
|-------|-----------|-------------|--------------|-------|---------|
| SmolLM2 | 135M | Q8_0 | LLAMA | **130.23** | 87 ms |
| Llama 3.2 | 3B | Q4_K_M | LLAMA | **3.82** | 3.8 s |
| Qwen 2.5 | 3B | Q4_K_M | QWEN2 | **3.85** | 16.5 s |
| Qwen 2.5 | 7B | Q4_K_M | QWEN2 | **1.82** | 39.5 s |
| Mistral 7B v0.3 | 7B | Q4_K_M | LLAMA | **2.06** | 39.2 s |
| Llama 3.1 | 8B | Q4_K_M | LLAMA | **1.75** | 43.0 s |
| DeepSeek R1 Qwen | 7B | Q4_K_M | QWEN2 | **1.80** | 38.2 s |
| Gemma 2 | 9B | Q4_K_M | GEMMA2 | **1.28** | 55.5 s |
| DeepSeek R1 Qwen | 14B | Q4_K_M | QWEN2 | **0.97** | 74.1 s |
**9/10 models passing.** All benchmarks from cold start. No caching. CPU-only.
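For clarity on how to read the table: tok/s here is decode throughput, i.e. generated tokens divided by decode wall time (prefill is reported separately). A minimal sketch of that calculation, assuming a 64-token generation as in the bench command below:

```python
def tok_per_s(n_decoded: int, decode_seconds: float) -> float:
    """Decode throughput: tokens generated / decode wall time.

    Prefill (prompt processing) time is excluded and reported
    on its own, as in the table above.
    """
    return n_decoded / decode_seconds


# Example: 64 tokens decoded in ~0.49 s matches the SmolLM2 row.
print(round(tok_per_s(64, 0.4914), 1))  # → 130.2
```

This is an illustration of the metric, not the engine's internal timer; the function name and the 0.4914 s figure are hypothetical.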
### What this means
These are CPU-only numbers on a €20/month server. No GPU. The same binary, unchanged, scales from 135M to 14B parameters. The protocol doesn't care about the model — it reads what the model describes.
### Chat template auto-detection
Every model above was run with zero manual configuration. The engine reads the GGUF metadata and selects the correct chat template automatically:
| Template | Models |
|----------|--------|
| ChatML | SmolLM2, Qwen 2.5 (all), DeepSeek R1 |
| Llama 3 | Llama 3.2, Llama 3.1 |
| Mistral | Mistral 7B |
| Gemma | Gemma 2 |
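The detection itself can be done by matching the control tokens that each family's `tokenizer.chat_template` GGUF metadata string contains. A minimal sketch of that heuristic (illustrative only; the engine's actual matching logic may differ):

```python
def detect_chat_template(template: str) -> str:
    """Classify a GGUF tokenizer.chat_template string by the
    control tokens it emits. Hypothetical heuristic sketch."""
    if "<|im_start|>" in template:         # ChatML: SmolLM2, Qwen 2.5, DeepSeek R1
        return "ChatML"
    if "<|start_header_id|>" in template:  # Llama 3 header markers
        return "Llama 3"
    if "<start_of_turn>" in template:      # Gemma turn markers
        return "Gemma"
    if "[INST]" in template:               # Mistral instruction brackets
        return "Mistral"
    return "unknown"
```

Because every family above uses distinct, unambiguous control tokens, a few substring checks are enough; no Jinja evaluation is required just to pick the template family.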
## Running your own benchmarks
```bash
# Quick test
./examples/bench.sh /path/to/model.gguf

# Or manually
make
./inference-x /path/to/model.gguf -p "The capital of France is" -n 64
```
We welcome benchmark contributions from different hardware. Submit your results via pull request.