Better output from the same model. Fused computation, adaptive precision, surgical expert loading. 305 KB, 19 backends, zero dependencies. https://inference-x.com
# Benchmarks
Real benchmark results from Inference-X running on commodity hardware. No cherry-picking, no warm cache, no tricks.
## AMD EPYC (AVX2+FMA)
**Server:** AMD EPYC Rome | 17 GB RAM | 6 cores | AVX2+FMA
**Binary:** 305 KB | Compiled with `-O3 -march=native`
**Date:** February 2026
| Model | Parameters | Quantization | Architecture | tok/s | Prefill |
|-------|-----------|-------------|--------------|-------|---------|
| SmolLM2 | 135M | Q8_0 | LLAMA | **130.23** | 87 ms |
| Llama 3.2 | 3B | Q4_K_M | LLAMA | **3.82** | 3.8 s |
| Qwen 2.5 | 3B | Q4_K_M | QWEN2 | **3.85** | 16.5 s |
| Qwen 2.5 | 7B | Q4_K_M | QWEN2 | **1.82** | 39.5 s |
| Mistral 7B v0.3 | 7B | Q4_K_M | LLAMA | **2.06** | 39.2 s |
| Llama 3.1 | 8B | Q4_K_M | LLAMA | **1.75** | 43.0 s |
| DeepSeek R1 Qwen | 7B | Q4_K_M | QWEN2 | **1.80** | 38.2 s |
| Gemma 2 | 9B | Q4_K_M | GEMMA2 | **1.28** | 55.5 s |
| DeepSeek R1 Qwen | 14B | Q4_K_M | QWEN2 | **0.97** | 74.1 s |
**9/10 models passing.** All benchmarks from cold start. No caching. CPU-only.
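For clarity on how to read the table: tok/s here is decode throughput, i.e. generated tokens divided by decode wall time (prefill is reported separately). A minimal sketch of that calculation, assuming a 64-token generation as in the bench command below:

```python
def tok_per_s(n_decoded: int, decode_seconds: float) -> float:
    """Decode throughput: tokens generated / decode wall time.

    Prefill (prompt processing) time is excluded and reported
    on its own, as in the table above.
    """
    return n_decoded / decode_seconds


# Example: 64 tokens decoded in ~0.49 s matches the SmolLM2 row.
print(round(tok_per_s(64, 0.4914), 1))  # → 130.2
```

This is an illustration of the metric, not the engine's internal timer; the function name and the 0.4914 s figure are hypothetical.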
### What this means
These are CPU-only numbers on a €20/month server. No GPU. The same binary, unchanged, scales from 135M to 14B parameters. The protocol doesn't care about the model — it reads what the model describes.
### Chat template auto-detection
Every model above was run with zero manual configuration. The engine reads the GGUF metadata and selects the correct chat template automatically:
| Template | Models |
|----------|--------|
| ChatML | SmolLM2, Qwen 2.5 (all), DeepSeek R1 |
| Llama 3 | Llama 3.2, Llama 3.1 |
| Mistral | Mistral 7B |
| Gemma | Gemma 2 |
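The detection itself can be done by matching the control tokens that each family's `tokenizer.chat_template` GGUF metadata string contains. A minimal sketch of that heuristic (illustrative only; the engine's actual matching logic may differ):

```python
def detect_chat_template(template: str) -> str:
    """Classify a GGUF tokenizer.chat_template string by the
    control tokens it emits. Hypothetical heuristic sketch."""
    if "<|im_start|>" in template:         # ChatML: SmolLM2, Qwen 2.5, DeepSeek R1
        return "ChatML"
    if "<|start_header_id|>" in template:  # Llama 3 header markers
        return "Llama 3"
    if "<start_of_turn>" in template:      # Gemma turn markers
        return "Gemma"
    if "[INST]" in template:               # Mistral instruction brackets
        return "Mistral"
    return "unknown"
```

Because every family above uses distinct, unambiguous control tokens, a few substring checks are enough; no Jinja evaluation is required just to pick the template family.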
## Running your own benchmarks
```bash
# Quick test
./examples/bench.sh /path/to/model.gguf

# Or manually
make
./inference-x /path/to/model.gguf -p "The capital of France is" -n 64
```
We welcome benchmark contributions from different hardware. Submit your results via pull request.