inference-x/BENCHMARKS.md
Salka Elmadani ec36668cf5 Inference-X v1.0 — Universal AI Inference Engine
Better output from the same model. Fused computation, adaptive precision,
surgical expert loading. 305 KB, 19 backends, zero dependencies.

https://inference-x.com
2026-02-23 07:10:47 +00:00


# Benchmarks
Real benchmark results from Inference-X running on commodity hardware. No cherry-picking, no warm cache, no tricks.
## AMD EPYC (AVX2+FMA)
**Server:** AMD EPYC Rome | 17 GB RAM | 6 cores | AVX2+FMA
**Binary:** 305 KB | Compiled with `-O3 -march=native`
**Date:** February 2026
| Model | Parameters | Quantization | Architecture | tok/s | Prefill |
|-------|-----------|-------------|--------------|-------|---------|
| SmolLM2 | 135M | Q8_0 | LLAMA | **130.23** | 87 ms |
| Llama 3.2 | 3B | Q4_K_M | LLAMA | **3.82** | 3.8 s |
| Qwen 2.5 | 3B | Q4_K_M | QWEN2 | **3.85** | 16.5 s |
| Qwen 2.5 | 7B | Q4_K_M | QWEN2 | **1.82** | 39.5 s |
| Mistral 7B v0.3 | 7B | Q4_K_M | LLAMA | **2.06** | 39.2 s |
| Llama 3.1 | 8B | Q4_K_M | LLAMA | **1.75** | 43.0 s |
| DeepSeek R1 Qwen | 7B | Q4_K_M | QWEN2 | **1.80** | 38.2 s |
| Gemma 2 | 9B | Q4_K_M | GEMMA2 | **1.28** | 55.5 s |
| DeepSeek R1 Qwen | 14B | Q4_K_M | QWEN2 | **0.97** | 74.1 s |
**9/10 models passing.** All benchmarks from cold start. No caching. CPU-only.
### What this means
These are CPU-only numbers on a €20/month server. No GPU. The same binary, unchanged, scales from 135M to 14B parameters. The protocol doesn't special-case any model; it reads what the model file describes.
### Chat template auto-detection
Every model above was run with zero manual configuration. The engine reads the GGUF metadata and selects the correct chat template automatically:
| Template | Models |
|----------|--------|
| ChatML | SmolLM2, Qwen 2.5 (all), DeepSeek R1 |
| Llama 3 | Llama 3.2, Llama 3.1 |
| Mistral | Mistral 7B |
| Gemma | Gemma 2 |
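The mapping above can be sketched as a simple marker-based lookup. This is an illustrative sketch only: the real engine reads GGUF metadata (e.g. the embedded chat template), and the `detect_template` function and its marker strings here are assumptions, not Inference-X internals.

```bash
#!/bin/sh
# Illustrative sketch: infer a chat-template family from marker tokens that
# typically appear in a model's template string. The function name and the
# marker-to-family mapping are assumptions for demonstration purposes.
detect_template() {
  case "$1" in
    *'<|im_start|>'*)        echo "ChatML" ;;   # SmolLM2, Qwen 2.5, DeepSeek R1
    *'<|start_header_id|>'*) echo "Llama 3" ;;  # Llama 3.1 / 3.2
    *'[INST]'*)              echo "Mistral" ;;  # Mistral 7B
    *'<start_of_turn>'*)     echo "Gemma" ;;    # Gemma 2
    *)                       echo "unknown" ;;
  esac
}

detect_template '<|im_start|>user hello<|im_end|>'   # prints: ChatML
```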
## Running your own benchmarks
```bash
# Quick test
./examples/bench.sh /path/to/model.gguf

# Or manually
make
./inference-x /path/to/model.gguf -p "The capital of France is" -n 64
```
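If you time a run yourself, tok/s is just tokens generated divided by elapsed decode time. A minimal helper for that arithmetic (the function name is hypothetical; inference-x's own output format is not assumed here):

```bash
#!/bin/sh
# Hypothetical helper: compute tok/s from a token count and elapsed seconds.
tok_per_sec() {
  # $1 = tokens generated, $2 = elapsed seconds
  awk -v n="$1" -v s="$2" 'BEGIN { printf "%.2f\n", n / s }'
}

tok_per_sec 64 16.7   # prints: 3.83
```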
We welcome benchmark contributions from different hardware. Submit your results via pull request.