
# Benchmarks

Real benchmark results from Inference-X running on commodity hardware. No cherry-picking, no warm cache, no tricks.

## AMD EPYC (AVX2+FMA)

- Server: AMD EPYC Rome | 17 GB RAM | 6 cores | AVX2+FMA
- Binary: 305 KB | compiled with `-O3 -march=native`
- Date: February 2026

| Model | Parameters | Quantization | Architecture | tok/s | Prefill |
|---|---|---|---|---|---|
| SmolLM2 | 135M | Q8_0 | LLAMA | 130.23 | 87 ms |
| Llama 3.2 | 3B | Q4_K_M | LLAMA | 3.82 | 3.8 s |
| Qwen 2.5 | 3B | Q4_K_M | QWEN2 | 3.85 | 16.5 s |
| Qwen 2.5 | 7B | Q4_K_M | QWEN2 | 1.82 | 39.5 s |
| Mistral 7B v0.3 | 7B | Q4_K_M | LLAMA | 2.06 | 39.2 s |
| Llama 3.1 | 8B | Q4_K_M | LLAMA | 1.75 | 43.0 s |
| DeepSeek R1 Qwen | 7B | Q4_K_M | QWEN2 | 1.80 | 38.2 s |
| Gemma 2 | 9B | Q4_K_M | GEMMA2 | 1.28 | 55.5 s |
| DeepSeek R1 Qwen | 14B | Q4_K_M | QWEN2 | 0.97 | 74.1 s |

9 of 10 tested models pass. All benchmarks were run from a cold start, with no caching, on CPU only.
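The table's two numbers combine into a rough end-to-end latency estimate: total ≈ prefill time + tokens / tok/s. For example, a 64-token completion on Llama 3.1 8B:

```sh
# rough end-to-end latency: prefill plus decode time for 64 tokens,
# using the Llama 3.1 8B row above (43.0 s prefill, 1.75 tok/s)
awk 'BEGIN { printf "%.1f s\n", 43.0 + 64 / 1.75 }'
# prints: 79.6 s
```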

## What this means

These are CPU-only numbers on a €20/month server with no GPU. The same binary, unchanged, scales from 135M to 14B parameters. The engine doesn't special-case models; it reads what the model file describes.
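That self-describing property comes from the GGUF container: every file starts with a 4-byte magic (`GGUF`) followed by a little-endian `uint32` format version, then the metadata. A quick way to see the header layout, building a minimal fake header rather than assuming a model file on disk:

```sh
# a GGUF file begins with the magic "GGUF" and a little-endian uint32 version;
# write a minimal fake header (octal escapes encode version 3) to illustrate
printf 'GGUF\003\000\000\000' > /tmp/fake.gguf
head -c 4 /tmp/fake.gguf && echo     # magic bytes: GGUF
od -An -tu4 -j4 -N4 /tmp/fake.gguf   # version field: 3
```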

## Chat template auto-detection

Every model above was run with zero manual configuration. The engine reads the GGUF metadata and selects the correct chat template automatically:

| Template | Models |
|---|---|
| ChatML | SmolLM2, Qwen 2.5 (all), DeepSeek R1 |
| Llama 3 | Llama 3.2, Llama 3.1 |
| Mistral | Mistral 7B |
| Gemma | Gemma 2 |
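The engine's actual detection code isn't reproduced here, but the idea can be sketched: each template family uses a distinctive marker token, so classifying a model's chat-template string reduces to substring matching. `classify_template` below is a hypothetical helper for illustration, not Inference-X's implementation:

```sh
# sketch: classify a chat-template string by its distinctive marker tokens
# (hypothetical helper; the real engine reads GGUF metadata directly)
classify_template() {
  case "$1" in
    *'<|im_start|>'*)        echo "ChatML"  ;;
    *'<|start_header_id|>'*) echo "Llama 3" ;;
    *'[INST]'*)              echo "Mistral" ;;
    *'<start_of_turn>'*)     echo "Gemma"   ;;
    *)                       echo "unknown" ;;
  esac
}

classify_template '<|im_start|>user'     # prints: ChatML
classify_template '<start_of_turn>user'  # prints: Gemma
```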

## Running your own benchmarks

```sh
# Quick test
./examples/bench.sh /path/to/model.gguf

# Or manually
make
./inference-x /path/to/model.gguf -p "The capital of France is" -n 64
```
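If you time a run yourself, decode throughput is simply tokens divided by wall-clock decode time. The numbers below are placeholders, not measured results:

```sh
# compute decode throughput from a token count and wall-clock time
# (placeholder values for illustration, not a measured run)
tokens=64
seconds=36.6
awk -v t="$tokens" -v s="$seconds" 'BEGIN { printf "%.2f tok/s\n", t / s }'
# prints: 1.75 tok/s
```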

We welcome benchmark contributions from different hardware. Submit your results via pull request.