
# Benchmarks

Real benchmark results from Inference-X running on commodity hardware. No cherry-picking, no warm cache, no tricks.

## AMD EPYC (AVX2+FMA)

- Server: AMD EPYC Rome | 17 GB RAM | 6 cores | AVX2+FMA
- Binary: 305 KB | compiled with `-O3 -march=native`
- Date: February 2026

| Model | Parameters | Quantization | Architecture | tok/s | Prefill |
|---|---|---|---|---|---|
| SmolLM2 | 135M | Q8_0 | LLAMA | 130.23 | 87 ms |
| Llama 3.2 | 3B | Q4_K_M | LLAMA | 3.82 | 3.8 s |
| Qwen 2.5 | 3B | Q4_K_M | QWEN2 | 3.85 | 16.5 s |
| Qwen 2.5 | 7B | Q4_K_M | QWEN2 | 1.82 | 39.5 s |
| Mistral 7B v0.3 | 7B | Q4_K_M | LLAMA | 2.06 | 39.2 s |
| Llama 3.1 | 8B | Q4_K_M | LLAMA | 1.75 | 43.0 s |
| DeepSeek R1 Qwen | 7B | Q4_K_M | QWEN2 | 1.80 | 38.2 s |
| Gemma 2 | 9B | Q4_K_M | GEMMA2 | 1.28 | 55.5 s |
| DeepSeek R1 Qwen | 14B | Q4_K_M | QWEN2 | 0.97 | 74.1 s |

9 of 10 tested models pass. All benchmarks were run from a cold start, with no caching, on CPU only.
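The table's two numbers combine into a rough end-to-end latency estimate: total ≈ prefill time + tokens / tok/s. For example, a 64-token completion on Llama 3.1 8B:

```sh
# rough end-to-end latency: prefill plus decode time for 64 tokens,
# using the Llama 3.1 8B row above (43.0 s prefill, 1.75 tok/s)
awk 'BEGIN { printf "%.1f s\n", 43.0 + 64 / 1.75 }'
# prints: 79.6 s
```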

## What this means

These are CPU-only numbers on a €20/month server with no GPU. The same binary, unchanged, scales from 135M to 14B parameters. The engine doesn't special-case models; it reads what the model file describes.
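That self-describing property comes from the GGUF container: every file starts with a 4-byte magic (`GGUF`) followed by a little-endian `uint32` format version, then the metadata. A quick way to see the header layout, building a minimal fake header rather than assuming a model file on disk:

```sh
# a GGUF file begins with the magic "GGUF" and a little-endian uint32 version;
# write a minimal fake header (octal escapes encode version 3) to illustrate
printf 'GGUF\003\000\000\000' > /tmp/fake.gguf
head -c 4 /tmp/fake.gguf && echo     # magic bytes: GGUF
od -An -tu4 -j4 -N4 /tmp/fake.gguf   # version field: 3
```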

## Chat template auto-detection

Every model above was run with zero manual configuration. The engine reads the GGUF metadata and selects the correct chat template automatically:

| Template | Models |
|---|---|
| ChatML | SmolLM2, Qwen 2.5 (all), DeepSeek R1 |
| Llama 3 | Llama 3.2, Llama 3.1 |
| Mistral | Mistral 7B |
| Gemma | Gemma 2 |
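The engine's actual detection code isn't reproduced here, but the idea can be sketched: each template family uses a distinctive marker token, so classifying a model's chat-template string reduces to substring matching. `classify_template` below is a hypothetical helper for illustration, not Inference-X's implementation:

```sh
# sketch: classify a chat-template string by its distinctive marker tokens
# (hypothetical helper; the real engine reads GGUF metadata directly)
classify_template() {
  case "$1" in
    *'<|im_start|>'*)        echo "ChatML"  ;;
    *'<|start_header_id|>'*) echo "Llama 3" ;;
    *'[INST]'*)              echo "Mistral" ;;
    *'<start_of_turn>'*)     echo "Gemma"   ;;
    *)                       echo "unknown" ;;
  esac
}

classify_template '<|im_start|>user'     # prints: ChatML
classify_template '<start_of_turn>user'  # prints: Gemma
```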

## Running your own benchmarks

```sh
# Quick test
./examples/bench.sh /path/to/model.gguf

# Or manually
make
./inference-x /path/to/model.gguf -p "The capital of France is" -n 64
```
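If you time a run yourself, decode throughput is simply tokens divided by wall-clock decode time. The numbers below are placeholders, not measured results:

```sh
# compute decode throughput from a token count and wall-clock time
# (placeholder values for illustration, not a measured run)
tokens=64
seconds=36.6
awk -v t="$tokens" -v s="$seconds" 'BEGIN { printf "%.2f tok/s\n", t / s }'
# prints: 1.75 tok/s
```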

We welcome benchmark contributions from different hardware. Submit your results via pull request.