# Inference-X
[![Build](https://github.com/ElmadaniS/inference-x/actions/workflows/build.yml/badge.svg)](https://github.com/ElmadaniS/inference-x/actions/workflows/build.yml)
[![Release](https://img.shields.io/github/v/release/ElmadaniS/inference-x)](https://github.com/ElmadaniS/inference-x/releases)
[![License](https://img.shields.io/badge/license-BSL--1.1-blue)](LICENSE)
[![Binary Size](https://img.shields.io/badge/binary-305%20KB-brightgreen)](TECHNOLOGY.md)
[![Backends](https://img.shields.io/badge/backends-19-orange)](ARCHITECTURE.md)
**Better output from the same model.**
One binary routes any AI model to any hardware — from a microcontroller to a datacenter. Fused computation, adaptive precision, surgical expert loading. No dependencies. No framework. No vendor lock-in.
305 KB. 19 hardware backends. Any model. Any scale.
Built in Morocco by [Salka Elmadani](https://x.com/ElmadaniSa13111).
> *In the Anti-Atlas, our ancestors built khettaras — underground water channels that deliver pure water to villages without pumps, without electricity, without filtration. The water arrives cleaner than any treated supply because the path itself is the filter. Inference-X works the same way: the shortest path produces the cleanest signal.*
**[Website](https://inference-x.com)** · **[How it works](TECHNOLOGY.md)** · **[Benchmarks](BENCHMARKS.md)** · **[Vision](VISION.md)** · **[Sponsor](https://github.com/sponsors/ElmadaniS)**
---
## What makes it different
Most inference engines add layers between the model and the hardware: frameworks, runtime allocators, intermediate buffers, uniform precision pipelines. Each layer adds memory traffic and extra rounding steps that degrade the model's original signal.
Inference-X removes those layers.
**Fused computation** — Dequantization and matrix multiply happen in a single instruction loop. No intermediate FP32 buffer. Fewer rounding operations mean output closer to the model's theoretical FP32 maximum.
**Adaptive precision** — Each query is analyzed before inference. Simple questions run with compressed early layers and full precision in the decision layers. Complex reasoning gets full precision throughout. The model adapts its depth to the question — same file, same binary, different computational path.
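A toy version of that routing decision might look like the following. The heuristic, threshold, and names here are hypothetical; the engine's actual per-layer analysis is internal. It measures Shannon entropy over the prompt's token histogram and picks a precision plan from it.

```cpp
#include <cmath>
#include <map>
#include <vector>

// Toy query analysis: Shannon entropy (bits) over the prompt's token
// histogram. (Hypothetical heuristic for illustration only.)
double token_entropy(const std::vector<int>& tokens) {
    std::map<int, int> counts;
    for (int t : tokens) counts[t]++;
    double h = 0.0, n = double(tokens.size());
    for (const auto& kv : counts) {
        double p = kv.second / n;
        h -= p * std::log2(p);
    }
    return h;
}

// Low-entropy (repetitive, simple) prompts run early layers compressed;
// high-entropy prompts keep full precision throughout.
enum class Plan { CompressedEarly, FullPrecision };

Plan choose_plan(const std::vector<int>& tokens, double threshold = 4.0) {
    return token_entropy(tokens) < threshold ? Plan::CompressedEarly
                                             : Plan::FullPrecision;
}
```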
**Surgical expert loading** — For Mixture-of-Experts models, only active experts exist in memory. Inactive experts are evicted at the OS level. Result: a 1-trillion-parameter model runs on 17 GB of RAM. The signal path contains only what contributes to the current token.
The result: **the same model produces higher-fidelity output through a cleaner computation path.** Or equivalently: a smaller model through Inference-X can match a larger model through a conventional engine.
→ [Full technical explanation](TECHNOLOGY.md)
---
## What it is
TCP/IP routes data packets to any network, any hardware, any destination. The protocol doesn't care about the wire.
Inference-X routes intelligence to any silicon. The protocol doesn't care about the chip.
One function call enters `kernel_dispatch.h`. On the other side: CPU, GPU, TPU, LPU, IPU, FPGA, DSP, or WSE. The caller doesn't know. Doesn't need to. The model runs. The answer comes back.
```
Model (any GGUF) → Inference-X (305 KB) → Silicon (any of 19 backends) → Response
```
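The dispatch layer can be pictured as a table of interchangeable kernels behind one entry point. The names and layout below are hypothetical, not the actual contents of `kernel_dispatch.h`: every backend supplies the same signature, the dispatcher selects one at startup, and the caller never learns which.

```cpp
// One shared kernel signature for every backend (hypothetical sketch).
using MatVecFn = void (*)(const float* w, const float* x, float* y,
                          int rows, int cols);

// Portable scalar fallback; AVX2, NEON, CUDA, etc. would each register
// their own implementation with the same signature.
static void matvec_scalar(const float* w, const float* x, float* y,
                          int rows, int cols) {
    for (int r = 0; r < rows; ++r) {
        float acc = 0.f;
        for (int c = 0; c < cols; ++c) acc += w[r * cols + c] * x[c];
        y[r] = acc;
    }
}

// Chosen once at startup based on detected hardware.
static MatVecFn g_matvec = matvec_scalar;

// The single entry point: the caller is backend-agnostic.
void ix_matvec(const float* w, const float* x, float* y, int rows, int cols) {
    g_matvec(w, x, y, rows, cols);
}
```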
The model describes itself. The engine reads the description. The engine never assumes.
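Self-description starts in the file's first 24 bytes: per the GGUF spec, the header declares a magic, a format version, a tensor count, and a metadata key/value count, all little-endian. A minimal reader (parsing of the full KV section omitted) looks like this:

```cpp
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

// GGUF header: "GGUF" magic, uint32 version, uint64 tensor count,
// uint64 metadata KV count, all little-endian (24 bytes total).
struct GGUFHeader {
    uint32_t version;
    uint64_t tensor_count;
    uint64_t metadata_kv_count;
};

GGUFHeader read_gguf_header(const std::vector<uint8_t>& bytes) {
    if (bytes.size() < 24 || std::memcmp(bytes.data(), "GGUF", 4) != 0)
        throw std::runtime_error("not a GGUF file");
    GGUFHeader h;
    std::memcpy(&h.version, bytes.data() + 4, 4);
    std::memcpy(&h.tensor_count, bytes.data() + 8, 8);
    std::memcpy(&h.metadata_kv_count, bytes.data() + 16, 8);
    return h;
}
```

Everything architecture-specific (layer count, head count, chat template) lives in the metadata section that follows, which is why the engine never has to assume.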
## Quick Start
```bash
git clone https://github.com/ElmadaniS/inference-x
cd inference-x
make
# Download a model (any GGUF from Hugging Face)
./inference-x model.gguf -p "Hello, world"
```
That's it. One binary. One command. Any model.
## Why it matters
Running a model today requires choosing a stack: CUDA for NVIDIA, ROCm for AMD, Metal for Apple, TensorRT for serving, vLLM for throughput, Ollama for local. Each stack locks you to a vendor, a way of thinking, and adds its own computational overhead between the model and the result.
Inference-X eliminates the stack. There is no stack. There's a model file, a binary, and your hardware — whatever it is.
```
GPU cluster:  1T parameters on 8× H100       ~5.6 kW   $200,000+/year
Inference-X:  1T parameters on 256 GB RAM    ~300 W    €4,800/year
Same model. Cleaner output. 97% less cost.
```
This isn't about replacing GPUs. It's about making the choice of silicon irrelevant to the act of thinking — and getting *better* results from the silicon you already have.
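The cost claim above is simple arithmetic, treating EUR and USD at rough parity for the estimate:

```cpp
// Fraction saved when replacing one yearly cost with another.
double savings_fraction(double conventional_cost, double ix_cost) {
    return 1.0 - ix_cost / conventional_cost;
}
// savings_fraction(200000.0, 4800.0) == 0.976, i.e. the "97% less cost"
// figure above, rounded down.
```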
## Who is this for
**Every organization that runs AI models — or wants to.**
| Sector | Problem | What IX does |
|--------|---------|-------------|
| **Healthcare** | Patient data can't leave the hospital. Cloud inference = compliance risk. | Air-gapped inference on hospital hardware. Zero network calls. HIPAA/GDPR by architecture. |
| **Defense & Government** | Sovereign AI requires sovereign infrastructure. | Runs on government-owned hardware. No vendor dependency. No telemetry. Auditable source. |
| **Finance** | Trading models need low latency and full auditability. | On-premise inference, deterministic output, no external calls. |
| **Telecom** | Edge inference at cell towers for real-time processing. | 305 KB binary deploys on edge hardware. Adaptive precision matches available power. |
| **Automotive** | In-vehicle AI needs minimal footprint and guaranteed response. | Runs on ARM/Snapdragon. No framework overhead. Fits in L2 cache. |
| **Startups** | GPU costs eat runway. $200K/year for inference infrastructure. | Same model quality at 97% lower cost. CPU-only. Scale when you're ready. |
| **Enterprise** | Vendor lock-in across NVIDIA, AMD, Intel, cloud providers. | 19 backends. One binary. Switch hardware without changing code. |
| **Research & Education** | Limited compute budgets. Students can't afford H100s. | Free under BSL-1.1. Run 14B models on a €20/month server. |
| **Embedded / IoT** | AI on microcontrollers with KB-level memory budgets. | Compiles for ESP32. Surgical loading keeps memory minimal. |
| **Cloud Providers** | Offering inference services at competitive margins. | Higher output quality per compute dollar. 19 backends = any customer hardware. |
Inference-X has zero friction with existing infrastructure. It doesn't replace your hardware — it makes your hardware work better.
## Get started
```bash
# Build (30 seconds)
git clone https://github.com/ElmadaniS/inference-x.git
cd inference-x && make -j$(nproc)
# Chat with any GGUF model
./inference-x model.gguf -i
# Or start a web interface
python3 web/ix_server.py
# Or run as an OpenAI-compatible API
./inference-x model.gguf --serve --port 8080
```
Three commands. No dependencies. No Docker. No Python packages. No GPU drivers. Just `make` and run.
## Benchmarks
Real numbers on a €20/month AMD EPYC server. CPU-only. No GPU. Cold start.
| Model | Params | Quant | tok/s |
|-------|--------|-------|-------|
| SmolLM2 | 135M | Q8_0 | **130.23** |
| Llama 3.2 | 3B | Q4_K_M | **3.82** |
| Qwen 2.5 | 3B | Q4_K_M | **3.85** |
| Mistral 7B | 7B | Q4_K_M | **2.06** |
| Qwen 2.5 | 7B | Q4_K_M | **1.82** |
| Llama 3.1 | 8B | Q4_K_M | **1.75** |
| Gemma 2 | 9B | Q4_K_M | **1.28** |
| DS-R1 Qwen | 14B | Q4_K_M | **0.97** |
9/10 architectures passing. Chat templates auto-detected. Zero manual configuration.
→ [Full benchmark details](BENCHMARKS.md)
## Supported Hardware
| Backend | Silicon | Status |
|---------|---------|--------|
| CPU (AVX2/AVX-512) | Intel, AMD | ✅ Production |
| CUDA | NVIDIA GPU | ✅ Production |
| ROCm | AMD GPU | ✅ Production |
| Metal | Apple Silicon | ✅ Production |
| Vulkan | Cross-platform GPU | ✅ Production |
| ARM NEON | ARM processors | ✅ Production |
| Snapdragon | Qualcomm (GPU+DSP+NEON) | 🔧 Ready |
| Hexagon HVX | Qualcomm DSP | 🔧 Ready |
| OpenCL | Cross-platform | 🔧 Ready |
| WebGPU | Browser | 🔧 Ready |
| TPU | Google | 🔧 Ready |
| Inferentia | AWS | 🔧 Ready |
| Gaudi | Intel HPU | 🔧 Ready |
| Maia | Microsoft | 🔧 Ready |
| SambaNova RDU | SambaNova | 🔧 Ready |
| Graphcore IPU | Graphcore | 🔧 Ready |
| Groq LPU | Groq | 🔧 Ready |
| FPGA (Xilinx) | Xilinx | 🔧 Ready |
| Cerebras WSE | Cerebras | 🔧 Ready |
## Architecture
```
infer.cpp ← Entry point (571 lines)
├── runtime/
│ ├── gguf.h ← GGUF parser + config extraction
│ ├── tokenizer.h ← Tokenizer with byte-level BPE
│ ├── transformer_v6.h ← Universal forward pass
│ ├── attention.h ← GQA attention
│ ├── moe_mla.h ← MoE + MLA (DeepSeek V3)
│ ├── gemm.h ← Fused GEMV kernels
│ ├── kernels.h ← RMS norm, softmax, RoPE, SiLU
│ ├── kernel_dispatch.h ← Hardware routing layer
│ ├── server.h ← OpenAI-compatible API server
│ └── ...
├── core/
│ ├── iq_tables.h ← IQ quantization lookup tables
│ └── z_core.h ← Mathematical foundation
└── backends/
└── q4_kernels/ ← Per-hardware kernel implementations
```
One forward pass handles: dense transformers, Mixture-of-Experts, Multi-head Latent Attention, grouped-query attention, fused QKV tensors, and every combination.
→ [Detailed architecture](ARCHITECTURE.md) · [How the technology works](TECHNOLOGY.md)
## Features
- **Higher fidelity output** — Fused dequant+dot kernels eliminate intermediate buffers. Fewer rounding operations = output closer to the model's FP32 theoretical maximum.
- **Adaptive precision** — Shannon entropy analysis determines per-layer quantization. Simple queries run faster. Complex reasoning gets full depth. The model breathes.
- **Surgical expert loading** — MoE models load only active experts. 48× I/O reduction. Clean signal path with zero interference from unused parameters.
- **Universal model support** — LLAMA, QWEN2, PHI3, GEMMA2, DEEPSEEK, KIMI. Dense and MoE. The model changes, the protocol doesn't.
- **23 native quantization formats** — Q2_K through FP32. No format conversion. The engine speaks the model's native dialect.
- **19 hardware backends** — CPU, GPU, TPU, LPU, IPU, FPGA, DSP, WSE. One binary, every silicon.
- **305 KB binary** — Fits in L2 cache. The engine is invisible. You hear the model, not the framework.
- **Auto chat template** — ChatML, Llama 3, Mistral, Gemma, Phi-3, Kimi. Detected from GGUF metadata. Zero configuration.
- **OpenAI-compatible API** — `./inference-x model.gguf --serve` gives you `/v1/chat/completions`. Drop-in replacement.
- **Web interface** — Built-in chat UI. `python3 web/ix_server.py` and open your browser.
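Auto-detection of the chat template works because GGUF models ship their prompt format as a Jinja string under the metadata key `tokenizer.chat_template`, and a few stable marker tokens are enough to classify it. The markers below are the commonly used ones; the engine's real detection logic is internal and may differ.

```cpp
#include <string>

// Classify a chat template by its characteristic marker tokens.
// (Illustrative sketch; not the engine's actual detection code.)
std::string detect_chat_format(const std::string& chat_template) {
    if (chat_template.find("<|im_start|>") != std::string::npos)        return "chatml";
    if (chat_template.find("<|start_header_id|>") != std::string::npos) return "llama3";
    if (chat_template.find("[INST]") != std::string::npos)              return "mistral";
    if (chat_template.find("<start_of_turn>") != std::string::npos)     return "gemma";
    return "unknown";
}
```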
## API Server
```bash
./inference-x model.gguf --serve --port 8080
```
Drop-in replacement for OpenAI:
```python
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")
response = client.chat.completions.create(
    model="local",
    messages=[{"role": "user", "content": "Hello"}],
)
```
## Contributing
We welcome contributions:
- **Backends** — Port kernel implementations to new hardware
- **Models** — Add new architectures and quantization formats
- **Benchmarks** — Run benchmarks on diverse hardware
- **Documentation** — Tutorials, guides, translations
See [CONTRIBUTING.md](CONTRIBUTING.md) for details.
## License
[Business Source License 1.1](LICENSE) — Free for individuals, researchers, and small teams. Commercial use requires a license. Converts to open source in 2030.
See [NOTICE](NOTICE) for full terms.
## Acknowledgments
- **[Infomaniak](https://infomaniak.com)** — Swiss hosting partner
- **[Hetzner](https://hetzner.com)** — High-performance compute
---
<p align="center">
<a href="https://inference-x.com">inference-x.com</a> ·
<a href="https://x.com/ElmadaniSa13111">@ElmadaniSa13111</a> ·
<a href="https://github.com/sponsors/ElmadaniS">Sponsor</a>
<br><br>
<em>Built in Morocco for the world.</em>
</p>