# Inference-X
[![Build](https://github.com/ElmadaniS/inference-x/actions/workflows/build.yml/badge.svg)](https://github.com/ElmadaniS/inference-x/actions/workflows/build.yml)
[![Release](https://img.shields.io/github/v/release/ElmadaniS/inference-x)](https://github.com/ElmadaniS/inference-x/releases)
[![License](https://img.shields.io/badge/license-BSL--1.1-blue)](LICENSE)
[![Binary Size](https://img.shields.io/badge/binary-305%20KB-brightgreen)](TECHNOLOGY.md)
[![Backends](https://img.shields.io/badge/backends-19-orange)](ARCHITECTURE.md)
**Run AI on your own computer. Private. Free. No internet.**
Inference-X is a tiny file (305 KB) that lets any computer run AI models locally. It works on old laptops, phones, Raspberry Pi, and datacenters — same file, no setup. Your questions stay on your machine. Nobody sees them.
**[Website](https://inference-x.com)** · **[How it works](TECHNOLOGY.md)** · **[Benchmarks](BENCHMARKS.md)** · **[Vision](VISION.md)** · **[Sponsor](https://github.com/sponsors/ElmadaniS)**
---
## Start in 30 seconds
```bash
git clone https://github.com/ElmadaniS/inference-x
cd inference-x && make
./inference-x model.gguf
```
That's it. Download a `.gguf` model from [HuggingFace](https://huggingface.co/models?sort=trending&search=gguf), run the command, talk to AI. No account. No API key. No internet.
Add `--serve 8080` to get a web interface at `localhost:8080`.
---
## What can your computer run?
| Your RAM | Models you can run | What it can do |
|---|---|---|
| **2 GB** | SmolLM2 135M | Simple assistant, quick answers |
| **4 GB** | Phi-3 Mini 3.8B, Llama 3.2 3B | Smart conversations, code help, translations |
| **8 GB** | Mistral 7B, Llama 3.1 8B | Creative writing, analysis, reasoning |
| **16 GB** | DeepSeek R1 14B | Advanced reasoning, expert-level answers |
| **32 GB** | Qwen 2.5 32B | Professional-grade AI |
| **64 GB** | Llama 3.1 70B, DeepSeek V3 MoE | Frontier performance, locally |
Every model runs privately, offline, with no subscription.
---
## Why local AI matters
When you use AI online, your words travel to a server in another country. Someone can read them. You pay per word. The service can shut down.
With Inference-X, your questions stay on your desk. The answer is computed by your own processor. Nothing leaves. Nothing is stored. It works without internet. It's free forever.
---
## What makes it different
Most inference engines add layers between the model and the hardware: frameworks, runtime allocators, intermediate buffers. Each layer degrades the model's signal.
Inference-X removes those layers.
**Fused computation** — Dequantization and matrix multiply happen in a single instruction loop. No intermediate FP32 buffer. Output closer to the model's theoretical maximum.
**Adaptive precision** — Each query is analyzed before inference. Simple questions get compressed early layers and full-precision decision layers. Complex reasoning gets full precision throughout.
**Surgical expert loading** — For Mixture-of-Experts models, only active experts exist in memory. A 1-trillion-parameter model runs on 64 GB of RAM.
The result: **the same model produces better output through a cleaner computation path.** A smaller model through Inference-X can match a larger model through a conventional engine.
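The fused-computation idea can be illustrated with a toy sketch (pure Python, not the actual kernels in `gemm.h`): each quantized weight is dequantized and consumed in the same loop iteration, so no full-precision weight buffer is ever materialized.

```python
def fused_dequant_dot(qweights, scale, activations):
    """Dequantize-and-multiply in a single loop: each quantized
    weight is scaled and accumulated immediately, so no intermediate
    FP32 weight buffer is allocated."""
    acc = 0.0
    for q, a in zip(qweights, activations):
        acc += (q * scale) * a  # dequantize and accumulate in one step
    return acc

def two_pass_dot(qweights, scale, activations):
    """Conventional path: first materialize the full dequantized row,
    then run the dot product over the intermediate buffer."""
    deq = [q * scale for q in qweights]  # intermediate FP32 buffer
    return sum(d * a for d, a in zip(deq, activations))

# Both paths compute the same value; the fused one skips the buffer.
q = [3, -2, 7, 0]          # toy quantized weights
x = [0.5, 1.0, -1.0, 2.0]  # activations
assert abs(fused_dequant_dot(q, 0.1, x) - two_pass_dot(q, 0.1, x)) < 1e-12
```

In the real kernels the win is memory traffic and cache locality, not just the skipped allocation; this sketch only shows the structural difference between the two paths.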
→ [Full technical explanation](TECHNOLOGY.md)
---
## How it works
TCP/IP routes data packets to any network. Inference-X routes intelligence to any silicon.
One function call enters `kernel_dispatch.h`. On the other side: CPU, GPU, TPU, LPU, IPU, FPGA, DSP, or WSE. The model runs. The answer comes back.
```
Model (any GGUF) → Inference-X (305 KB) → Silicon (any of 19 backends) → Response
```
```
Architecture:
infer.cpp (570 lines) — Orchestrator. Chat templates. Server mode.
transformer_v6.h — Forward pass. Dense + MoE + MLA unified.
kernel_dispatch.h — Routes GEMM to the right silicon.
moe_mla.h — Expert selection. Prefetch. Eviction.
gemm.h — Fused dequant × matmul kernels.
backends.h — 19 hardware targets. One interface.
```
12,571 lines of C++17. 6 architectures (Llama, Qwen2, Gemma2, Phi, DeepSeek MoE, MLA). 23 quantization formats. One binary.
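The single-entry-point dispatch can be sketched in Python (the real `kernel_dispatch.h` is C++17, and its detection logic is not shown here; all names below are illustrative):

```python
# Illustrative sketch of single-entry-point GEMM dispatch.
# BACKENDS maps a backend name to (availability check, GEMM kernel).
BACKENDS = {}

def register(name, is_available, gemm_fn):
    BACKENDS[name] = (is_available, gemm_fn)

def kernel_dispatch(a, b):
    """One call site: run GEMM on the first available backend."""
    for name, (is_available, gemm_fn) in BACKENDS.items():
        if is_available():
            return gemm_fn(a, b)
    raise RuntimeError("no backend available")

def cpu_gemm(a, b):
    """Portable fallback: naive row-by-column matrix multiply."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner))
             for j in range(cols)] for i in range(rows)]

register("cpu", lambda: True, cpu_gemm)  # CPU is always available
result = kernel_dispatch([[1, 2]], [[3], [4]])  # 1*3 + 2*4 = [[11]]
```

The caller never names a backend; registering a new one (CUDA, Metal, TPU, ...) changes nothing at the call site, which is the property the "one function call enters `kernel_dispatch.h`" design is claiming.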
---
## Benchmarks
AMD EPYC Rome · 17 GB RAM · 6 cores · CPU-only · €20/month server
| Model | Params | Quant | tok/s | Prefill |
|---|---|---|---|---|
| SmolLM2 | 135M | Q8_0 | **130.23** | 87 ms |
| Qwen 2.5 | 3B | Q4_K_M | **3.85** | 16.5 s |
| Llama 3.2 | 3B | Q4_K_M | **3.82** | 3.8 s |
| Mistral 7B | 7B | Q4_K_M | **2.06** | 39.2 s |
| Llama 3.1 | 8B | Q4_K_M | **1.75** | 43.0 s |
| DeepSeek R1 | 14B | Q4_K_M | **0.97** | 74.1 s |
9 models · 4 architectures · Same binary · Zero configuration
→ [Full benchmarks](BENCHMARKS.md)
---
## Supported Hardware
| Backend | Target | Status |
|---|---|---|
| CPU AVX2/512 | Intel, AMD | ✅ Production |
| CUDA | NVIDIA GPU | ✅ Production |
| ROCm | AMD GPU | ✅ Production |
| Metal | Apple Silicon | ✅ Production |
| Vulkan | Cross-platform | ✅ Production |
| ARM NEON | ARM (Pi, phones) | ✅ Production |
| Snapdragon | Qualcomm | 🔶 Ready |
| Hexagon HVX | Qualcomm DSP | 🔶 Ready |
| TPU | Google | 🔶 Ready |
| Inferentia | AWS | 🔶 Ready |
| Gaudi | Intel HPU | 🔶 Ready |
| Maia | Microsoft | 🔶 Ready |
| SambaNova RDU | SambaNova | 🔶 Ready |
| Graphcore IPU | Graphcore | 🔶 Ready |
| Groq LPU | Groq | 🔶 Ready |
| Cerebras WSE | 850K cores | 🔶 Ready |
| FPGA | Xilinx | 🔶 Ready |
| WebGPU | Browser | 🔶 Ready |
| OpenCL | Universal | 🔶 Ready |
The Makefile detects your hardware. You don't configure it.
---
## API Server
Start with `--serve 8080`. OpenAI-compatible API. Any client library works.
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")
resp = client.chat.completions.create(
    model="local",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in resp:
    print(chunk.choices[0].delta.content or "", end="")
```
Endpoints: `POST /v1/chat/completions` · `POST /v1/completions` · `GET /v1/models` · `GET /health`
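OpenAI-compatible streaming responses arrive as server-sent events: each line carries `data: {json}` with a `delta` per chunk, terminated by a `data: [DONE]` sentinel. A minimal parser for that wire format (the sample chunks below are made up for illustration):

```python
import json

def parse_sse_stream(lines):
    """Yield the text delta from each OpenAI-style streaming chunk,
    stopping at the [DONE] sentinel."""
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:
            yield delta["content"]

# Example chunks in the OpenAI streaming format (contents made up).
sample = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    'data: [DONE]',
]
print("".join(parse_sse_stream(sample)))  # → Hello
```

In practice the `openai` client library does this parsing for you; the sketch only shows what any client receives over the wire.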
---
## Features
- **Universal GGUF** — Any model, any architecture, auto-detected from metadata
- **Chat templates** — 7 formats auto-detected (Llama, ChatML, Alpaca, Gemma, Phi, Mistral, DeepSeek)
- **Multi-EOS** — Correct stop tokens for every architecture
- **Server mode** — OpenAI-compatible API, streaming, health check
- **Air-gapped** — No network calls during inference. No telemetry. Ever.
- **Zero configuration** — Download a model, run it. Templates, tokens, architecture: auto.
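Template auto-detection amounts to a lookup from the model's declared format to a prompt renderer. The sketch below is a hypothetical simplification (the real engine reads the template from GGUF metadata; only ChatML and Alpaca are shown, and the fallback behavior is assumed):

```python
# Hypothetical sketch of chat-template selection; the real engine
# auto-detects the format from GGUF metadata. Mapping is illustrative.
TEMPLATES = {
    "chatml": lambda m: f"<|im_start|>user\n{m}<|im_end|>\n<|im_start|>assistant\n",
    "alpaca": lambda m: f"### Instruction:\n{m}\n\n### Response:\n",
}

def format_prompt(template_name, user_message):
    """Render a user message with the named template, or pass it
    through unchanged when the template is unknown (assumed fallback)."""
    render = TEMPLATES.get(template_name)
    if render is None:
        return user_message
    return render(user_message)

print(format_prompt("alpaca", "Hi"))
```

Picking the wrong template (or the wrong EOS token) is one of the most common causes of garbled local-model output, which is why the README calls out both auto-detected templates and multi-EOS handling.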
---
## Contributing
See [CONTRIBUTING.md](CONTRIBUTING.md). Run `make` to build. Run `make test` to test. Submit a PR.
We welcome contributions from everyone, regardless of experience level. If you're new to open source, look for issues tagged `good first issue`.
---
## License
[BSL-1.1](LICENSE) — Business Source License
**Free for**: individuals, researchers, students, open-source projects, organizations under $1M revenue.
**Change date**: February 12, 2030 → [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
After 2030, everything becomes fully open source. Patents remain protected.
---
## Acknowledgments
Built in Morocco for the world by [Salka Elmadani](https://x.com/ElmadaniSa13111).
> *In the Anti-Atlas, our ancestors built khettaras — underground water channels that deliver pure water to villages without pumps, without electricity. The water arrives cleaner than any treated supply because the path itself is the filter. Inference-X works the same way: the shortest path produces the cleanest signal.*
**[Website](https://inference-x.com)** · **[Sponsor](https://github.com/sponsors/ElmadaniS)** · **[Contact](mailto:Elmadani.SALKA@proton.me)**