# Inference-X

[![Build](https://github.com/ElmadaniS/inference-x/actions/workflows/build.yml/badge.svg)](https://github.com/ElmadaniS/inference-x/actions/workflows/build.yml)
[![Release](https://img.shields.io/github/v/release/ElmadaniS/inference-x)](https://github.com/ElmadaniS/inference-x/releases)
[![License](https://img.shields.io/badge/license-BSL--1.1-blue)](LICENSE)
[![Binary Size](https://img.shields.io/badge/binary-305%20KB-brightgreen)](TECHNOLOGY.md)
[![Backends](https://img.shields.io/badge/backends-19-orange)](ARCHITECTURE.md)

**Run AI on your own computer. Private. Free. No internet.**

Inference-X is a tiny file (305 KB) that lets any computer run AI models locally. It works on old laptops, phones, Raspberry Pi, and datacenters — same file, no setup. Your questions stay on your machine. Nobody sees them.

**[Website](https://inference-x.com)** · **[How it works](TECHNOLOGY.md)** · **[Benchmarks](BENCHMARKS.md)** · **[Vision](VISION.md)** · **[Sponsor](https://github.com/sponsors/ElmadaniS)**

---

## Start in 30 seconds

```bash
git clone https://github.com/ElmadaniS/inference-x
cd inference-x && make
./inference-x model.gguf
```

That's it. Download a `.gguf` model from [HuggingFace](https://huggingface.co/models?sort=trending&search=gguf), run the command, talk to AI. No account. No API key. No internet.

Add `--serve 8080` to get a web interface at `localhost:8080`.

---

## What can your computer run?

| Your RAM | Models you can run | What it can do |
|---|---|---|
| **2 GB** | SmolLM2 135M | Simple assistant, quick answers |
| **4 GB** | Phi-3 Mini 3.8B, Llama 3.2 3B | Smart conversations, code help, translations |
| **8 GB** | Mistral 7B, Llama 3.1 8B | Creative writing, analysis, reasoning |
| **16 GB** | DeepSeek R1 14B | Advanced reasoning, expert-level answers |
| **32 GB** | Qwen 2.5 32B | Professional-grade AI |
| **64 GB** | Llama 3.1 70B, DeepSeek V3 MoE | Frontier performance, locally |

Every model runs privately, offline, with no subscription.
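The table above follows from simple arithmetic: a quantized model needs roughly (parameters × bits per weight ÷ 8) bytes for its weights, plus headroom for the KV cache and activations. A rough sketch of the estimate — the 1 GB overhead constant and the ~4.85 bits/weight figure for Q4_K_M are illustrative assumptions, not numbers from this repo:

```python
def model_ram_gb(params_billions: float, bits_per_weight: float,
                 overhead_gb: float = 1.0) -> float:
    """Rough RAM needed to run a quantized model.

    overhead_gb is an illustrative allowance for KV cache and
    activations; real usage grows with context length.
    """
    weight_gb = params_billions * bits_per_weight / 8  # weights only
    return weight_gb + overhead_gb

# Mistral 7B at Q4_K_M (~4.85 bits/weight) fits comfortably in 8 GB:
print(round(model_ram_gb(7, 4.85), 1))  # ≈ 5.2
```

The same arithmetic explains the other rows: a 3B model at 4-bit lands under 3 GB, while a 70B model needs the 64 GB tier.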
---

## Why local AI matters

When you use AI online, your words travel to a server in another country. Someone can read them. You pay per word. The service can shut down.

With Inference-X, your questions stay on your desk. The answer is computed by your own processor. Nothing leaves. Nothing is stored. It works without internet. It's free forever.

---

## What makes it different

Most inference engines add layers between the model and the hardware: frameworks, runtime allocators, intermediate buffers. Each layer degrades the model's signal. Inference-X removes those layers.

**Fused computation** — Dequantization and matrix multiply happen in a single instruction loop. No intermediate FP32 buffer. Output closer to the model's theoretical maximum.

**Adaptive precision** — Each query is analyzed before inference. Simple questions get compressed early layers and full-precision decision layers. Complex reasoning gets full precision throughout.

**Surgical expert loading** — For Mixture-of-Experts models, only active experts exist in memory. A 1-trillion-parameter model runs on 64 GB of RAM.

The result: **the same model produces better output through a cleaner computation path.** A smaller model through Inference-X can match a larger model through a conventional engine.

→ [Full technical explanation](TECHNOLOGY.md)

---

## How it works

TCP/IP routes data packets to any network. Inference-X routes intelligence to any silicon. One function call enters `kernel_dispatch.h`. On the other side: CPU, GPU, TPU, LPU, IPU, FPGA, DSP, or WSE. The model runs. The answer comes back.

```
Model (any GGUF) → Inference-X (305 KB) → Silicon (any of 19 backends) → Response
```

```
Architecture:

infer.cpp (570 lines)  — Orchestrator. Chat templates. Server mode.
transformer_v6.h       — Forward pass. Dense + MoE + MLA unified.
kernel_dispatch.h      — Routes GEMM to the right silicon.
moe_mla.h              — Expert selection. Prefetch. Eviction.
gemm.h                 — Fused dequant × matmul kernels.
backends.h             — 19 hardware targets. One interface.
```

12,571 lines of C++17. 6 architectures (Llama, Qwen2, Gemma2, Phi, DeepSeek MoE, MLA). 23 quantization formats. One binary.

---

## Benchmarks

AMD EPYC Rome · 17 GB RAM · 6 cores · CPU-only · €20/month server

| Model | Params | Quant | tok/s | Prefill |
|---|---|---|---|---|
| SmolLM2 | 135M | Q8_0 | **130.23** | 87 ms |
| Qwen 2.5 | 3B | Q4_K_M | **3.85** | 16.5 s |
| Llama 3.2 | 3B | Q4_K_M | **3.82** | 3.8 s |
| Mistral 7B | 7B | Q4_K_M | **2.06** | 39.2 s |
| Llama 3.1 | 8B | Q4_K_M | **1.75** | 43.0 s |
| DeepSeek R1 | 14B | Q4_K_M | **0.97** | 74.1 s |

9 models · 4 architectures · Same binary · Zero configuration

→ [Full benchmarks](BENCHMARKS.md)

---

## Supported Hardware

| Backend | Target | Status |
|---|---|---|
| CPU AVX2/512 | Intel, AMD | ✅ Production |
| CUDA | NVIDIA GPU | ✅ Production |
| ROCm | AMD GPU | ✅ Production |
| Metal | Apple Silicon | ✅ Production |
| Vulkan | Cross-platform | ✅ Production |
| ARM NEON | ARM (Pi, phones) | ✅ Production |
| Snapdragon | Qualcomm | 🔶 Ready |
| Hexagon HVX | Qualcomm DSP | 🔶 Ready |
| TPU | Google | 🔶 Ready |
| Inferentia | AWS | 🔶 Ready |
| Gaudi | Intel HPU | 🔶 Ready |
| Maia | Microsoft | 🔶 Ready |
| SambaNova RDU | SambaNova | 🔶 Ready |
| Graphcore IPU | Graphcore | 🔶 Ready |
| Groq LPU | Groq | 🔶 Ready |
| Cerebras WSE | 850K cores | 🔶 Ready |
| FPGA | Xilinx | 🔶 Ready |
| WebGPU | Browser | 🔶 Ready |
| OpenCL | Universal | 🔶 Ready |

The Makefile detects your hardware. You don't configure it.

---

## API Server

Start with `--serve 8080`. OpenAI-compatible API. Any client library works.
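Because the request body follows the standard OpenAI chat-completions schema, you don't even need a client library. A minimal stdlib-only sketch — it assumes the server was started with `--serve 8080` on localhost; the `"local"` model name is a placeholder, as in the client example:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # assumes `./inference-x model.gguf --serve 8080`

def chat_body(prompt: str) -> bytes:
    """JSON body for POST /v1/chat/completions (standard OpenAI chat schema)."""
    return json.dumps({
        "model": "local",  # placeholder model name
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")

def ask(prompt: str) -> str:
    """One blocking chat turn against the local server."""
    req = urllib.request.Request(
        BASE_URL + "/v1/chat/completions",
        data=chat_body(prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```

With the official `openai` package the same call looks like this: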
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

resp = client.chat.completions.create(
    model="local",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
```

Endpoints: `POST /v1/chat/completions` · `POST /v1/completions` · `GET /v1/models` · `GET /health`

---

## Features

- **Universal GGUF** — Any model, any architecture, auto-detected from metadata
- **Chat templates** — 7 formats auto-detected (Llama, ChatML, Alpaca, Gemma, Phi, Mistral, DeepSeek)
- **Multi-EOS** — Correct stop tokens for every architecture
- **Server mode** — OpenAI-compatible API, streaming, health check
- **Air-gapped** — No network calls during inference. No telemetry. Ever.
- **Zero configuration** — Download a model, run it. Templates, tokens, architecture: auto.

---

## Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md). Run `make` to build. Run `make test` to test. Submit a PR.

We welcome contributions from everyone, regardless of experience level. If you're new to open source, look for issues tagged `good first issue`.

---

## License

[BSL-1.1](LICENSE) — Business Source License

**Free for**: individuals, researchers, students, open-source projects, organizations under $1M revenue.

**Change date**: February 12, 2030 → [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)

After 2030, everything becomes fully open source. Patents remain protected.

---

## Acknowledgments

Built in Morocco for the world by [Salka Elmadani](https://x.com/ElmadaniSa13111).

> *In the Anti-Atlas, our ancestors built khettaras — underground water channels that deliver pure water to villages without pumps, without electricity. The water arrives cleaner than any treated supply because the path itself is the filter. Inference-X works the same way: the shortest path produces the cleanest signal.*

**[Website](https://inference-x.com)** · **[Sponsor](https://github.com/sponsors/ElmadaniS)** · **[Contact](mailto:Elmadani.SALKA@proton.me)**