# Inference-X
[![Build](https://github.com/ElmadaniS/inference-x/actions/workflows/build.yml/badge.svg)](https://github.com/ElmadaniS/inference-x/actions/workflows/build.yml)
[![Release](https://img.shields.io/github/v/release/ElmadaniS/inference-x)](https://github.com/ElmadaniS/inference-x/releases)
[![License](https://img.shields.io/badge/license-BSL--1.1-blue)](LICENSE)
[![Binary Size](https://img.shields.io/badge/binary-305%20KB-brightgreen)](TECHNOLOGY.md)
[![Backends](https://img.shields.io/badge/backends-19-orange)](ARCHITECTURE.md)
**Run AI on your own computer. Private. Free. No internet.**
Inference-X is a tiny file (305 KB) that lets any computer run AI models locally. It works on old laptops, phones, Raspberry Pi, and datacenters — same file, no setup. Your questions stay on your machine. Nobody sees them.
**[Website](https://inference-x.com)** · **[How it works](TECHNOLOGY.md)** · **[Benchmarks](BENCHMARKS.md)** · **[Vision](VISION.md)** · **[Sponsor](https://github.com/sponsors/ElmadaniS)**
---
## Start in 30 seconds
```bash
git clone https://github.com/ElmadaniS/inference-x
cd inference-x && make
./inference-x model.gguf
```
That's it. Download a `.gguf` model from [HuggingFace](https://huggingface.co/models?sort=trending&search=gguf), run the command, talk to AI. No account. No API key. No internet.
Add `--serve 8080` to get a web interface at `localhost:8080`.
---
## What can your computer run?
| Your RAM | Models you can run | What it can do |
|---|---|---|
| **2 GB** | SmolLM2 135M | Simple assistant, quick answers |
| **4 GB** | Phi-3 Mini 3.8B, Llama 3.2 3B | Smart conversations, code help, translations |
| **8 GB** | Mistral 7B, Llama 3.1 8B | Creative writing, analysis, reasoning |
| **16 GB** | DeepSeek R1 14B | Advanced reasoning, expert-level answers |
| **32 GB** | Qwen 2.5 32B | Professional-grade AI |
| **64 GB** | Llama 3.1 70B, DeepSeek V3 MoE | Frontier performance, locally |
Every model runs privately, offline, with no subscription.
---
## Why local AI matters
When you use AI online, your words travel to a server in another country. Someone can read them. You pay per word. The service can shut down.
With Inference-X, your questions stay on your desk. The answer is computed by your own processor. Nothing leaves. Nothing is stored. It works without internet. It's free forever.
---
## What makes it different
Most inference engines add layers between the model and the hardware: frameworks, runtime allocators, intermediate buffers. Each layer degrades the model's signal.
Inference-X removes those layers.
**Fused computation** — Dequantization and matrix multiply happen in a single instruction loop. No intermediate FP32 buffer. Output closer to the model's theoretical maximum.
**Adaptive precision** — Each query is analyzed before inference. Simple questions get compressed early layers and full-precision decision layers. Complex reasoning gets full precision throughout.
**Surgical expert loading** — For Mixture-of-Experts models, only active experts exist in memory. A 1-trillion-parameter model runs on 64 GB of RAM.
The result: **the same model produces better output through a cleaner computation path.** A smaller model through Inference-X can match a larger model through a conventional engine.
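The fused-computation idea can be illustrated with a toy sketch (pure Python, not the actual kernels in `gemm.h`): each quantized weight is dequantized and consumed in the same loop iteration, so no full-precision weight buffer is ever materialized.

```python
def fused_dequant_dot(qweights, scale, activations):
    """Dequantize-and-multiply in a single loop: each quantized
    weight is scaled and accumulated immediately, so no intermediate
    FP32 weight buffer is allocated."""
    acc = 0.0
    for q, a in zip(qweights, activations):
        acc += (q * scale) * a  # dequantize and accumulate in one step
    return acc

def two_pass_dot(qweights, scale, activations):
    """Conventional path: first materialize the full dequantized row,
    then run the dot product over the intermediate buffer."""
    deq = [q * scale for q in qweights]  # intermediate FP32 buffer
    return sum(d * a for d, a in zip(deq, activations))

# Both paths compute the same value; the fused one skips the buffer.
q = [3, -2, 7, 0]          # toy quantized weights
x = [0.5, 1.0, -1.0, 2.0]  # activations
assert abs(fused_dequant_dot(q, 0.1, x) - two_pass_dot(q, 0.1, x)) < 1e-12
```

In the real kernels the win is memory traffic and cache locality, not just the skipped allocation; this sketch only shows the structural difference between the two paths.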
→ [Full technical explanation](TECHNOLOGY.md)
---
## How it works
TCP/IP routes data packets to any network. Inference-X routes intelligence to any silicon.
One function call enters `kernel_dispatch.h`. On the other side: CPU, GPU, TPU, LPU, IPU, FPGA, DSP, or WSE. The model runs. The answer comes back.
```
Model (any GGUF) → Inference-X (305 KB) → Silicon (any of 19 backends) → Response
```
```
Architecture:
infer.cpp (570 lines) — Orchestrator. Chat templates. Server mode.
transformer_v6.h — Forward pass. Dense + MoE + MLA unified.
kernel_dispatch.h — Routes GEMM to the right silicon.
moe_mla.h — Expert selection. Prefetch. Eviction.
gemm.h — Fused dequant × matmul kernels.
backends.h — 19 hardware targets. One interface.
```
12,571 lines of C++17. 6 architectures (Llama, Qwen2, Gemma2, Phi, DeepSeek MoE, MLA). 23 quantization formats. One binary.
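The single-entry-point dispatch can be sketched in Python (the real `kernel_dispatch.h` is C++17, and its detection logic is not shown here; all names below are illustrative):

```python
# Illustrative sketch of single-entry-point GEMM dispatch.
# BACKENDS maps a backend name to (availability check, GEMM kernel).
BACKENDS = {}

def register(name, is_available, gemm_fn):
    BACKENDS[name] = (is_available, gemm_fn)

def kernel_dispatch(a, b):
    """One call site: run GEMM on the first available backend."""
    for name, (is_available, gemm_fn) in BACKENDS.items():
        if is_available():
            return gemm_fn(a, b)
    raise RuntimeError("no backend available")

def cpu_gemm(a, b):
    """Portable fallback: naive row-by-column matrix multiply."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner))
             for j in range(cols)] for i in range(rows)]

register("cpu", lambda: True, cpu_gemm)  # CPU is always available
result = kernel_dispatch([[1, 2]], [[3], [4]])  # 1*3 + 2*4 = [[11]]
```

The caller never names a backend; registering a new one (CUDA, Metal, TPU, ...) changes nothing at the call site, which is the property the "one function call enters `kernel_dispatch.h`" design is claiming.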
---
## Benchmarks
AMD EPYC Rome · 17 GB RAM · 6 cores · CPU-only · €20/month server
| Model | Params | Quant | tok/s | Prefill |
|---|---|---|---|---|
| SmolLM2 | 135M | Q8_0 | **130.23** | 87 ms |
| Qwen 2.5 | 3B | Q4_K_M | **3.85** | 16.5 s |
| Llama 3.2 | 3B | Q4_K_M | **3.82** | 3.8 s |
| Mistral 7B | 7B | Q4_K_M | **2.06** | 39.2 s |
| Llama 3.1 | 8B | Q4_K_M | **1.75** | 43.0 s |
| DeepSeek R1 | 14B | Q4_K_M | **0.97** | 74.1 s |
9 models · 4 architectures · Same binary · Zero configuration
→ [Full benchmarks](BENCHMARKS.md)
---
## Supported Hardware
| Backend | Target | Status |
|---|---|---|
| CPU AVX2/512 | Intel, AMD | ✅ Production |
| CUDA | NVIDIA GPU | ✅ Production |
| ROCm | AMD GPU | ✅ Production |
| Metal | Apple Silicon | ✅ Production |
| Vulkan | Cross-platform | ✅ Production |
| ARM NEON | ARM (Pi, phones) | ✅ Production |
| Snapdragon | Qualcomm | 🔶 Ready |
| Hexagon HVX | Qualcomm DSP | 🔶 Ready |
| TPU | Google | 🔶 Ready |
| Inferentia | AWS | 🔶 Ready |
| Gaudi | Intel HPU | 🔶 Ready |
| Maia | Microsoft | 🔶 Ready |
| SambaNova RDU | SambaNova | 🔶 Ready |
| Graphcore IPU | Graphcore | 🔶 Ready |
| Groq LPU | Groq | 🔶 Ready |
| Cerebras WSE | 850K cores | 🔶 Ready |
| FPGA | Xilinx | 🔶 Ready |
| WebGPU | Browser | 🔶 Ready |
| OpenCL | Universal | 🔶 Ready |
The Makefile detects your hardware. You don't configure it.
---
## API Server
Start with `--serve 8080`. OpenAI-compatible API. Any client library works.
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")
resp = client.chat.completions.create(
    model="local",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in resp:
    print(chunk.choices[0].delta.content or "", end="")
```
Endpoints: `POST /v1/chat/completions` · `POST /v1/completions` · `GET /v1/models` · `GET /health`
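OpenAI-compatible streaming responses arrive as server-sent events: each line carries `data: {json}` with a `delta` per chunk, terminated by a `data: [DONE]` sentinel. A minimal parser for that wire format (the sample chunks below are made up for illustration):

```python
import json

def parse_sse_stream(lines):
    """Yield the text delta from each OpenAI-style streaming chunk,
    stopping at the [DONE] sentinel."""
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:
            yield delta["content"]

# Example chunks in the OpenAI streaming format (contents made up).
sample = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    'data: [DONE]',
]
print("".join(parse_sse_stream(sample)))  # → Hello
```

In practice the `openai` client library does this parsing for you; the sketch only shows what any client receives over the wire.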
---
## Features
- **Universal GGUF** — Any model, any architecture, auto-detected from metadata
- **Chat templates** — 7 formats auto-detected (Llama, ChatML, Alpaca, Gemma, Phi, Mistral, DeepSeek)
- **Multi-EOS** — Correct stop tokens for every architecture
- **Server mode** — OpenAI-compatible API, streaming, health check
- **Air-gapped** — No network calls during inference. No telemetry. Ever.
- **Zero configuration** — Download a model, run it. Templates, tokens, architecture: auto.
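Template auto-detection amounts to a lookup from the model's declared format to a prompt renderer. The sketch below is a hypothetical simplification (the real engine reads the template from GGUF metadata; only ChatML and Alpaca are shown, and the fallback behavior is assumed):

```python
# Hypothetical sketch of chat-template selection; the real engine
# auto-detects the format from GGUF metadata. Mapping is illustrative.
TEMPLATES = {
    "chatml": lambda m: f"<|im_start|>user\n{m}<|im_end|>\n<|im_start|>assistant\n",
    "alpaca": lambda m: f"### Instruction:\n{m}\n\n### Response:\n",
}

def format_prompt(template_name, user_message):
    """Render a user message with the named template, or pass it
    through unchanged when the template is unknown (assumed fallback)."""
    render = TEMPLATES.get(template_name)
    if render is None:
        return user_message
    return render(user_message)

print(format_prompt("alpaca", "Hi"))
```

Picking the wrong template (or the wrong EOS token) is one of the most common causes of garbled local-model output, which is why the README calls out both auto-detected templates and multi-EOS handling.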
---
## Contributing
See [CONTRIBUTING.md](CONTRIBUTING.md). Run `make` to build. Run `make test` to test. Submit a PR.
We welcome contributions from everyone, regardless of experience level. If you're new to open source, look for issues tagged `good first issue`.
---
## License
[BSL-1.1](LICENSE) — Business Source License
**Free for**: individuals, researchers, students, open-source projects, organizations under $1M revenue.
**Change date**: February 12, 2030 → [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
After 2030, everything becomes fully open source. Patents remain protected.
---
## Acknowledgments
Built in Morocco for the world by [Salka Elmadani](https://x.com/ElmadaniSa13111).
> *In the Anti-Atlas, our ancestors built khettaras — underground water channels that deliver pure water to villages without pumps, without electricity. The water arrives cleaner than any treated supply because the path itself is the filter. Inference-X works the same way: the shortest path produces the cleanest signal.*
**[Website](https://inference-x.com)** · **[Sponsor](https://github.com/sponsors/ElmadaniS)** · **[Contact](mailto:Elmadani.SALKA@proton.me)**