# Inference-X

[Build](https://github.com/ElmadaniS/inference-x/actions/workflows/build.yml) · [Releases](https://github.com/ElmadaniS/inference-x/releases) · [License](LICENSE) · [Technology](TECHNOLOGY.md) · [Architecture](ARCHITECTURE.md)

**Run AI on your own computer. Private. Free. No internet.**

Inference-X is a tiny file (305 KB) that lets any computer run AI models locally. It works on old laptops, phones, Raspberry Pi, and datacenters: same file, no setup. Your questions stay on your machine. Nobody sees them.

**[Website](https://inference-x.com)** · **[How it works](TECHNOLOGY.md)** · **[Benchmarks](BENCHMARKS.md)** · **[Vision](VISION.md)** · **[Sponsor](https://github.com/sponsors/ElmadaniS)**

---

## Start in 30 seconds

```bash
git clone https://github.com/ElmadaniS/inference-x
cd inference-x && make
./inference-x model.gguf
```

That's it. Download a `.gguf` model from [HuggingFace](https://huggingface.co/models?sort=trending&search=gguf), run the command, talk to AI. No account. No API key. No internet once the model is on your disk.

Add `--serve 8080` to get a web interface at `localhost:8080`.

---

## What can your computer run?

| Your RAM | Models you can run | What it can do |
|---|---|---|
| **2 GB** | SmolLM2 135M | Simple assistant, quick answers |
| **4 GB** | Phi-3 Mini 3.8B, Llama 3.2 3B | Smart conversations, code help, translations |
| **8 GB** | Mistral 7B, Llama 3.1 8B | Creative writing, analysis, reasoning |
| **16 GB** | DeepSeek R1 14B | Advanced reasoning, expert-level answers |
| **32 GB** | Qwen 2.5 32B | Professional-grade AI |
| **64 GB** | Llama 3.1 70B, DeepSeek V3 MoE | Frontier performance, locally |

Every model runs privately, offline, with no subscription.
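
To check whether a model not listed in the table is likely to fit, a rough estimate of the weight size helps. The sketch below is not part of Inference-X; `weight_size_gib` and `BITS_PER_WEIGHT` are illustrative names. It covers the weights only and ignores the KV cache and any runtime overhead, so treat the result as a lower bound.

```python
# Bits per weight for two GGUF block formats, derived from their block layouts:
#   Q8_0: 32 int8 weights + one fp16 scale = 34 bytes per 32 weights -> 8.5 bits
#   Q4_0: 32 4-bit weights + one fp16 scale = 18 bytes per 32 weights -> 4.5 bits
BITS_PER_WEIGHT = {"Q8_0": 8.5, "Q4_0": 4.5}


def weight_size_gib(n_params: float, quant: str) -> float:
    """Approximate size of the quantized weights alone, in GiB."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 2**30


for name, params in [("3B", 3e9), ("8B", 8e9), ("14B", 14e9)]:
    print(f"{name}: ~{weight_size_gib(params, 'Q4_0'):.1f} GiB at Q4_0")
```

K-quant formats such as Q4_K_M use slightly more bits per weight than Q4_0, so their files come out somewhat larger than this estimate.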

---

## Why local AI matters

When you use AI online, your words travel to a server in another country. Someone can read them. You pay per word. The service can shut down.

With Inference-X, your questions stay on your desk. The answer is computed by your own processor. Nothing leaves. Nothing is stored. It works without internet. It's free forever.

---

## What makes it different

Most inference engines add layers between the model and the hardware: frameworks, runtime allocators, intermediate buffers. Each layer degrades the model's signal.

Inference-X removes those layers.

**Fused computation**: Dequantization and matrix multiply happen in a single instruction loop. No intermediate FP32 buffer. Output closer to the model's theoretical maximum.
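
As an illustration of the idea, here is a minimal Python sketch of a fused dot product over a Q8_0-style block layout (one scale plus 32 small integers per block). It is not the project's kernel, which lives in `gemm.h`; the function names and the quantizer are assumptions made for the example. The unfused version builds a full float copy of the weights first, while the fused version dequantizes inside the multiply loop.

```python
from typing import List, Tuple

# One quantized block: a float scale plus up to 32 small integers.
Block = Tuple[float, List[int]]


def quantize(values: List[float], block_size: int = 32) -> List[Block]:
    """Quantize floats into blocks of (scale, int8-range values)."""
    blocks = []
    for i in range(0, len(values), block_size):
        chunk = values[i:i + block_size]
        amax = max(abs(v) for v in chunk)
        scale = amax / 127 if amax else 1.0
        blocks.append((scale, [round(v / scale) for v in chunk]))
    return blocks


def dot_unfused(blocks: List[Block], x: List[float]) -> float:
    """Two passes: materialize a float copy of the weights, then multiply."""
    weights = [scale * q for scale, qs in blocks for q in qs]  # intermediate buffer
    return sum(w * xi for w, xi in zip(weights, x))


def dot_fused(blocks: List[Block], x: List[float]) -> float:
    """One pass: dequantize each block inside the multiply loop, no float copy."""
    total = 0.0
    pos = 0
    for scale, qs in blocks:
        acc = 0.0
        for q in qs:
            acc += q * x[pos]
            pos += 1
        total += scale * acc  # apply the block scale once per block
    return total
```

Both functions compute the same dot product up to floating-point rounding. The fused form avoids allocating and re-reading the dequantized weights.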

**Adaptive precision**: Each query is analyzed before inference. Simple questions get compressed early layers and full-precision decision layers. Complex reasoning gets full precision throughout.

**Surgical expert loading**: For Mixture-of-Experts models, only active experts exist in memory. A 1-trillion-parameter model runs on 64 GB of RAM.
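
One way to picture keeping only active experts in memory is a bounded cache that loads an expert on first use and evicts another when full. The sketch below assumes a least-recently-used eviction policy and a hypothetical `load` callback; the actual selection, prefetch, and eviction logic in `moe_mla.h` may work differently.

```python
from collections import OrderedDict
from typing import Callable


class ExpertCache:
    """Keep at most `capacity` experts resident; evict the least recently used."""

    def __init__(self, capacity: int, load: Callable[[int], bytes]):
        self.capacity = capacity
        self.load = load  # hypothetical loader, e.g. a read from the model file
        self.resident: "OrderedDict[int, bytes]" = OrderedDict()
        self.loads = 0  # number of times an expert had to be loaded

    def get(self, expert_id: int) -> bytes:
        if expert_id in self.resident:
            self.resident.move_to_end(expert_id)  # mark as most recently used
            return self.resident[expert_id]
        if len(self.resident) >= self.capacity:
            self.resident.popitem(last=False)  # evict the least recently used
        self.resident[expert_id] = self.load(expert_id)
        self.loads += 1
        return self.resident[expert_id]


cache = ExpertCache(capacity=2, load=lambda i: bytes([i]))
for expert_id in (0, 1, 0, 2):  # expert 1 is evicted when expert 2 arrives
    cache.get(expert_id)
print(list(cache.resident), cache.loads)
```

With a router that activates only a few experts per token, the resident set stays far smaller than the full model, at the cost of a load whenever a cold expert is selected.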

The result: **the same model produces better output through a cleaner computation path.** A smaller model through Inference-X can match a larger model through a conventional engine.

→ [Full technical explanation](TECHNOLOGY.md)

---

## How it works

TCP/IP routes data packets to any network. Inference-X routes intelligence to any silicon.

One function call enters `kernel_dispatch.h`. On the other side: CPU, GPU, TPU, LPU, IPU, FPGA, DSP, or WSE. The model runs. The answer comes back.

```
Model (any GGUF) → Inference-X (305 KB) → Silicon (any of 19 backends) → Response
```
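
The routing step can be pictured as a lookup from backend name to kernel. The following Python sketch is an assumption about the general shape of such a dispatcher, not a transcription of `kernel_dispatch.h`; `BACKENDS`, `dispatch_gemm`, and `gemm_reference` are hypothetical names.

```python
from typing import Callable, Dict, List

Matrix = List[List[float]]
GemmFn = Callable[[Matrix, Matrix], Matrix]


def gemm_reference(a: Matrix, b: Matrix) -> Matrix:
    """Plain triple-loop matrix multiply, standing in for a real kernel."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


# Hypothetical registry: backend name -> GEMM implementation.
BACKENDS: Dict[str, GemmFn] = {"cpu": gemm_reference}


def dispatch_gemm(a: Matrix, b: Matrix, preferred: List[str]) -> Matrix:
    """Run GEMM on the first preferred backend that is registered."""
    for name in preferred:
        kernel = BACKENDS.get(name)
        if kernel is not None:
            return kernel(a, b)
    raise RuntimeError(f"no registered backend among {preferred}")


# "cuda" is not registered in this sketch, so the call falls back to "cpu".
print(dispatch_gemm([[1, 2], [3, 4]], [[5, 6], [7, 8]], ["cuda", "cpu"]))
```

The caller never names a device directly. Adding a backend means registering one more function behind the same interface.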

```
Architecture:
infer.cpp (570 lines) - Orchestrator. Chat templates. Server mode.
transformer_v6.h - Forward pass. Dense + MoE + MLA unified.
kernel_dispatch.h - Routes GEMM to the right silicon.
moe_mla.h - Expert selection. Prefetch. Eviction.
gemm.h - Fused dequant × matmul kernels.
backends.h - 19 hardware targets. One interface.
```

12,571 lines of C++17. 6 architectures (Llama, Qwen2, Gemma2, Phi, DeepSeek MoE, MLA). 23 quantization formats. One binary.

---

## Benchmarks

AMD EPYC Rome · 17 GB RAM · 6 cores · CPU-only · €20/month server

| Model | Params | Quant | tok/s | Prefill |
|---|---|---|---|---|
| SmolLM2 | 135M | Q8_0 | **130.23** | 87 ms |
| Qwen 2.5 | 3B | Q4_K_M | **3.85** | 16.5 s |
| Llama 3.2 | 3B | Q4_K_M | **3.82** | 3.8 s |
| Mistral 7B | 7B | Q4_K_M | **2.06** | 39.2 s |
| Llama 3.1 | 8B | Q4_K_M | **1.75** | 43.0 s |
| DeepSeek R1 | 14B | Q4_K_M | **0.97** | 74.1 s |

9 models · 4 architectures · Same binary · Zero configuration
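
To turn these figures into a feel for waiting time, add the prefill time to the generation time. The sketch below is back-of-the-envelope arithmetic only: it assumes a prompt similar to the benchmark prompt (whose length is not stated here) and a steady decode rate, and `response_seconds` is an illustrative helper.

```python
def response_seconds(prefill_s: float, tok_per_s: float, n_tokens: int) -> float:
    """Prefill time plus generation time for n_tokens at a steady decode rate."""
    return prefill_s + n_tokens / tok_per_s


# Figures from the table above; 100 output tokens is an arbitrary example length.
for model, prefill_s, tok_per_s in [
    ("SmolLM2 135M", 0.087, 130.23),
    ("Llama 3.2 3B", 3.8, 3.82),
    ("Mistral 7B", 39.2, 2.06),
]:
    seconds = response_seconds(prefill_s, tok_per_s, 100)
    print(f"{model}: ~{seconds:.0f} s for 100 tokens")
```

Prefill time grows with prompt length, so short prompts will start answering sooner than the table suggests and long ones later.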

→ [Full benchmarks](BENCHMARKS.md)

---

## Supported Hardware

| Backend | Target | Status |
|---|---|---|
| CPU AVX2/512 | Intel, AMD | ✅ Production |
| CUDA | NVIDIA GPU | ✅ Production |
| ROCm | AMD GPU | ✅ Production |
| Metal | Apple Silicon | ✅ Production |
| Vulkan | Cross-platform | ✅ Production |
| ARM NEON | ARM (Pi, phones) | ✅ Production |
| Snapdragon | Qualcomm | 🔶 Ready |
| Hexagon HVX | Qualcomm DSP | 🔶 Ready |
| TPU | Google | 🔶 Ready |
| Inferentia | AWS | 🔶 Ready |
| Gaudi | Intel HPU | 🔶 Ready |
| Maia | Microsoft | 🔶 Ready |
| SambaNova RDU | SambaNova | 🔶 Ready |
| Graphcore IPU | Graphcore | 🔶 Ready |
| Groq LPU | Groq | 🔶 Ready |
| Cerebras WSE | 850K cores | 🔶 Ready |
| FPGA | Xilinx | 🔶 Ready |
| WebGPU | Browser | 🔶 Ready |
| OpenCL | Universal | 🔶 Ready |

The Makefile detects your hardware. You don't configure it.

---
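
## API Server

Start with `--serve 8080`. The server exposes an OpenAI-compatible API, so any OpenAI client library should work.

```python
from openai import OpenAI

# No real key is needed; the client library just requires a non-empty string.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")
resp = client.chat.completions.create(
    model="local",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
# With stream=True the response is an iterator of chunks; print text as it arrives.
for chunk in resp:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```

Endpoints: `POST /v1/chat/completions` · `POST /v1/completions` · `GET /v1/models` · `GET /health`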
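
The two `GET` endpoints can be probed with the Python standard library alone. This is a minimal sketch, assuming the server was started with `--serve 8080` and that `/v1/models` returns JSON as the OpenAI convention suggests; the response body of `/health` is not documented here, so only the status code is checked. The helper names are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # assumes the server was started with --serve 8080


def endpoint(path: str, base_url: str = BASE_URL) -> str:
    """Join the base URL and an endpoint path without doubling slashes."""
    return base_url.rstrip("/") + "/" + path.lstrip("/")


def is_healthy(base_url: str = BASE_URL, timeout: float = 2.0) -> bool:
    """Return True if GET /health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(endpoint("/health", base_url), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # connection refused, timeout, or HTTP error
        return False


def list_models(base_url: str = BASE_URL, timeout: float = 2.0) -> dict:
    """Return the parsed JSON body of GET /v1/models."""
    with urllib.request.urlopen(endpoint("/v1/models", base_url), timeout=timeout) as resp:
        return json.load(resp)
```

Calling `is_healthy()` before `list_models()` gives a quick way to tell "server not running" apart from "server running but request failed".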

---

## Features

- **Universal GGUF**: Any model, any architecture, auto-detected from metadata
- **Chat templates**: 7 formats auto-detected (Llama, ChatML, Alpaca, Gemma, Phi, Mistral, DeepSeek)
- **Multi-EOS**: Correct stop tokens for every architecture
- **Server mode**: OpenAI-compatible API, streaming, health check
- **Air-gapped**: No network calls during inference. No telemetry. Ever.
- **Zero configuration**: Download a model, run it. Templates, tokens, architecture: auto.
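
To make the chat-template feature concrete, here is what one of the listed formats, ChatML, looks like when applied to a message list. This Python sketch only illustrates the format; it is not the template code in `infer.cpp`, and `format_chatml` is a hypothetical helper.

```python
from typing import Dict, List


def format_chatml(messages: List[Dict[str, str]]) -> str:
    """Render messages in ChatML and leave the prompt open for the assistant."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    parts.append("<|im_start|>assistant\n")  # the model continues from here
    return "".join(parts)


print(format_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]))
```

Each model family expects its own markers, which is why picking the template from the model's metadata matters: the wrong wrapper usually degrades answers or breaks stop-token handling.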

---

## Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md). Run `make` to build. Run `make test` to test. Submit a PR.

We welcome contributions from everyone, regardless of experience level. If you're new to open source, look for issues tagged `good first issue`.

---

## License

[BSL-1.1](LICENSE): Business Source License

**Free for**: individuals, researchers, students, open-source projects, organizations under $1M revenue.

**Change date**: February 12, 2030 → [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)

After 2030, everything becomes fully open source. Patents remain protected.

---

## Acknowledgments

Built in Morocco for the world by [Salka Elmadani](https://x.com/ElmadaniSa13111).

> *In the Anti-Atlas, our ancestors built khettaras, underground water channels that deliver pure water to villages without pumps, without electricity. The water arrives cleaner than any treated supply because the path itself is the filter. Inference-X works the same way: the shortest path produces the cleanest signal.*

**[Website](https://inference-x.com)** · **[Sponsor](https://github.com/sponsors/ElmadaniS)** · **[Contact](mailto:Elmadani.SALKA@proton.me)**