Better output from the same model. Fused computation, adaptive precision, surgical expert loading. 305 KB, 19 backends, zero dependencies. https://inference-x.com
Architecture
Design Principle
The shortest path between model weights and silicon. The cleanest signal from weights to output.
Every design decision serves two goals: route intelligence to any hardware, and preserve the model's original signal through the computation path. No abstraction that doesn't earn its keep. No layer that doesn't serve the signal path. No buffer that introduces unnecessary rounding. Subtract rather than add.
System Overview
┌─────────────────────────────────────────────────────────────┐
│ infer.cpp — Application Layer │
│ CLI parsing, chat templates, mode dispatch │
│ Modes: interactive (-i), serve (--serve), batch (-p) │
├──────────────┬──────────────┬────────────────────────────────┤
│ server.h │ fractal.h │ identity.h │
│ HTTP API │ Dynamic │ Authorship │
│ SSE stream │ precision │ verification │
│ OpenAI fmt │ Q2→FP16 │ 4-layer protect │
├──────────────┴──────────────┴────────────────────────────────┤
│ transformer_v6.h — Compute Layer │
│ Forward pass orchestration, layer iteration, KV cache │
├──────┬───────┬──────────┬───────────┬────────────────────────┤
│ attn │ moe │ expert │ gemm.h │ tokenizer.h │
│ .h │ _mla │ _mmap.h │ 23 quant │ BPE + 7 templates │
│ GQA │ .h │ surgical │ fused dot │ Llama3/Gemma/Phi3/ │
│ MLA │ route │ prefetch │ zero-copy │ Mistral/ChatML/Kimi │
├──────┴───────┴──────────┴───────────┴────────────────────────┤
│ kernel_dispatch.h — Hardware Abstraction Layer │
│ Auto-detects hardware at compile time, routes to backend │
├──────┬────────┬──────────┬──────────┬────────┬───────────────┤
│ x86 │ ARM │ CUDA │ Hexagon │ TPU │ 8 more... │
│ AVX2 │ NEON │ Metal │ Snapdrgn │ Groq │ │
│ 512 │ │ Vulkan │ │ FPGA │ │
├──────┴────────┴──────────┴──────────┴────────┴───────────────┤
│ platform.h — OS Abstraction │
│ Socket API (POSIX/Winsock), mmap, threading, RAM detection │
├──────────────────────────────────────────────────────────────┤
│ z_core.h — Type System │
│ 23 quantization formats, block structures, dequant functions │
│ gguf.h — Model Loader (multi-shard GGUF format) │
└──────────────────────────────────────────────────────────────┘
Module Reference
Application Layer
| Module | Lines | Purpose |
|---|---|---|
| infer.cpp | ~570 | Entry point, CLI, mode dispatch |
| runtime/server.h | ~530 | OpenAI-compatible HTTP API, SSE streaming |
| runtime/fractal.h | ~320 | Dynamic precision per layer (fractal inference) |
| runtime/identity.h | ~160 | Cryptographic authorship, 4-layer protection |
Compute Layer
| Module | Lines | Purpose |
|---|---|---|
| runtime/transformer_v6.h | ~1200 | Forward pass, layer iteration, KV cache |
| runtime/attention.h | ~800 | Multi-head attention, GQA, MLA |
| runtime/moe_mla.h | ~700 | Mixture-of-Experts routing, Multi-head Latent Attention |
| runtime/expert_mmap.h | ~400 | Surgical expert loading, predictive prefetch, eviction |
| runtime/gemm.h | ~1500 | Fused dequant+matmul for 23 quantization formats |
| runtime/tokenizer.h | ~600 | BPE tokenizer, 7 chat templates, special tokens |
| runtime/kernels.h | ~400 | SIMD compute kernels (softmax, RMSNorm, RoPE) |
Hardware Layer
| Module | Lines | Purpose |
|---|---|---|
| runtime/kernel_dispatch.h | ~400 | Hardware detection, backend routing |
| runtime/backends.h | ~200 | Backend interface, hardware profiling |
| runtime/platform.h | ~170 | Cross-platform layer (Linux/macOS/Windows) |
| backends/q4_kernels/ | ~1500 | 19 platform-specific GEMM implementations |
Foundation
| Module | Lines | Purpose |
|---|---|---|
| core/z_core.h | ~800 | Type definitions, 23 quant block structures |
| core/iq_tables.h | ~200 | Importance quantization lookup tables |
| runtime/gguf.h | ~1200 | GGUF model loader, multi-shard support |
Key Design Decisions
Fused Dequant+Dot
Standard approach: dequantize to FP32 buffer → matrix multiply against buffer. Two passes. One temporary allocation. Rounding errors at each boundary.
Our approach: dequantize and accumulate in a single loop iteration. One pass. Zero buffer. Fewer intermediate rounding steps mean output closer to the model's FP32 theoretical maximum.
Eliminates intermediate buffer allocation. Supports all 23 formats with hand-tuned AVX2/AVX-512 SIMD kernels. The computation path is mathematically cleaner — not just faster.
Expert mmap (Surgical Loading)
For MoE models (DeepSeek, Kimi), only 8 of 256+ experts are active per token. Loading all experts wastes 97% of I/O bandwidth — and fills the CPU cache with parameters that contribute nothing to the current token.
Expert mmap loads only active experts via memory-mapped files with predictive prefetch. Layer N's routing decision triggers prefetch for layer N+1. Inactive experts are surgically evicted via madvise(DONTNEED).
Result: 48× I/O reduction for trillion-parameter models. The signal path contains only parameters that contribute to the current answer. Nothing else exists in memory.
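The residency control reduces to two madvise hints on the mmap'd weight file. A minimal sketch, assuming each expert occupies a contiguous, page-aligned region (the offsets and sizes are hypothetical layout parameters; expert_mmap.h's actual interface may differ):

```cpp
#include <sys/mman.h>
#include <cstddef>

// Hint the kernel to fault in a predicted-active expert's pages.
// In the engine, layer N's routing decision triggers this for layer N+1.
int prefetch_expert(void* base, size_t off, size_t len) {
    return madvise(static_cast<char*>(base) + off, len, MADV_WILLNEED);
}

// Surgically evict an inactive expert's resident pages. Clean file-backed
// pages are simply discarded and re-faulted from disk if the expert fires again.
int evict_expert(void* base, size_t off, size_t len) {
    return madvise(static_cast<char*>(base) + off, len, MADV_DONTNEED);
}
```

Because the mapping is read-only and file-backed, eviction needs no writeback — DONTNEED is effectively free, which is what makes per-token residency churn viable.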
Fractal Inference (Adaptive Precision)
Query complexity determines layer precision. Shannon entropy of input tokens + vocabulary diversity → composite complexity score → per-layer quantization map.
Simple query ("2+2=") → early layers at Q2_K, late layers at Q4_K → 26% RAM savings, zero quality loss. Complex reasoning → all layers at base precision → maximum signal fidelity.
The model breathes. Same file. Same binary. Different depth. Precision is allocated where it contributes to signal, and removed where it only adds noise. This is information-theoretic optimization applied to inference.
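The scoring pipeline can be sketched as follows. The entropy/diversity blend, thresholds, and quant names here are illustrative assumptions, not the shipped values in runtime/fractal.h:

```cpp
#include <cmath>
#include <map>
#include <string>
#include <vector>

// Composite complexity in [0, 1]: normalized Shannon entropy of the
// prompt's token histogram, blended with unique-token diversity.
float complexity_score(const std::vector<int>& tokens) {
    if (tokens.empty()) return 0.0f;
    std::map<int, int> hist;
    for (int t : tokens) hist[t]++;
    float H = 0.0f, n = static_cast<float>(tokens.size());
    for (const auto& kv : hist) {
        float p = kv.second / n;
        H -= p * std::log2(p);
    }
    float max_H = std::log2(n);                    // entropy ceiling for n tokens
    float norm_H = (max_H > 0.0f) ? H / max_H : 0.0f;
    float diversity = hist.size() / n;             // unique-token ratio
    return 0.5f * norm_H + 0.5f * diversity;       // illustrative 50/50 blend
}

// Map the score to a per-layer precision plan: simple prompts run early
// layers coarser, complex ones keep base precision everywhere.
std::vector<std::string> precision_map(float score, int n_layers) {
    std::vector<std::string> plan(n_layers, "Q4_K");
    if (score < 0.5f)
        for (int i = 0; i < n_layers / 2; ++i) plan[i] = "Q2_K";
    return plan;
}
```

A repetitive prompt like "2+2=" scores low (near-zero entropy) and gets the coarse early-layer plan; a varied reasoning prompt scores near 1.0 and keeps base precision throughout.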
Identity Integration
Author identity constants participate in the kernel dispatch hash seed. The dispatch table initialization uses these constants mathematically. Removing them changes kernel selection, which changes numerical results.
This is not DRM — it's structural attribution. The author's identity is fused with the inference path at the mathematical level.
Build System
Single-file compilation. The Makefile auto-detects:
- x86: AVX2/AVX-512 via compiler intrinsic tests
- ARM: NEON detection
- OpenMP: automatic parallelization
make → inference-x (305 KB, optimized)
make clean → remove artifacts
No CMake. No configure. No autotools. One command.
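The intrinsic test amounts to a compile probe: try building a trivial program with the candidate flag and key off the exit status. A sketch of the idea (the real Makefile's probe logic may differ; `cc` and the flag names are standard GCC/Clang conventions):

```shell
# probe FLAG: prints "yes" if the compiler accepts FLAG, "no" otherwise.
probe() {
  if echo 'int main(void){return 0;}' | cc "$1" -x c - -o /dev/null 2>/dev/null; then
    echo yes
  else
    echo no
  fi
}

probe -mavx2   # "yes" on x86 toolchains that accept the AVX2 flag
```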