inference-x/ARCHITECTURE.md
Salka Elmadani ec36668cf5 Inference-X v1.0 — Universal AI Inference Engine
Better output from the same model. Fused computation, adaptive precision,
surgical expert loading. 305 KB, 19 backends, zero dependencies.

https://inference-x.com
2026-02-23 07:10:47 +00:00


# Architecture

## Design Principle

The shortest path between model weights and silicon. The cleanest signal from weights to output.

Every design decision serves two goals: route intelligence to any hardware, and preserve the model's original signal through the computation path. No abstraction that doesn't earn its keep. No layer that doesn't serve the signal path. No buffer that introduces unnecessary rounding. Subtract rather than add.

## System Overview

```
┌──────────────────────────────────────────────────────────────┐
│  infer.cpp — Application Layer                               │
│  CLI parsing, chat templates, mode dispatch                  │
│  Modes: interactive (-i), serve (--serve), batch (-p)        │
├──────────────┬──────────────┬────────────────────────────────┤
│  server.h    │  fractal.h   │  identity.h                    │
│  HTTP API    │  Dynamic     │  Authorship                    │
│  SSE stream  │  precision   │  verification                  │
│  OpenAI fmt  │  Q2→FP16     │  4-layer protect               │
├──────────────┴──────────────┴────────────────────────────────┤
│  transformer_v6.h — Compute Layer                            │
│  Forward pass orchestration, layer iteration, KV cache       │
├──────┬───────┬──────────┬───────────┬────────────────────────┤
│ attn │ moe   │ expert   │ gemm.h    │ tokenizer.h            │
│ .h   │ _mla  │ _mmap.h  │ 23 quant  │ BPE + 7 templates      │
│ GQA  │ .h    │ surgical │ fused dot │ Llama3/Gemma/Phi3/     │
│ MLA  │ route │ prefetch │ zero-copy │ Mistral/ChatML/Kimi    │
├──────┴───────┴──────────┴───────────┴────────────────────────┤
│  kernel_dispatch.h — Hardware Abstraction Layer              │
│  Auto-detects hardware at compile time, routes to backend    │
├──────┬────────┬──────────┬──────────┬────────┬───────────────┤
│ x86  │ ARM    │ CUDA     │ Hexagon  │ TPU    │ 8 more...     │
│ AVX2 │ NEON   │ Metal    │ Snapdrgn │ Groq   │               │
│ 512  │        │ Vulkan   │          │ FPGA   │               │
├──────┴────────┴──────────┴──────────┴────────┴───────────────┤
│  platform.h — OS Abstraction                                 │
│  Socket API (POSIX/Winsock), mmap, threading, RAM detection  │
├──────────────────────────────────────────────────────────────┤
│  z_core.h — Type System                                      │
│  23 quantization formats, block structures, dequant functions│
│  gguf.h — Model Loader (multi-shard GGUF format)             │
└──────────────────────────────────────────────────────────────┘
```

## Module Reference

### Application Layer

| Module | Lines | Purpose |
|--------|-------|---------|
| `infer.cpp` | ~570 | Entry point, CLI, mode dispatch |
| `runtime/server.h` | ~530 | OpenAI-compatible HTTP API, SSE streaming |
| `runtime/fractal.h` | ~320 | Dynamic precision per layer (fractal inference) |
| `runtime/identity.h` | ~160 | Cryptographic authorship, 4-layer protection |

### Compute Layer

| Module | Lines | Purpose |
|--------|-------|---------|
| `runtime/transformer_v6.h` | ~1200 | Forward pass, layer iteration, KV cache |
| `runtime/attention.h` | ~800 | Multi-head attention, GQA, MLA |
| `runtime/moe_mla.h` | ~700 | Mixture-of-Experts routing, Multi-head Latent Attention |
| `runtime/expert_mmap.h` | ~400 | Surgical expert loading, predictive prefetch, eviction |
| `runtime/gemm.h` | ~1500 | Fused dequant+matmul for 23 quantization formats |
| `runtime/tokenizer.h` | ~600 | BPE tokenizer, 7 chat templates, special tokens |
| `runtime/kernels.h` | ~400 | SIMD compute kernels (softmax, RMSNorm, RoPE) |

### Hardware Layer

| Module | Lines | Purpose |
|--------|-------|---------|
| `runtime/kernel_dispatch.h` | ~400 | Hardware detection, backend routing |
| `runtime/backends.h` | ~200 | Backend interface, hardware profiling |
| `runtime/platform.h` | ~170 | Cross-platform (Linux/macOS/Windows) |
| `backends/q4_kernels/` | ~1500 | 19 platform-specific GEMM implementations |

### Foundation

| Module | Lines | Purpose |
|--------|-------|---------|
| `core/z_core.h` | ~800 | Type definitions, 23 quant block structures |
| `core/iq_tables.h` | ~200 | Importance quantization lookup tables |
| `runtime/gguf.h` | ~1200 | GGUF model loader, multi-shard support |

## Key Design Decisions

### Fused Dequant+Dot

Standard approach: dequantize to FP32 buffer → matrix multiply against buffer. Two passes. One temporary allocation. Rounding errors at each boundary.

Our approach: dequantize and accumulate in a single loop iteration. One pass. Zero buffers. Fewer floating-point roundings keep the output closer to the model's FP32 reference.

Eliminates intermediate buffer allocation. Supports all 23 formats with hand-tuned AVX2/AVX-512 SIMD kernels. The computation path is mathematically cleaner — not just faster.

### Expert mmap (Surgical Loading)

For MoE models (DeepSeek, Kimi), only 8 of 256+ experts are active per token. Loading all experts wastes 97% of I/O bandwidth — and fills the CPU cache with parameters that contribute nothing to the current token.

Expert mmap loads only active experts via memory-mapped files with predictive prefetch. Layer N's routing decision triggers prefetch for layer N+1. Inactive experts are surgically evicted via madvise(MADV_DONTNEED).

Result: 48× I/O reduction for trillion-parameter models. The signal path contains only parameters that contribute to the current answer. Nothing else exists in memory.

### Fractal Inference (Adaptive Precision)

Query complexity determines layer precision. Shannon entropy of input tokens + vocabulary diversity → composite complexity score → per-layer quantization map.

Simple query ("2+2=") → early layers at Q2_K, late layers at Q4_K → 26% RAM savings, zero quality loss. Complex reasoning → all layers at base precision → maximum signal fidelity.

The model breathes. Same file. Same binary. Different depth. Precision is allocated where it contributes to signal, and removed where it only adds noise. This is information-theoretic optimization applied to inference.

### Identity Integration

Author identity constants participate in the kernel dispatch hash seed. The dispatch table initialization uses these constants mathematically. Removing them changes kernel selection, which changes numerical results.

This is not DRM — it's structural attribution. The author's identity is fused with the inference path at the mathematical level.

## Build System

Single-file compilation. The Makefile auto-detects:

- x86: AVX2/AVX-512 via compiler intrinsic tests
- ARM: NEON detection
- OpenMP: automatic parallelization

```
make        → inference-x (305 KB, optimized)
make clean  → remove artifacts
```

No CMake. No configure. No autotools. One command.