inference-x/ARCHITECTURE.md
Salka Elmadani ec36668cf5 Inference-X v1.0 — Universal AI Inference Engine
Better output from the same model. Fused computation, adaptive precision,
surgical expert loading. 305 KB, 19 backends, zero dependencies.

https://inference-x.com
2026-02-23 07:10:47 +00:00


# Architecture

## Design Principle

The shortest path between model weights and silicon. The cleanest signal from weights to output.

Every design decision serves two goals: route intelligence to any hardware, and preserve the model's original signal through the computation path. No abstraction that doesn't earn its keep. No layer that doesn't serve the signal path. No buffer that introduces unnecessary rounding. Subtract rather than add.

## System Overview

```
┌──────────────────────────────────────────────────────────────┐
│  infer.cpp — Application Layer                               │
│  CLI parsing, chat templates, mode dispatch                  │
│  Modes: interactive (-i), serve (--serve), batch (-p)        │
├──────────────┬──────────────┬────────────────────────────────┤
│  server.h    │  fractal.h   │  identity.h                    │
│  HTTP API    │  Dynamic     │  Authorship                    │
│  SSE stream  │  precision   │  verification                  │
│  OpenAI fmt  │  Q2→FP16     │  4-layer protect               │
├──────────────┴──────────────┴────────────────────────────────┤
│  transformer_v6.h — Compute Layer                            │
│  Forward pass orchestration, layer iteration, KV cache       │
├──────┬───────┬──────────┬───────────┬────────────────────────┤
│ attn │ moe   │ expert   │ gemm.h    │ tokenizer.h            │
│ .h   │ _mla  │ _mmap.h  │ 23 quant  │ BPE + 7 templates      │
│ GQA  │ .h    │ surgical │ fused dot │ Llama3/Gemma/Phi3/     │
│ MLA  │ route │ prefetch │ zero-copy │ Mistral/ChatML/Kimi    │
├──────┴───────┴──────────┴───────────┴────────────────────────┤
│  kernel_dispatch.h — Hardware Abstraction Layer              │
│  Auto-detects hardware at compile time, routes to backend    │
├──────┬────────┬──────────┬──────────┬────────┬───────────────┤
│ x86  │ ARM    │ CUDA     │ Hexagon  │ TPU    │ 8 more...     │
│ AVX2 │ NEON   │ Metal    │ Snapdrgn │ Groq   │               │
│ 512  │        │ Vulkan   │          │ FPGA   │               │
├──────┴────────┴──────────┴──────────┴────────┴───────────────┤
│  platform.h — OS Abstraction                                 │
│  Socket API (POSIX/Winsock), mmap, threading, RAM detection  │
├──────────────────────────────────────────────────────────────┤
│  z_core.h — Type System                                      │
│  23 quantization formats, block structures, dequant functions│
│  gguf.h — Model Loader (multi-shard GGUF format)             │
└──────────────────────────────────────────────────────────────┘
```

## Module Reference

### Application Layer

| Module | Lines | Purpose |
|--------|-------|---------|
| `infer.cpp` | ~570 | Entry point, CLI, mode dispatch |
| `runtime/server.h` | ~530 | OpenAI-compatible HTTP API, SSE streaming |
| `runtime/fractal.h` | ~320 | Dynamic precision per layer (fractal inference) |
| `runtime/identity.h` | ~160 | Cryptographic authorship, 4-layer protection |

### Compute Layer

| Module | Lines | Purpose |
|--------|-------|---------|
| `runtime/transformer_v6.h` | ~1200 | Forward pass, layer iteration, KV cache |
| `runtime/attention.h` | ~800 | Multi-head attention, GQA, MLA |
| `runtime/moe_mla.h` | ~700 | Mixture-of-Experts routing, Multi-head Latent Attention |
| `runtime/expert_mmap.h` | ~400 | Surgical expert loading, predictive prefetch, eviction |
| `runtime/gemm.h` | ~1500 | Fused dequant+matmul for 23 quantization formats |
| `runtime/tokenizer.h` | ~600 | BPE tokenizer, 7 chat templates, special tokens |
| `runtime/kernels.h` | ~400 | SIMD compute kernels (softmax, RMSNorm, RoPE) |

### Hardware Layer

| Module | Lines | Purpose |
|--------|-------|---------|
| `runtime/kernel_dispatch.h` | ~400 | Hardware detection, backend routing |
| `runtime/backends.h` | ~200 | Backend interface, hardware profiling |
| `runtime/platform.h` | ~170 | Cross-platform (Linux/macOS/Windows) |
| `backends/q4_kernels/` | ~1500 | 19 platform-specific GEMM implementations |

### Foundation

| Module | Lines | Purpose |
|--------|-------|---------|
| `core/z_core.h` | ~800 | Type definitions, 23 quant block structures |
| `core/iq_tables.h` | ~200 | Importance quantization lookup tables |
| `runtime/gguf.h` | ~1200 | GGUF model loader, multi-shard support |

## Key Design Decisions

### Fused Dequant+Dot

Standard approach: dequantize to FP32 buffer → matrix multiply against buffer. Two passes. One temporary allocation. Rounding errors at each boundary.

Our approach: dequantize and accumulate in a single loop iteration. One pass. Zero buffers. Fewer floating-point roundings keep the output closer to the model's FP32 reference.

Eliminates intermediate buffer allocation. Supports all 23 formats with hand-tuned AVX2/AVX-512 SIMD kernels. The computation path is mathematically cleaner — not just faster.

### Expert mmap (Surgical Loading)

For MoE models (DeepSeek, Kimi), only 8 of 256+ experts are active per token. Loading all experts wastes 97% of I/O bandwidth — and fills the CPU cache with parameters that contribute nothing to the current token.

Expert mmap loads only active experts via memory-mapped files with predictive prefetch. Layer N's routing decision triggers prefetch for layer N+1. Inactive experts are surgically evicted via madvise(MADV_DONTNEED).

Result: 48× I/O reduction for trillion-parameter models. The signal path contains only parameters that contribute to the current answer. Nothing else exists in memory.

### Fractal Inference (Adaptive Precision)

Query complexity determines layer precision. Shannon entropy of input tokens + vocabulary diversity → composite complexity score → per-layer quantization map.

Simple query ("2+2=") → early layers at Q2_K, late layers at Q4_K → 26% RAM savings, zero quality loss. Complex reasoning → all layers at base precision → maximum signal fidelity.

The model breathes. Same file. Same binary. Different depth. Precision is allocated where it contributes to signal, and removed where it only adds noise. This is information-theoretic optimization applied to inference.

### Identity Integration

Author identity constants participate in the kernel dispatch hash seed. The dispatch table initialization uses these constants mathematically. Removing them changes kernel selection, which changes numerical results.

This is not DRM — it's structural attribution. The author's identity is fused with the inference path at the mathematical level.

## Build System

Single-file compilation. The Makefile auto-detects:

- x86: AVX2/AVX-512 via compiler intrinsic tests
- ARM: NEON detection
- OpenMP: automatic parallelization

```
make        → inference-x (305 KB, optimized)
make clean  → remove artifacts
```

No CMake. No configure. No autotools. One command.