Better output from the same model. Fused computation, adaptive precision, surgical expert loading. 305 KB, 19 backends, zero dependencies. https://inference-x.com
# Vision

> *"What if the model already knew the answer — and the engine was just in the way?"*

---

## The hidden problem

AI models are trained for months on thousands of GPUs in full precision. The intelligence is in the weights. The training is done. The model knows what it knows.

Then we run inference.

And between the weights and your screen, we add: a framework (hundreds of megabytes), a runtime allocator, intermediate buffers, uniform quantization across all layers, inactive experts consuming memory, rounding errors accumulating at every conversion step.

By the time the model's signal reaches you, it's been filtered through layers of computational noise that the model never asked for.

Every inference engine does this. They add complexity to manage complexity. They add abstraction to manage hardware. They add overhead to manage scale.

**We asked a different question: what if we removed it all?**

---

## The idea

Inference-X is not a faster engine. It's a *cleaner* one.

The same model, through Inference-X, produces output that is closer to its theoretical full-precision maximum — because the computation path between the weights and your screen has fewer steps, fewer conversions, fewer points where information degrades.

This isn't a feature. It's the architecture.
```
Standard engine:
Weights → Framework → Dequant buffer → MatMul → Buffer → Output
5 steps. Rounding at each boundary. ~100 MB binary.

Inference-X:
Weights → Fused dequant+dot → Output
2 steps. Zero buffer. 305 KB binary.
```

The binary is so small it fits in your CPU's L2 cache. The engine is invisible. You hear the model, not the framework.

---

## Three innovations

### 1. Adaptive precision

Not every question is hard. Not every layer matters equally.

Inference-X analyzes each query before inference begins — using Shannon entropy and vocabulary diversity — and assigns precision per layer based on what the question actually needs.

Simple question? Early layers drop to Q2_K, saving 26% memory. Decision layers stay at full precision. Complex reasoning? Everything stays at maximum. The model *breathes* with the question.
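
As a sketch of how such an entropy-driven precision plan could work (the function names, thresholds, and format choices are illustrative assumptions, not the engine's actual code):

```python
import math
from collections import Counter

def shannon_entropy(tokens):
    """Shannon entropy (bits) of the prompt's token distribution."""
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def assign_layer_precision(prompt, n_layers, decision_layers=frozenset()):
    """Assign a quantization format per layer from query complexity.

    Simple (low-entropy) prompts let early layers drop to Q2_K;
    decision layers and complex prompts keep higher precision.
    Thresholds here are illustrative, not the engine's real policy.
    """
    tokens = prompt.lower().split()
    entropy = shannon_entropy(tokens)
    diversity = len(set(tokens)) / max(len(tokens), 1)  # vocabulary diversity
    complex_query = entropy * diversity > 3.0           # illustrative threshold

    plan = []
    for layer in range(n_layers):
        if complex_query or layer in decision_layers or layer >= n_layers // 2:
            plan.append("Q8_0")   # keep precision where it matters
        else:
            plan.append("Q2_K")   # early layers compress on simple queries
    return plan
```

A short factual question yields a plan whose early layers are compressed, while a long, varied prompt keeps every layer at the higher format.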

No other engine does this. They apply uniform precision because it's simpler to implement. We apply information-theoretic precision because it's closer to how intelligence actually works: attention is selective.

### 2. Fused computation

Standard engines dequantize quantized weights to a temporary FP32 buffer, then perform the matrix multiply against that buffer. Two memory passes. One temporary allocation. Rounding errors at each conversion boundary.

Inference-X fuses both operations into a single loop. Each quantized value is decoded and multiplied in the same iteration, with the result accumulated directly into the output register. No buffer. No intermediate storage. Fewer conversions mean fewer rounding errors.
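
The difference between the two paths can be sketched in pure Python. The real kernels are SIMD; this only illustrates the data flow, and the function names are hypothetical:

```python
def dot_two_pass(qweights, scale, activations):
    """Standard path: dequantize into a temporary buffer, then multiply."""
    buffer = [q * scale for q in qweights]                   # pass 1: full-size temporary
    return sum(w * a for w, a in zip(buffer, activations))   # pass 2: dot product

def dot_fused(qweights, scale, activations):
    """Fused path: decode and multiply in one loop, no intermediate buffer."""
    acc = 0.0
    for q, a in zip(qweights, activations):
        acc += (q * scale) * a   # dequant + multiply + accumulate in one step
    return acc
```

Both compute the same dot product; the fused loop simply never materializes the dequantized vector, which is where the extra memory pass and conversion rounding live.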

For 10 quantization formats, we have hand-tuned AVX2/AVX-512 SIMD kernels that perform this fusion. The result is output that is mathematically closer to the FP32 theoretical maximum.

### 3. Surgical expert management

Modern MoE models have 256–384 experts but activate only 8 per token. Standard engines load all experts and let the OS manage caching. This means 97% of the model's parameters are in memory, competing for CPU cache, adding noise to the memory bus — for nothing.

Inference-X tracks which experts are active and surgically evicts the rest at the OS level (`madvise`). The signal path contains only the parameters that contribute to the current token. Nothing else exists in memory.
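
A minimal sketch of the residency bookkeeping, with a plain set standing in for the OS-level `madvise(MADV_DONTNEED)` call (the class and method names are hypothetical):

```python
class ExpertResidency:
    """Track which MoE experts are resident and evict the rest.

    In a real engine, eviction would be an madvise(MADV_DONTNEED) on the
    expert's mapped weight range; here it is simulated with a set.
    """
    def __init__(self, n_experts):
        self.n_experts = n_experts
        self.resident = set()

    def route(self, active_experts):
        """Called per token with the router's chosen experts; returns evictions."""
        active = set(active_experts)
        evicted = self.resident - active
        for expert_id in evicted:
            self._evict(expert_id)   # real impl: release the weight pages to the OS
        self.resident = active
        return evicted

    def _evict(self, expert_id):
        pass  # placeholder for the OS-level page release

tracker = ExpertResidency(n_experts=256)
tracker.route([3, 17, 42, 99, 120, 200, 230, 255])            # token 1: 8 experts resident
evicted = tracker.route([3, 17, 50, 99, 121, 201, 231, 254])  # token 2: overlap stays, rest goes
```

After the second token only the 8 newly routed experts remain resident; everything the router no longer needs is handed back.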

This is how a 1-trillion-parameter model (Kimi K2.5, 226 GB) runs on a machine with 17 GB of RAM. Not by being clever about compression. By being precise about *what doesn't need to exist*.

---

## What this means

### For developers

The same model, better output, less hardware. A 7B model through Inference-X may match a 13B model through a standard engine — because the signal loss is lower. Your inference costs drop. Your hardware requirements shrink. Your users get better answers.

### For hardware manufacturers

One 305 KB binary supports 19 hardware backends. Integrate once, support every model. No framework lock-in. No vendor dependency. The protocol adapts to your silicon — you don't adapt to the protocol.

### For the world

The current architecture of AI concentrates intelligence: a few companies, a few countries, a few power grids decide who gets to think. Inference-X runs a trillion parameters on a single server. It runs 7B models on a Raspberry Pi. It compiles for microcontrollers.

Intelligence doesn't need to be expensive. It needs to be *clean*.

---

## Solar inference

Every hour, the Sun delivers more energy to Earth than humanity uses in a year. 173,000 terawatts, falling on deserts, rooftops, forgotten places.

If inference requires 5–15 kW per rack, you need solar farms and battery banks.

If inference requires 25 watts, you need a camping panel.

Adaptive precision was built for a different reason. But it turns out: an engine that can dynamically shift between Q2 and FP16 is exactly what solar inference needs. When the Sun is high, full precision. At twilight, compressed. At night, minimal.
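
One way such a power-to-precision mapping could look, with illustrative wattage thresholds (the function and numbers are assumptions, not the engine's actual policy):

```python
def precision_for_power(available_watts):
    """Map the instantaneous power budget to a quantization format.

    Thresholds are illustrative: full precision when the sun is high,
    compressed at twilight, minimal at night on battery.
    """
    if available_watts >= 100:
        return "FP16"   # sun high: maximum fidelity
    if available_watts >= 25:
        return "Q4_K"   # twilight: compressed but capable
    return "Q2_K"       # night: minimal footprint
```

The same per-layer machinery that responds to question complexity can then consume this format as its ceiling.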

The engine breathes with the Sun like it breathes with the question.

The first solar deployment target is 2026. Anti-Atlas, Morocco. 320 days of sun per year. The nearest datacenter is 1,000 kilometers away.

---

## The timeline

We don't announce timelines. We announce results.

- The engine is done. 305 KB. Running in production.
- The technology page explains how it works: [TECHNOLOGY.md](TECHNOLOGY.md)
- The benchmarks are real: [BENCHMARKS.md](BENCHMARKS.md)
- The web interface is live: [inference-x.com](https://inference-x.com)
- The solar adaptation is in development.

---

## A final thought

Every great piece of infrastructure made abundant something that was once scarce. Aqueducts made water abundant. Roads made trade abundant. The internet made information abundant.

The next abundance is intelligence. Not artificial. Not corporate. Not as-a-service.

Just intelligence. Clean. Accessible. Powered by whatever energy is available — from a datacenter to a star.

The model already knows. The engine just needs to get out of the way.

---

*Salka Elmadani*

*February 2026*

*Built in Morocco for the world.*

◆