
The engineering companion to the strategic piece on local inference, deliberately exhaustive. llama.cpp build flags for Blackwell, VRAM accounting to the MiB, context ceilings per quantization, prefill and decode throughput with and without MTP, a roofline analysis of why speculative decoding helps this MoE, a 200-call agentic tool-calling harness, and an autopsy of a KV-cache compression technique that crashed, CUDA stack trace included. Every figure was measured on one fixed rig.