Running an LLM Locally on a 16GB Consumer GPU: Why It Suddenly Matters in 2026

On 12 June 2026, Anthropic had to suspend access to its Fable 5 and Mythos 5 models for every customer worldwide, including those using them through the API and third-party clouds, following a US government export-control directive (Anthropic announcement, covered by CNBC and Bloomberg). Not a downtime, not a commercial decision by the vendor: an asset that companies had built production workflows on became unreachable by order of an external authority, against the vendor's own stated position. As of writing the situation is evolving and Anthropic is working on restoration, but the precedent is set: the availability of a frontier model in the cloud is not a property of the contract you signed. Full disclosure, since I will be discussing this: I use Claude and Fable in production every day, and what follows is not an anti-vendor argument, it is an architecture lesson that applies to every provider.

That episode pushed me to measure, precisely, a question my clients had been asking for months: how seriously can you run LLM inference at home, on hardware that costs about as much as a decent laptop, without depending on any API? I took an NVIDIA RTX 5060 Ti with 16GB (Blackwell architecture) and ran a 35-billion-parameter Mixture-of-Experts model on it for days, measuring throughput, VRAM footprint, agentic tool-calling reliability and the exact failure points. This article is the why; the full engineering write-up, with the bench setup, the benchmark methodology and every raw number, is the companion piece linked at the end.

Why run an LLM locally instead of calling a cloud API?

Because in the cloud you do not control three things that became concrete risks in 2026: the model's availability, your data residency and the portability of your work. The Fable case is the textbook example of the first risk. The second is just as material: as of 9 June 2026 Anthropic introduced a forced 30-day retention on Mythos-class models that overrides Zero Data Retention agreements even when the model runs on AWS Bedrock, Google Cloud or Microsoft Foundry (Anthropic support documentation). For a company that negotiated ZDR precisely to stay compliant with GDPR or NIS2, this means that using that model requires disabling the very guarantee the compliance rested on.

An open-weight model running on your own hardware zeroes out all three problems by construction: the weights are on your disk and nobody can revoke them, the data never leaves your network, and the model you use today you will use identically two years from now. I am not saying abandon frontier models, which remain superior in raw capability: I am saying a serious architecture designs for their replaceability, and the sovereign fallback is a local model that runs your work when the API does not answer. The fact that the "pact" between publishers and AI crawlers is now broken reinforces the point: according to Cloudflare data, in 2025 the ratio of pages crawled to traffic referred back was 14 to 1 for Google Search, but roughly 1,700 to 1 for OpenAI's crawler and up to 73,000 to 1 for Anthropic's. Who provides value and who extracts it has shifted, and keeping the infrastructure in-house is a form of independence.

The availability of a frontier model in the cloud is not a property of the contract you signed. It is a concession that a third-party jurisdiction can revoke. Designing for that model's replaceability is not excessive caution, it is the lesson of 2026.

What does a 16GB consumer GPU actually handle?

It handles a 35-billion-parameter Mixture-of-Experts model with up to 262,000 tokens of context, entirely in VRAM, with nothing spilling to the CPU. The model I used is Qwen3.6-35B-A3B, released under the Apache 2.0 license and distributed as quantized GGUF by Unsloth. The "A3B" tag is the key to everything: out of those 35 billion parameters, only about 3 billion are active per generated token. It is the Mixture-of-Experts architecture that makes inference on limited VRAM practical: all the weights must be loaded into memory, but the compute per token touches only a fraction, so the speed is that of a much smaller model while the capacity is that of a large one.

VRAM is the constraint that decides everything, and it has to be measured honestly. On my card, out of the nominal 16,311 MiB about 15,751 are actually available to the inference engine, because a few hundred MiB are always occupied. Within that budget three things must coexist: the model weights, the KV cache (which grows with context length) and the compute buffers. I tested two quantization levels, with clear-cut results.

Quantization	On-disk size	Max context fully in VRAM	VRAM used
IQ3_S	14.3 GB	160,000 tokens	15,776 MiB
IQ3_XXS	13.1 GB	262,000 tokens (native max)	15,338 MiB

The reading is precise: the lighter quantization (IQ3_XXS), being 1.2 GB smaller, leaves enough room for the KV cache to reach the full native context of 262,000 tokens with almost 1 GB of VRAM still free. The heavier one (IQ3_S) stops at 160,000 tokens, because beyond that threshold the compute buffer pushes the card into out-of-memory. The full breakdown, including why exactly the allocation fails and how batch size changes the ceiling, is in the technical deep-dive.

How fast is it in practice?

It generates between 55 and 90 tokens per second at long context, and here you have to separate two phases that most comparisons conflate. The first is prefill, the reading of the initial prompt: on my card it runs at about 1,250 tokens per second, so an 80,000-token prompt is processed in a little over a minute, a cost you pay once because subsequent interactions in the same session use the cache and answer in about a second. The second phase is decode, the actual token-by-token generation, and that is what you perceive as response speed.

With a real 80,000-token prompt, decode settles around 58 tokens per second in the base configuration. Turning on Multi-Token Prediction (llama.cpp's --spec-type draft-mtp, a form of speculative decoding native to the model, merged into the project's main branch in May 2026) the speed rises to 88 tokens per second, a gain of over 50% with no loss of accuracy. To give scale: these are two to three times the numbers the same card achieved on dense 12-to-14-billion-parameter models of the previous generation. There is a trade-off to know: Multi-Token Prediction costs about 740 MiB of extra VRAM, so on IQ3_S its practical use tops out around 100,000 tokens of context. The choice becomes clean: maximum speed with MTP up to roughly 96-100K, or maximum context (160K on IQ3_S, 262K on IQ3_XXS) without the boost.

Does it actually work for agentic coding, like a Continue plugin in VS Code?

Yes, and this is the result that surprised me most: tool-calling came out 100% reliable at the protocol level, with zero format errors across 200 measured calls. An agentic coding assistant like Continue does not just generate text: it calls tools (read a file, run a command, search the code) through an OpenAI-compatible API, and expects responses in a strict function-calling format. If the model gets that format wrong even occasionally, the agent loops and becomes unusable. That was the historical problem of local agentic inference.

I built a test harness that simulates exactly that behavior: a Continue-style agentic system prompt, five defined tools (read file, write file, list directory, search code, run shell command), ten scenarios repeated five times each with a varying seed, plus three multi-turn scenarios where after the tool call I fed back the result and verified the next action. The results were essentially identical on both quantizations: 100% protocol reliability (no malformed JSON, no nonexistent tool, no missing required argument across 200 calls), 0% real errors, and 100% multi-turn task completion. The few cases where the model did not pick the "expected" tool were in fact legitimate exploration (it listed the directory before running a command), correct agentic behavior that resolves on the next turn in a real loop.

One decisive technical detail for anyone integrating these models: Qwen3.6 is a thinking model, it generates an internal reasoning block before answering. llama.cpp, launched with the --jinja flag, routes that reasoning into a separate reasoning_content field, leaving the tool-call field clean. That is exactly what you need so the model's thinking does not pollute the function call, and it is why integration with an agentic plugin works without adapters. The full harness, the per-scenario breakdown and the speed numbers under MTP are in the companion article.

When is local inference NOT the right answer?

When you need production throughput for many concurrent users, or when the absolute quality of the generated code has to be at the very top. It is essential to separate two classes of stack that are often confused. Ollama and llama.cpp, which I used here, are workstation, edge and prototyping tools: excellent for a single developer or a small team, for data that must not leave, for a proof-of-concept. To serve a high-throughput production application there are different engines, vLLM and SGLang, designed for concurrency and datacenter GPUs. A proof-of-concept on Ollama is not a production sizing, and it is dishonest to sell it as one.

There is also the quality question. The 3-bit quantization I used keeps about 95% of the full-precision model's capability, against 99% for a 4-bit quantization. I measured tool-calling protocol reliability and long-context information retrieval, which are perfect, but not the absolute quality of the code produced on a benchmark like SWE-bench, which would require a separate analysis. It is also worth noting that declared context is not usable context: quality degrades as the context fills, a phenomenon known as context rot, so 262,000 tokens should be read as a maximum capacity, not a default operating window. Most real enterprise workloads (extraction, classification, well-built RAG) sit comfortably under 200,000 tokens.

Running a serious LLM locally in 2026 is no longer a compromise you accept reluctantly, it is an architectural choice with solid numbers behind it: a 35-billion-parameter Mixture-of-Experts, 262,000 tokens of context and reliable agentic tool-calling, on a graphics card costing a few hundred euros. It does not replace frontier models where you need maximum capability, but it is the sovereign fallback that makes you independent when an API is suspended by government decision, or when compliance forbids sending the data out.

If you want the full engineering story, the bench, the benchmark methodology, every raw measurement, the out-of-memory analysis and the things that failed (including a KV-cache compression technique that did not survive contact with this hardware), read the companion deep-dive: Benchmarking Qwen3.6-35B-A3B on a 16GB RTX 5060 Ti. And if what you want is narrower, small 9-billion-parameter coding agents you can run as a daily assistant and whether one can stand in for Copilot, the third part of the series benchmarks two of them head to head: Ornith vs Qwable, two 9B coding agents on a 16GB GPU, and a fourth part works out what a self-hosted coding LLM actually costs against the APIs and Copilot.