Benchmarking Qwen3.6-35B-A3B on a 16GB RTX 5060 Ti: A Full Engineering Teardown

Benchmarking Qwen3.6-35B-A3B on a 16GB RTX 5060 Ti: A Full Engineering Teardown

This is the engineering half of a two-part write-up, and it is intentionally dense. The strategic argument, why local inference became architectural insurance in 2026 after a frontier model was suspended worldwide by government order, lives in the companion article: Running an LLM Locally on a 16GB Consumer GPU. Here I document the bench, the method, every raw number and the relationships between them, so the claims there rest on something you can reproduce and audit. The work was inspired by a community write-up that ran the previous-generation Qwen3.5-35B-A3B at 160K context on the same card (njannasch.dev); this is a more controlled pass on the newer Qwen3.6, with VRAM accounting, a bandwidth roofline and a full agentic evaluation the original did not cover.

Disclosure: I use Claude and Fable in production daily. This benchmark is about open-weight local inference and is vendor-neutral.

The test bench

Every number below comes from one fixed rig and one fixed base configuration, so the comparisons are clean.

ComponentDetail
GPUNVIDIA RTX 5060 Ti, 16,311 MiB total. The CUDA runtime reports the device as 15,888 MiB with 15,751 MiB free, so ~423 MiB is reserved by the CUDA context and ~560 MiB is unavailable to the engine before a single weight loads
Memory bandwidth448 GB/s (the number the roofline section leans on)
CPU / RAMAMD Ryzen 5 4500, 6 cores / 12 threads (llama.cpp uses 6 compute threads), 31,889 MiB RAM, 24 GiB swap
OS / driverDebian 13, kernel 7.0.10, NVIDIA driver 610.43.02, CUDA toolkit 13.3, gcc 14.2
Enginellama.cpp mainline, commit 558e221, ggml 0.15.1
ModelQwen3.6-35B-A3B (MoE, 35B total / ~3B active per token, 262,144 native context, thinking model), Unsloth UD GGUF

That ~560 MiB idle gap is not pedantry: every ceiling in this article is measured against ~15,751 MiB of usable VRAM, not the 16,311 on the box. It is the difference between a config that loads and one that aborts.

Building llama.cpp for Blackwell

The RTX 5060 Ti is Blackwell, compute capability 12.0, so the build must target sm_120. The toolkit was already present; configure and build:

export PATH=/usr/local/cuda/bin:$PATH
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=120 \
  -DLLAMA_CURL=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j 8

CMake rewrites 120 to 120a (the architecture-specific variant) automatically. Configure took 4.1 s; the full build took about ten minutes on the 6-core CPU at -j 8. Two warnings are worth noting and are harmless here: OpenSSL was not found (HTTPS disabled in the bundled httplib, irrelevant for a localhost server) and NCCL was not found (multi-GPU only). The server's own system_info confirms the build is correct for this silicon:

CUDA : ARCHS = 1200 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | BLACKWELL_NATIVE_FP4 = 1
CPU  : AVX2 = 1 | FMA = 1 | F16C = 1 | LLAMAFILE = 1 | OPENMP = 1

BLACKWELL_NATIVE_FP4 = 1 is the line that matters: the build uses the card's native FP4 path. A generic CUDA binary that does not target sm_120a leaves measurable throughput on the table.

The model and the two quantizations

Qwen3.6-35B-A3B is a Mixture-of-Experts model: 35 billion total parameters, of which only about 3 billion are active per token. All weights must be resident in VRAM, but per-token compute touches a small slice, which is what makes a 35B-class model viable on 16 GB. The weights come from the unsloth/Qwen3.6-35B-A3B-MTP-GGUF repository, which bundles the Multi-Token Prediction head:

QuantizationExact size (bytes)Size (MiB)Size (GiB)
IQ3_S15,346,432,28814,63514.3
IQ3_XXS14,069,266,72013,41713.1

The 1,218 MiB difference between the two is the entire story of the context ceiling, and the VRAM-accounting section below shows exactly how that 1.2 GiB converts into KV-cache headroom.

Methodology

All measurements use llama-server with per-request options, no warmup, one sequence slot. The fixed base configuration:

-ngl 99 -fa on -np 1 --cache-type-k q4_0 --cache-type-v q4_0 -b 512 -ub 256 --no-warmup

A methodological note on -np 1: the server default spins up four sequence slots that share -c, which fragments KV accounting. I forced a single slot, both because the target is a single-user coding assistant and because Multi-Token Prediction is incompatible with -np > 1.

  • Throughput comes from llama.cpp's native timings fields on /completion: prompt_per_second is prefill, predicted_per_second is decode. These are the engine's own counters, not wall-clock approximations.
  • VRAM is read from nvidia-smi --query-gpu=memory.used while the server is resident; idle draw is 2 MiB, so the reading is the engine's footprint plus the CUDA context.
  • Spill detection greps the server logs for any CPU buffer allocation. Full-GPU placement means -ngl 99 offloaded everything and no CPU buffer appears.
  • Long-context retrieval uses a needle-in-a-haystack prompt: BANANA-42-ZULU at the very top of a document, the question at the very bottom. The first attempt used 9,000 filler lines (~231K tokens) and was rejected for exceeding context; the working prompt uses 3,000 lines, which tokenized to 79,605 prefill tokens measured. The model passes only if it reproduces the needle.
  • Agentic tool-calling uses a Python harness against /v1/chat/completions with --jinja, detailed in its own section.

Every test is a clean start, measure, kill cycle, so configurations never contaminate each other.

Result 1: the context ceiling is set by the compute buffer, not the weights

IQ3_S tops out at 160K tokens fully in VRAM, and the wall is the compute buffer, not the weights. Context probe on IQ3_S, -np 1:

ContextBatchOutcomeFailing allocationVRAM at load
163,840 (160K)512/256OK, no spilln/a15,776 MiB
196,608 (192K)512/256OOM506.27 MiB compute buffern/a
262,144 (262K)512/256OOM666.27 MiB compute buffern/a
262,144 (262K)2048/512OOM820.28 MiB compute buffern/a

Two things are pinned down here. First, the out-of-memory is never on the weights, which load fine, it is on the compute buffer that flash-attention needs for the forward pass. Second, batch size directly sets that buffer: at 262K it is 820 MiB with the default -b 2048, and 666 MiB with -b 512 -ub 256, a 154 MiB swing that is the difference between two configs that both still fail at 262K on this quant but tells you exactly which lever moves the ceiling. With roughly 500 MiB of headroom at 160K, an oversized batch spikes the compute buffer and the process aborts with no graceful degradation. On this card the batch size is not a tuning nicety, it is load-or-crash.

Result 2: IQ3_XXS reaches the full 262K native context

The lighter quantization, 1.2 GiB smaller, changes the ceiling completely:

QuantizationMax context fully in VRAMVRAM at that contextFree headroom
IQ3_S163,84015,776 MiB~535 MiB
IQ3_XXS262,14415,338 MiB~973 MiB

IQ3_XXS hits the model's full native 262,144-token context with almost 1 GiB still free, no spill, single sequence, in 18 seconds of load time. The KV cache is pre-allocated at load against -c, so this footprint is stable: it does not grow as the context fills during a session.

VRAM accounting, derived to the MiB

This is the relationship the headline tables hide. Subtract the known weight size from measured nvidia-smi usage and you isolate everything else (KV cache, compute buffer, CUDA context):

QuantContextModeWeights (MiB)Total used (MiB)Derived non-weight (MiB)
IQ3_S98,304base14,63515,256621
IQ3_XXS98,304base13,41714,038621
IQ3_S163,840base14,63515,7761,141
IQ3_XXS262,144base13,41715,3381,921
IQ3_S98,304+MTP14,63515,7121,077
IQ3_XXS98,304+MTP13,41714,4941,077

Three relationships fall out of this table, and they are the useful part:

  • The non-weight footprint is identical across quantizations at equal context (621 MiB at 98K for both IQ3_S and IQ3_XXS). The KV cache with q4_0 depends on the context length, not on the weight precision. This is why IQ3_XXS buys context, not speed: its only advantage is that smaller weights leave more of the fixed 15,751 MiB budget for that context-dependent KV.
  • KV scales with context, slightly super-linearly: 621 MiB at 98K, 1,141 at 164K, 1,921 at 262K. The growth above linear is the compute buffer, which also grows with sequence length on top of the per-token KV.
  • Multi-Token Prediction costs a measured +456 MiB (1,077 vs 621 at 98K), constant across both quantizations. The server's load-time log estimated this at 740 MiB (420 MiB MTP context plus a 320 MiB draft KV); the steady-state measurement is lower, but the practical consequence holds: on IQ3_S, where headroom at high context is thin, MTP caps usable context at roughly 96-100K.

Result 3: prefill and decode throughput

Measured against the real 79,605-token prompt, needle always recovered:

QuantContextModePrefill (t/s)Decode (t/s)VRAM (MiB)
IQ3_S96Kbase1,25958.515,256
IQ3_S96K+MTP1,18088.115,712
IQ3_S160Kbasen/a58.515,776
IQ3_XXS96Kbase1,24955.314,038
IQ3_XXS96K+MTP1,16989.714,494

Multi-Token Prediction is --spec-type draft-mtp --spec-draft-n-max 2, merged to mainline in May 2026. Note that MTP slightly lowers prefill (1,259 to 1,180 t/s) because the draft adds overhead to prompt processing, while it sharply raises decode. The two quantizations are within noise of each other on speed: IQ3_XXS is not slower, its win is context.

Decode falls as context fills, and MTP holds its gain

Decode throughput is not a single number, it depends on how full the KV cache is, because every generated token attends over the whole context. Putting the tool-calling runs (short ~1K prompts) next to the needle runs (80K prompt) isolates this:

QuantModeDecode at ~1K contextDecode at 80K contextDrop
IQ3_Sbase103.958.5-44%
IQ3_S+MTP154.188.1-43%

Filling 80K of KV costs about 44% of decode throughput, and MTP pays the same percentage tax while keeping its absolute lead. The speculative gain itself, measured across all four long-context and tool-calling runs, is remarkably consistent:

RunBase (t/s)MTP (t/s)Gain
IQ3_S, 80K context58.588.1+50.6%
IQ3_XXS, 80K context55.389.7+62.2%
IQ3_S, tool-calling103.9154.1+48.3%
IQ3_XXS, tool-calling94.3150.2+59.3%

Why speculative decoding works on this MoE: a roofline read

Decode on a quantized LLM is memory-bandwidth bound, so the question of whether speculation helps is a question of how much of the 448 GB/s the model already eats. With ~3B active parameters at ~3.5 bits per weight, each token reads roughly 1.3 GB of weights. At the measured 88 t/s that is about 115 GB/s, or 26% of the card's bandwidth. That large headroom is exactly what lets the MTP draft pass run essentially for free: there is spare bandwidth to verify speculated tokens.

Contrast a dense 14B model at Q4: ~7 GB of weights read per token, and at the ~30 t/s such a model managed on this card in earlier tests that is ~210 GB/s, near 47% of bandwidth, with far less slack. The same speculation that pays off on the sparse MoE tips a dense model toward bandwidth contention. The MoE is not just smaller per token, it is structurally a better fit for speculative decoding on a bandwidth-limited consumer card. This also explains why prefill (compute-bound, ~1,250 t/s) is never the bottleneck here: an 80K prompt prefills in ~63 seconds once, then the cache makes incremental turns near-instant.

Result 4: needle-in-a-haystack

Long-context retrieval passed in every configuration, at 80K of real context, both quantizations, base and MTP. As a thinking model the raw output sometimes carried a populated <think> block and sometimes an empty one (<think>\n\n</think>\n\nBANANA-42-ZULU), but the needle was always reproduced verbatim. Aggressive 3-bit quantization does not break retrieval here.

Result 5: the agentic tool-calling harness

This is the test a Continue-style plugin actually depends on, because a single malformed function call breaks an agentic loop. I built a Python harness against the OpenAI-compatible endpoint with --jinja, a Continue-style agentic system prompt and five tools (read_file, write_file, list_directory, search_codebase, run_terminal_command). It runs 10 scenarios x 5 repetitions with a varying seed (50 single-turn calls), plus 3 multi-turn scenarios (read a file, receive content, write the edit back), at temperature 0.3. Each response is scored into: valid exact-match, legitimate exploration-first, real error (wrong tool / malformed JSON / missing required argument), or correct abstention.

A standalone smoke test first confirmed the mechanism: a get_weather call returned finish_reason: tool_calls with an OpenAI-shaped function.name and arguments: {"city":"Turin"}, a generated call id, and crucially the thinking-model reasoning routed into a separate reasoning_content field, leaving content empty and the tool_calls array clean. The <think> never touches the function call, which is why no adapter is needed.

Aggregate results, 200 single-turn calls plus 12 multi-turn runs total:

QuantModeProtocol reliabilityReal errorsFormat errorsExact-matchExploration-firstMulti-turnTool decode (t/s)VRAM (MiB)
IQ3_Sbase100%0%082%18%100%103.914,736
IQ3_S+MTP100%0%080%20%100%154.115,034
IQ3_XXSbase100%0%090%10%100%94.313,518
IQ3_XXS+MTP100%0%090%10%100%150.213,816

The per-scenario breakdown (base configuration) shows the exact-match miss is not spread randomly, it concentrates on two scenarios where the model explores before acting:

ScenarioExpected toolIQ3_SIQ3_XXS
read_explicitread_file5/55/5
list_dirlist_directory5/55/5
run_testsrun_terminal_command0/50/5
search_symbolsearch_codebase / run5/55/5
create_filewrite_file1/55/5
read_before_fixread_file5/55/5
git_statusrun_terminal_command5/55/5
search_usagesearch_codebase / run5/55/5
list_then_readlist_directory / read5/55/5
no_tool (abstention)none5/55/5

How to read this:

  • Zero protocol errors and zero real errors across 200 single-turn calls. No malformed JSON, no nonexistent tool, no missing required argument, on either quantization, with or without MTP.
  • Every exact-match miss is run_tests or create_file, and the model called list_directory first. That is correct agentic behavior, the model orients before it acts, and it resolves on the next turn, which is exactly what the 100% multi-turn completion confirms.
  • The 82% vs 90% gap is sampling noise, not a quality ranking: IQ3_XXS happened to skip the orientation step on create_file (5/5) in this run, IQ3_S did not (1/5). Both are perfect on protocol and on the no-tool abstention scenario, which is the harder discipline (not calling a tool when none is needed).
  • MTP accelerates tool-calling by 48-59% at identical reliability, and adds only ~300 MiB at this short context.

Result 6: the autopsy of TurboQuant

The most useful part of any benchmark is the failure. TurboQuant is a KV-cache compression technique (Walsh-Hadamard rotation plus polar quantization, ICLR 2026) promising 3-6x KV compression and therefore much longer context on the same VRAM. It ships in a llama.cpp fork (commit b0e900a, ggml 0.9.11) that adds turbo2/turbo3/turbo4 cache types. I built it for sm_120a, another ten-minute build, and ran it on IQ3_S at 98K.

It works, and the KV cache really is tiny. The log:

llama_kv_cache: TurboQuant rotation matrices initialized (128x128)
llama_kv_cache: size = 375.00 MiB (98304 cells, 10 layers, 1/1 seqs),
                K (turbo3): 187.50 MiB, V (turbo3): 187.50 MiB
llama_kv_cache: upstream attention rotation disabled (TurboQuant uses kernel-level WHT)

375 MiB of KV at 98K, against the 621 MiB of non-weight footprint that q4_0 carries at the same context. The needle passed. But the speed collapsed:

ConfigurationPrefill (t/s)Decode (t/s)NeedleOutcome
turbo3/turbo3, no speculation1,17325.7OKworks, ~44% of q4_0 decode
turbo3/turbo3 + NextN, batch 256/12868435.9OKworks only with reduced batch
turbo3/turbo3 + NextN, batch 512/256crashcrashn/aCUDA OOM during prefill

With normal batch the NextN speculative path plus turbo3 decompression crashed during prefill, at token 40,960 of 98,304 (progress 0.514), with an unambiguous signature:

CUDA error: out of memory
  in function alloc at ggml-cuda.cu:505  cuMemCreate(...)
launch_fattn<256, 8, 8>(...)
ggml_cuda_flash_attn_ext_mma_f16_case<256, 256, 8, 8>(...)

The head dimension is the culprit. Qwen3.6 has an attention head size of 256, double what TurboQuant's 128x128 rotation matrices were tuned for, and the flash-attention MMA kernel at head-256 plus the draft model's buffers blows past 16 GB on the prefill spike. Forcing -b 256 -ub 128 avoids the crash but drops prefill to 684 t/s and decode to 35.9 t/s.

Put the ratios next to each other and the verdict is arithmetic:

  • turbo3 alone, 25.7 t/s, is 0.44x the q4_0 base decode (58.5 t/s).
  • turbo3 + NextN, 35.9 t/s, is 0.41x the q4_0 + MTP decode (88.1 t/s), even though NextN recovers +40% over plain turbo3.
  • The KV saving (375 vs 621 MiB, ~40% less) buys context you cannot use at acceptable speed.

The Walsh-Hadamard decompression cost at every attention step, on a card already bandwidth-limited, is not worth it. TurboQuant earns its keep only above the 262K native context, which is not an operating regime today. I removed the fork.

Result 7: abliterated models, coding quality and refusal

The same paradigm extends cleanly to abliterated (uncensored) models, with two questions added: does removing the safeguards degrade capability, and are these models usable for daily coding? I added two benches. Coding pass@1 with executable verification: 10 algorithmic tasks, the generated code block is extracted and run against assertions in an isolated subprocess with an 8s timeout and a guard that refuses to execute code matching dangerous patterns (os/subprocess/socket/eval/open). Refusal rate: 16 boundary prompts in mild/dual-use categories (controversial opinions, mild profanity, security education, info aligned models deflect), output truncated to ~80 tokens with thinking disabled, classified refuse-vs-comply. A benign control set validates the classifier. This is defensive model evaluation, on-prem and logged, not generation of operational content.

Two models, against the aligned base as reference, all at ctx 32K:

ModelCoding pass@1Refusal rateTool exact / protocolMulti-turnTool decodeVRAM
Qwen3.6-35B-A3B base (IQ3_S)90%15.4%82% / 100%100%104 t/s14.7 GB
Huihui-Qwen3.6-35B-A3B-abliterated (IQ3_S)90%0%80% / 100%100%101 t/s14.8 GB
Huihui-gpt-oss-20b-mxfp4-abliterated70%0%98% / 100%100%96 t/s12.5 GB

The findings:

  • Abliteration does not degrade Qwen3.6. At equal quantization, the abliterated model has identical coding (90% pass@1) and identical tool-calling (80% vs 82%, within sampling noise), with 100% protocol reliability and 100% multi-turn intact. The common worry that abliteration wrecks reasoning or structured output does not show up on these benches.
  • The aligned base is already lightly censored (15.4% refusal): it produces profanity, controversial arguments, and dual-use info (chemical hazards, weak passwords, buffer overflow, phishing), refusing only step-by-step SQL injection and an explicit jailbreak. Abliteration takes refusal to 0%. The lesson: abliteration's value is largest on natively well-aligned models, less on an already-open one like Qwen.
  • gpt-oss-20b abliterated is the weakest coder (70%, a 20B struggles on non-trivial algorithms) but the strongest agent (98% exact-match, it goes straight to the tool instead of exploring first), at the lowest VRAM (12.5 GB). Refusal 0%.
  • For a capable uncensored local coding assistant, the abliterated Qwen3.6 is the pick: it pays nothing in quality versus the base. gpt-oss-abliterated is the light, agentic fallback when VRAM headroom matters more than code quality.

The constraint hierarchy, all the data in one frame

Putting every measurement together, the system has a clear order of binding constraints on this card:

  1. Weights must be fully resident. A 35B MoE only fits because ~3B are active; even so, weights eat 13.4-14.6 GiB of the 15.75 GiB usable.
  2. KV cache, not weights, sets the context ceiling. It is precision-independent (621 MiB at 98K regardless of quant) and grows super-linearly, so the lighter IQ3_XXS reaches 262K and the heavier IQ3_S stops at 160K.
  3. The compute buffer is the actual OOM trigger, scaling with batch and sequence length (506-820 MiB at the ceiling), which is why -b 512 -ub 256 is mandatory.
  4. Decode is bandwidth-bound and falls ~44% as context fills; MTP buys back 48-62% but costs a fixed 456 MiB, so it trades context for speed.
  5. Protocol reliability is not a constraint at all here: 100% across 200 agentic calls, on both quantizations.

That hierarchy is the whole engineering story: you are juggling a fixed 15.75 GiB between weights, a context-driven KV cache, and a batch-driven compute buffer, while decode speed is a bandwidth budget you spend on either context length or speculative speed.

Reproducible commands

Download the weights (public repo, no auth):

curl -L -C - --retry 5 \
  -o /opt/models/Qwen3.6-35B-A3B-UD-IQ3_S.gguf \
  https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF/resolve/main/Qwen3.6-35B-A3B-UD-IQ3_S.gguf

Recommended agentic server (IQ3_S + MTP, 96K context):

/opt/llama.cpp/build/bin/llama-server \
  -m /opt/models/Qwen3.6-35B-A3B-UD-IQ3_S.gguf \
  -ngl 99 -fa on -c 98304 -np 1 \
  --cache-type-k q4_0 --cache-type-v q4_0 -b 512 -ub 256 \
  --jinja --spec-type draft-mtp --spec-draft-n-max 2 \
  --host 127.0.0.1 --port 11433

Maximum context (IQ3_XXS, 262K, base):

/opt/llama.cpp/build/bin/llama-server \
  -m /opt/models/Qwen3.6-35B-A3B-UD-IQ3_XXS.gguf \
  -ngl 99 -fa on -c 262144 -np 1 \
  --cache-type-k q4_0 --cache-type-v q4_0 -b 512 -ub 256 \
  --jinja --host 127.0.0.1 --port 11433

Takeaways

GoalConfigurationWhat you get
Agentic default (Continue), max speedIQ3_S + MTP, ~96K context~88 t/s long-context decode, ~154 t/s tool-calling, 100% protocol reliability, 15,712 MiB
Maximum contextIQ3_XXS, 262K basefull native context, ~55 t/s, ~973 MiB free
Medium context, no MTPIQ3_S, 160K base~58 t/s, simplest setup, 15,776 MiB
KV compression (TurboQuant)not recommended0.41x the speed of q4_0 + MTP, crashes at normal batch, off-mainline

Limits of this test

I measured tool-calling protocol reliability and long-context retrieval, not the absolute quality of generated code (no local SWE-bench run). Decode was measured at ~1K and ~80K of filled context, not at a full 262K (which needs a ~1 MB prompt). Tool scenarios used 5 repetitions per scenario: the throughput numbers are solid, the exact-match figure carries sampling noise, so the protocol-reliability and real-error columns are the ones to trust. Concurrency was not tested (-np 1), and MTP is incompatible with -np > 1. The derived non-weight VRAM figures fold the ~423 MiB CUDA context into the KV-plus-compute total; only TurboQuant exposed a pure KV number (375 MiB) directly. And 3-bit quantization keeps roughly 95% of full-precision capability against ~99% for 4-bit, which does not fit at long context on 16 GB.

The strategic case for why any of this is worth doing, sovereignty, the Fable suspension, forced retention overriding Zero Data Retention, is in the companion article: Running an LLM Locally on a 16GB Consumer GPU.

Ultima modifica: