On Arm CPUs, 8-bit can beat 4-bit

AikaLabs · 5 min read

The reflexive fix for a language model that runs too slowly on a CPU is to quantize it harder. Take the weights from 16 bits down to 8, and if it is still slow, down to 4. Smaller weights, less memory, faster math. That is the intuition, and on an Arm server it is often wrong. Eight-bit weights can run the compute-heavy part of inference faster than four-bit ones, and the single biggest speedup on the table is usually not the quantization at all but a build flag you forgot to set.

Where the time actually goes

Figure: Prefill (reading the prompt) is compute-bound, dominated by big matrix multiplies. Decode (writing the reply) is memory-bound: the whole model is streamed from memory, one token at a time.
Prefill is matrix-multiply heavy and loves int8 acceleration. Decode writes one token at a time and is limited by memory bandwidth, where a smaller model can win. The fastest quantization depends on which phase dominates your workload.

Running a model has two phases with opposite bottlenecks, and treating them as one is where most of the confusion starts. Prefill is the model reading your prompt. It is wall-to-wall matrix multiplication, and it is limited by how fast the chip can do that math. Decode is the model writing its answer one token at a time, and it is limited by memory bandwidth, because every single token has to stream the whole model out of memory. Arm's own profiling puts the share of time lost to memory stalls at roughly 10 percent during prefill and around 50 percent during decode (Arm).

Quantization plays differently in each. In decode, fewer bits means fewer bytes to move per token, so a 4-bit model is genuinely quicker. In prefill, the size of the weights matters less than how fast the hardware can multiply them. That is where Arm changes the answer.
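A back-of-the-envelope sketch of why fewer bits help decode: if every token streams all the weights once, memory bandwidth divided by model size caps tokens per second. The parameter count and bandwidth figure below are illustrative assumptions, not measurements. The bits-per-weight values follow from the Q8_0 and Q4_0 block layouts, where 32 weights share one 16-bit scale.

```python
# Rough upper bound on decode speed when every token streams all weights once.
# All inputs are illustrative assumptions, not measurements.

def decode_ceiling_tok_s(n_params: float, bits_per_weight: float,
                         bandwidth_gb_s: float) -> float:
    """Tokens/s ceiling = memory bandwidth / bytes of weights read per token."""
    model_bytes = n_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / model_bytes

N_PARAMS = 1e9          # hypothetical 1B-parameter model
BANDWIDTH_GB_S = 20.0   # hypothetical sustained memory bandwidth

# Q8_0: 32 int8 weights + one 16-bit scale = 8.5 bits per weight.
# Q4_0: 32 4-bit weights + one 16-bit scale = 4.5 bits per weight.
for name, bpw in [("Q8_0", 8.5), ("Q4_0", 4.5)]:
    ceiling = decode_ceiling_tok_s(N_PARAMS, bpw, BANDWIDTH_GB_S)
    print(f"{name}: at most {ceiling:.1f} tok/s")
```

The bound ignores the KV cache, activations, and compute time, so real decode speed sits below it. The ratio between the two formats is the part that carries over.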

The int8 matmul path

Arm cores that implement the i8mm extension (mandatory from Armv8.6 onward, optional on some earlier cores) carry an instruction, smmla, that multiplies small 8-bit integer matrices directly in hardware (Arm). For prefill, which is almost entirely matrix multiplication, that is the fast lane, and llama.cpp uses it, both through its own Arm kernels and through Arm's KleidiAI microkernel library.
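To make the instruction concrete, here is a pure-Python reference for what one smmla step computes, as I understand Arm's description: a 2x8 block of signed 8-bit values times the transpose of another 2x8 block, accumulated into a 2x2 block of 32-bit integers. This is an illustration only. Real hardware works on 128-bit vector registers and wraps on 32-bit overflow, which the sketch ignores.

```python
# Reference emulation of one SMMLA step: acc (2x2 int32) += A (2x8 int8) @ B^T,
# where B is also held as 2 rows of 8 int8 values. For illustration only.

def smmla(acc, a, b):
    """Return acc + a @ transpose(b) for 2x8 signed 8-bit inputs."""
    for row in a + b:
        assert len(row) == 8 and all(-128 <= v <= 127 for v in row)
    return [
        [acc[i][j] + sum(a[i][k] * b[j][k] for k in range(8)) for j in range(2)]
        for i in range(2)
    ]

a = [[1, 2, 3, 4, 5, 6, 7, 8], [-1, -2, -3, -4, -5, -6, -7, -8]]
b = [[1] * 8, [2] * 8]
acc = [[0, 0], [0, 0]]
print(smmla(acc, a, b))  # [[36, 72], [-36, -72]]
```

Each step performs 32 multiply-accumulates (2 x 2 x 8), which is why a prefill kernel built around it can keep up with weights that are already stored as 8-bit integers.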

The catch is that the fast lane only accepts certain weight formats. The int8 kernels work on weights stored as plain 8-bit blocks: Q8_0, and the 4-bit Q4_0 format once it is repacked. The popular K-quants, Q4_K_M and Q5_K_M, use a more elaborate block layout, and historically they had no int8 kernel, so they dropped to a slower general-purpose path. KleidiAI's accelerated path specifically covers the Q4_0 and Q8_0 formats (Arm).

Weight format    Fast int8-matmul path?
Q8_0             native int8 fast path
Q4_0             int8 fast path (repacked)
Q4_K_M           slower fallback kernel
Q5_K_M           slower fallback kernel
Only int8-shaped weights hit the fast int8-matmul kernels. K-quants historically fell back to a slower path. As of mid-2025 Arm has started closing that gap for some K-quants, so the real answer depends on your build.
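For readers who have not looked inside these formats, the following is a minimal sketch of Q8_0-style block quantization: 32 weights share one scale, and each weight becomes a signed 8-bit integer. It is not llama.cpp's code. The real format stores the scale as a 16-bit float and rounds slightly differently, and the input values here are made up.

```python
# Sketch of Q8_0-style block quantization: 32 weights share one scale,
# each weight becomes a signed 8-bit integer. Illustrative, not llama.cpp's code.

BLOCK = 32

def quantize_q8_0(weights):
    """Split into blocks of 32; return a list of (scale, int8 values) pairs."""
    assert len(weights) % BLOCK == 0
    blocks = []
    for start in range(0, len(weights), BLOCK):
        chunk = weights[start:start + BLOCK]
        amax = max(abs(w) for w in chunk)
        scale = amax / 127 if amax else 0.0   # largest magnitude maps to +/-127
        q = [round(w / scale) if scale else 0 for w in chunk]
        blocks.append((scale, q))
    return blocks

def dequantize_q8_0(blocks):
    """Rebuild approximate weights: value = scale * int8."""
    return [scale * v for scale, q in blocks for v in q]

weights = [((i * 37) % 64 - 32) / 100 for i in range(64)]  # made-up values
blocks = quantize_q8_0(weights)
restored = dequantize_q8_0(blocks)
print(max(abs(a - b) for a, b in zip(weights, restored)))  # worst-case error
```

Because the stored values are already int8, a kernel can feed them to the integer matmul instruction without an unpacking step, which is the property the table above is about.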

So the ranking inverts. A Q8_0 model gets the int8 fast path and keeps almost all of its quality. A Q4_K_M model is roughly half the size but takes the slow road through prefill. Even Q4_0, which does reach the int8 path once it is repacked, came out behind: on a 1B-parameter model on a modest 2-vCPU Arm box, Q8_0 ran the prompt at about 237 tokens per second against 171 for Q4_0. The bigger weights were the faster ones, because the int8 kernel can consume them directly instead of unpacking 4-bit values first.

One caveat, because this corner of the stack moves fast: in mid-2025 Arm contributed int8 kernels for Q4_K and Q6_K as well (Arm). On a current build the gap is narrower than the numbers above; on an old build it is wider. Which is really the lesson. The kernel that runs is a property of your build, not your model, so pin the version and measure.

The free speedup most builds miss

Figure: Prompt-processing throughput, 1B model, 2-vCPU Arm CPU. Q4_0: 171 tok/s. Q8_0: 237 tok/s. 8-bit is faster than 4-bit here, and turning on i8mm adds about another third.
On a 1B-parameter model on a 2-vCPU Arm server, 8-bit ran the prompt faster than 4-bit. Building with the int8 path on top added roughly another third, at no cost to quality.

The int8 path is opt-in at compile time, and that is easy to get wrong. Build llama.cpp without the right -march features, or without KleidiAI, and it quietly falls back to the slow kernels. Nothing warns you. Turning the int8 path on, same model and same machine, moved one of our prefill benchmarks from roughly 360 to 500 tokens per second, about a third faster for the price of a build flag. Get it wrong in the other direction and the cost is brutal: one reported misconfiguration dropped prompt processing by more than 80 percent, from 16.7 to 2.3 tokens per second, with no error to explain it (llama.cpp issue #10662).
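The percentages quoted in this article can be checked with a few lines of arithmetic. The inputs are the rounded figures from the text, so the results are approximate too.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100

# Figures quoted in this article (the source numbers are themselves rounded).
print(f"Q4_0 -> Q8_0 prefill: {pct_change(171, 237):+.1f}%")   # +38.6%
print(f"int8 path off -> on:  {pct_change(360, 500):+.1f}%")   # +38.9%
print(f"reported misconfig:   {pct_change(16.7, 2.3):+.1f}%")  # -86.2%
```

Both gains land a little over a third, and the reported misconfiguration costs roughly six sevenths of the original throughput.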

This is the kind of win that never shows up on a model card, because it is not about the weights. It is about whether the binary you built is actually using the instructions your CPU already has.
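A first sanity check, sketched under the assumption that the target runs Linux, is to confirm the CPU advertises the extension at all. On 64-bit Arm Linux, /proc/cpuinfo lists it as i8mm on the Features line. This is necessary but not sufficient: it says nothing about whether the binary was built to use it, so the benchmark still has the last word.

```python
# Check whether a Linux Arm machine advertises the int8 matmul extension.
# Assumes the "Features" line format of /proc/cpuinfo on 64-bit Arm Linux.

from pathlib import Path

def cpu_features(cpuinfo_text: str) -> set[str]:
    """Collect feature flags from every 'Features' line in cpuinfo text."""
    flags: set[str] = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "features":
            flags.update(value.split())
    return flags

def has_i8mm(cpuinfo_text: str) -> bool:
    """True if the i8mm flag appears among the advertised CPU features."""
    return "i8mm" in cpu_features(cpuinfo_text)

if __name__ == "__main__":
    path = Path("/proc/cpuinfo")
    if path.exists():
        print("i8mm advertised:", has_i8mm(path.read_text()))
    else:
        print("no /proc/cpuinfo here; run this on the Linux target")
```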

Three things that save you a bad benchmark

The prefill result is a prompt-throughput result, not an end-to-end one. If your workload is long prompts and short answers, prefill dominates and Q8_0 on the int8 path is hard to beat. If you are generating long passages one stream at a time, decode dominates, you are bandwidth-bound, and a smaller quant can pull ahead. Know which regime you are in before you choose.
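One way to find your regime is to split a request's time between the two phases. The sketch below does that from prompt length, answer length, and the two throughput rates. The rates are hypothetical placeholders, so substitute numbers measured on your own target machine.

```python
def phase_split(n_prompt: int, n_gen: int,
                prefill_tok_s: float, decode_tok_s: float):
    """Return (prefill seconds, decode seconds, prefill share of total time)."""
    t_prefill = n_prompt / prefill_tok_s
    t_decode = n_gen / decode_tok_s
    return t_prefill, t_decode, t_prefill / (t_prefill + t_decode)

# Hypothetical rates; replace with measurements from the target machine.
PREFILL_TOK_S = 200.0
DECODE_TOK_S = 20.0

for label, n_prompt, n_gen in [("long prompt, short answer", 4000, 50),
                               ("short prompt, long answer", 100, 1000)]:
    tp, td, share = phase_split(n_prompt, n_gen, PREFILL_TOK_S, DECODE_TOK_S)
    print(f"{label}: prefill {tp:.1f}s, decode {td:.1f}s, prefill share {share:.0%}")
```

With these made-up rates the first workload spends most of its time in prefill and the second almost all of it in decode, so the same machine would favor different quantizations for each.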

Quantize from the model's original precision. Most recent models are trained in bf16, and quantizing to Q8_0 straight from the bf16 source avoids an extra lossy step, while routing through fp16 first can quietly clip or round off values that fall outside fp16's narrower range (llama.cpp quantize docs).
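The reason an fp16 detour can lose information is that bf16 keeps float32's 8-bit exponent while fp16 has only 5 exponent bits. The sketch below emulates both formats with the standard library and pushes three made-up values through the detour. Whether any of this matters for a given model depends on its actual weight distribution, which the sketch does not model.

```python
import math
import struct

def to_bf16(x: float) -> float:
    """Round to bf16 precision: top 16 bits of float32, round-to-nearest-even."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def through_fp16(x: float) -> float:
    """Round-trip a value through IEEE half precision."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:  # magnitude beyond fp16's largest finite value, 65504
        return math.copysign(math.inf, x)

for value in (0.1, 1e-8, 70000.0):  # made-up values, not real weights
    b = to_bf16(value)
    print(f"bf16 {b!r} -> via fp16 {through_fp16(b)!r}")
```

The mid-range value survives exactly, the tiny one is flushed to zero, and the large one overflows to infinity.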

And measure on the actual target. This entire effect lives on Arm server CPUs. On an Apple Silicon Mac, llama.cpp runs on the GPU through Metal, a different path altogether, so a Mac will happily hide the regression and mislead you (bartowski). If you are shipping to Graviton or another Neoverse-class server, benchmark there.

The takeaway

The instinct to drop another bit is a memory optimization wearing a speed costume. On an Arm CPU the speed comes from matching what the silicon is good at, which is multiplying 8-bit integers, and the cheapest way to get it is to build correctly and benchmark on the machine you will actually run on. The smaller model is not always the faster one. Measure on the metal.

References and further reading