Curing GPU Brain: Why batching destroys CPU inference

Here's a question. Google how to speed up machine learning inference, and every tutorial, optimization guide, and Stack Overflow answer tells you the same thing: batch your inputs. So how did the team behind Manticore Search ship a 14x speedup on local ONNX embeddings by explicitly turning batch processing off?

When they published this field report, the Hacker News reaction was a mix of surprise and sudden clarity, because the idea that processing one item at a time can be faster than processing a batch violates a decade of inherited wisdom.

I call this inherited wisdom GPU Brain.

For the better part of ten years, developers building against cloud APIs have been trained that batching is the universal speedup hack. It's a mental model handed down from massive parallel architectures, where throughput is king and hardware is infinitely wide. We've spent so long designing software for massive data centers that we just assume the laws of physics at cloud scale apply cleanly to the edge.

They don't.

A GPU is essentially a stadium full of tiny calculators working simultaneously. A CPU is a small team of incredibly fast savants, each reading a ticker tape a few numbers at a time. Hugging Face's own documentation on the topic points out that CPUs are inherently sequential.

And this architectural difference introduces a massive, hidden penalty when we port our cloud habits to local devices. I think of it as the Padding Tax.

Let's do the math.

Imagine you're building a local agent. It needs to embed four sentences to search a vector database. Your text chunks naturally vary in length. Let's say they are 10, 20, 100, and 50 tokens long, respectively.

If you count them up, that is exactly 180 tokens of actual, meaningful data.

If you process these sequentially on a CPU, literally just looping over the four inputs one at a time, the processor grinds through 180 tokens' worth of work. Call it 180 operations. It looks at a token, does the math, and moves to the next.

But GPU Brain tells you to batch them. So you grab your library of choice and hand the four strings to ONNX or PyTorch as a single batch.

Here is the problem. Tensors cannot be jagged. The underlying math of deep learning demands perfect rectangles. To process these four sentences together, the matrix must be uniform. So the tokenizer or batching layer quietly steps in and pads your shorter inputs with filler tokens, typically zeroes, until they match the length of the longest input in the batch.

Your four sentences just became a 4x100 matrix.

Your 180 tokens just became 400 tokens.

[Diagram: the 4x100 matrix, with the 180 real tokens shaded solid blue and the 220 Padding Tax zeroes left in empty gray.]
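To make the padding step concrete, here is a minimal Python sketch of what a tokenizer or batching layer does before the tensor gets built. The pad_batch helper, the PAD_ID value of 0, and the stand-in token ids are all illustrative assumptions, not any particular library's API.

```python
# Hypothetical pad id for illustration; real tokenizers define their own.
PAD_ID = 0


def pad_batch(sequences, pad_id=PAD_ID):
    """Pad lists of token ids to the longest length in the batch.

    Returns (input_ids, attention_mask), where the mask holds 1 for a
    real token and 0 for padding.
    """
    width = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (width - len(seq)) for seq in sequences]
    attention_mask = [
        [1] * len(seq) + [0] * (width - len(seq)) for seq in sequences
    ]
    return input_ids, attention_mask


# Stand-in token ids for sentences of 10, 20, 100 and 50 tokens.
lengths = [10, 20, 100, 50]
sequences = [[7] * n for n in lengths]

input_ids, attention_mask = pad_batch(sequences)
rows, cols = len(input_ids), len(input_ids[0])
real_tokens = sum(sum(row) for row in attention_mask)

print(f"shape: {rows}x{cols}, real tokens: {real_tokens}")
# prints: shape: 4x100, real tokens: 180
```

The attention mask tells the model which positions to ignore in its output, but the padded positions still occupy space in the rectangle.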

Now watch what happens when this hits the hardware.

If you send that 4x100 matrix to an Nvidia H100, the GPU computes all 400 positions in parallel, in what amounts to a single pass. It doesn't care that more than half the tensor is empty space. At this size, the time cost is essentially flat.

But a CPU has to step through that matrix sequentially.

When you batch these variable-length strings locally, you are forcing your processor to sequentially multiply 220 zeroes, one at a time. The math here is brutal and unavoidable. Running 180 operations back-to-back is fundamentally faster than running 400 operations in a batch, because more than half the batch is literal nothingness.

That's the Padding Tax. You are paying expensive computational cycles to process empty space.
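As a rough sketch of the Padding Tax in code, the hypothetical padding_tax helper below counts token positions only. That is a simplification, since real transformer cost does not grow purely linearly with sequence length, but it is enough to reproduce the arithmetic above.

```python
def padding_tax(lengths):
    """Return (real, padded, wasted, wasted_share) for one padded batch.

    real   = token positions that carry actual data
    padded = positions in the rectangle after padding to the longest input
    """
    real = sum(lengths)
    padded = len(lengths) * max(lengths)
    wasted = padded - real
    return real, padded, wasted, wasted / padded


real, padded, wasted, share = padding_tax([10, 20, 100, 50])
print(f"{real} real, {padded} padded, {wasted} wasted ({share:.0%})")
# prints: 180 real, 400 padded, 220 wasted (55%)
```

A batch of equal-length inputs, such as padding_tax([50, 50, 50, 50]), pays no tax at all, which is why the penalty only shows up on variable-length text.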

And the tax compounds. Because there is a second symptom of GPU Brain buried deep in the defaults of our standard machine learning libraries: the Anxious Thread.

If you look at the ONNX Runtime documentation for performance tuning, you'll notice a thread management setting for intra-op spinning, exposed as the session config entry session.intra_op.allow_spinning. By default, this is turned on.

What does it do? It forces the CPU threads to aggressively spin-wait. They sit there polling constantly, burning cycles to look for parallel work so they can grab it the microsecond it appears. In a server rack, this makes perfect sense. You want those threads hungry and ready to catch the massive volume of parallel operations raining down from a web server.

But on a laptop running sequential text processing? That parallel work simply doesn't exist.

The CPU burns local cycles, and your battery, polling frantically for tasks that are never coming. It's a concurrency optimization that actively cannibalizes local performance. Disabling this spinning behavior was the second half of the massive Manticore speedup.
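For a sense of what both halves might look like together in Python, here is a hedged sketch. It is not Manticore's actual code. The config key session.intra_op.allow_spinning is the entry name as I read it in the ONNX Runtime docs, while SPIN_OFF, make_session_options, embed_one_at_a_time, and the model.onnx path are hypothetical names for illustration.

```python
# Session config entry for intra-op spin-waiting; "0" turns it off.
SPIN_OFF = {"session.intra_op.allow_spinning": "0"}


def make_session_options():
    """Build ONNX Runtime SessionOptions with intra-op spinning disabled."""
    # Imported lazily so the rest of the module loads without onnxruntime.
    import onnxruntime as ort

    options = ort.SessionOptions()
    for key, value in SPIN_OFF.items():
        options.add_session_config_entry(key, value)
    return options


def embed_one_at_a_time(session, feeds):
    """Run each input as its own batch of one, so nothing gets padded.

    `feeds` is a list of input dicts, one per text, each already shaped
    for a batch of one. The input names depend on the model.
    """
    return [session.run(None, feed) for feed in feeds]
```

Assuming a local model file, usage would be along the lines of session = ort.InferenceSession("model.onnx", sess_options=make_session_options()), followed by embed_one_at_a_time(session, feeds). Whether this helps, and by how much, depends on the model and the machine, so measure both ways before committing.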

As we push more AI from the cloud down to edge devices, our cloud-native instincts become active liabilities. We have to stop designing local software using mental models inherited from server farms. The architectural optimizations that make a massive API request fast are the exact same behaviors that make a consumer laptop grind to an agonizing halt.

We have to relearn how to write software for the hardware actually sitting on the desk.

Turns out the greatest performance hack in local AI is making your data take a number and stand in line.

Unless you're paying for cloud GPUs.

Then by all means, compute the zeroes.