4 Comments
Pashenko's avatar

Great article. The first thing I learned when I started studying AI GPUs is that the bottleneck is memory: you have to gang multiple NVIDIA GPUs together to pile up the 2.05 TB of RAM needed to load a 1T-parameter model like Kimi K2.5 without quantisation. And now Cerebras is betting their whole business on the "10x-faster but 10x-smaller" model. Cerebras isn't even compatible with HBM, apparently. Let's see.
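
For anyone who wants to check the arithmetic behind that memory figure, here is a rough back-of-envelope sketch. It assumes 2 bytes per parameter (16-bit weights, no quantisation) and counts weights only, so KV cache, activations and runtime overhead are ignored. The function names and the 80 GB and 44 GB per-device capacities are illustrative assumptions, not vendor specs quoted in the article.

```python
# Back-of-envelope: how many devices are needed just to hold the weights?
# All capacities below are illustrative assumptions, not quoted specs.

GB = 10**9  # decimal gigabyte, in bytes


def weights_bytes(n_params: int, bytes_per_param: int) -> int:
    """Memory needed for the weights alone (no KV cache, no activations)."""
    return n_params * bytes_per_param


def devices_needed(total_bytes: int, device_bytes: int) -> int:
    """Smallest number of devices whose combined memory covers total_bytes."""
    if device_bytes <= 0:
        raise ValueError("device_bytes must be positive")
    # Integer ceiling division, avoids floating-point rounding.
    return -(-total_bytes // device_bytes)


one_trillion = 10**12
total = weights_bytes(one_trillion, bytes_per_param=2)  # 2 * 10**12 bytes = 2 TB

hbm_gpus = devices_needed(total, 80 * GB)      # assumed 80 GB of HBM per GPU
sram_wafers = devices_needed(total, 44 * GB)   # assumed 44 GB of SRAM per wafer

print(f"weights: {total / 10**12:.2f} TB")
print(f"80 GB devices needed: {hbm_gpus}")
print(f"44 GB devices needed: {sram_wafers}")
```

Under those assumptions the weights alone come to 2 TB, which is in the same ballpark as the 2.05 TB quoted above. The device counts are a lower bound, since real deployments also need room for the KV cache and activations.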

The Synthesis's avatar

Right, they traded HBM for on-wafer SRAM: huge bandwidth, tiny capacity. That's why a 1T model has to be sharded across a rack of wafers instead of stacked in one box, which ties straight into the tensor-vs-pipeline parallelism tradeoff the piece walks through. The bet is that interconnect latency, not raw capacity, is what caps inference speed. Whether that holds as models keep growing is the part worth watching.

Ex-Consultant in Tech's avatar

Love the article. The interesting Cerebras question is whether they can create a new premium inference category where latency is the product.

Most inference probably does not need this. A coding agent, research agent, or enterprise workflow can generate tokens faster and still feel slow because the bottleneck is somewhere else: tool calls, retrieval, verification, approvals, retries, crappy SaaS APIs. In those cases, faster decode is nice but not necessarily monetizable.
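
The "bottleneck is somewhere else" point is basically Amdahl's law, and a small sketch makes it concrete. The function name and the 20% decode share below are hypothetical numbers picked for illustration, not measurements from any real agent or workflow.

```python
# Amdahl's law applied to an agent loop: only the decode share gets faster.
# The fractions used in the example are made up for illustration.


def end_to_end_speedup(decode_fraction: float, decode_speedup: float) -> float:
    """Overall speedup when only the decode portion of wall-clock time is accelerated.

    decode_fraction: share of total time spent decoding tokens (0.0 to 1.0).
    decode_speedup:  how many times faster decoding becomes (> 0).
    """
    if not 0.0 <= decode_fraction <= 1.0:
        raise ValueError("decode_fraction must be between 0 and 1")
    if decode_speedup <= 0:
        raise ValueError("decode_speedup must be positive")
    remaining = (1.0 - decode_fraction) + decode_fraction / decode_speedup
    return 1.0 / remaining


# Hypothetical workflow: 20% of the time is decode, 80% is tool calls,
# retrieval, approvals and retries.
mostly_waiting = end_to_end_speedup(decode_fraction=0.2, decode_speedup=10.0)

# Hypothetical workload where decode is the whole loop.
decode_bound = end_to_end_speedup(decode_fraction=1.0, decode_speedup=10.0)

print(f"decode is 20% of the loop: {mostly_waiting:.2f}x overall")
print(f"decode is 100% of the loop: {decode_bound:.2f}x overall")
```

With those made-up numbers, a 10x faster decode gives roughly a 1.22x faster workflow when decode is only a fifth of the loop, and the full 10x only when decode is the entire loop.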

The Synthesis's avatar

https://thesynthesisai.substack.com/p/the-bottleneck makes your case empirically: AI cut reporting time 28% and moved time-to-diagnosis by exactly zero, because scheduling and handoffs ate the gains. Cerebras earns its premium only where decode is the whole loop, like live trading or agents that can't batch. Everywhere else the saved milliseconds just queue behind the next approval.