Inside the Groq 3 LPU: The Chip Designed for Agent-to-Agent Speed

By Imad Orabi Alnajjar | 2026-03-23 | research

One hundred tokens per second is fast enough for you. It is nowhere near fast enough for your agents. That single observation — that the bottleneck in agentic AI is not model quality but inference latency between machines that talk to each other — is the thesis etched into silicon inside NVIDIA's Groq 3 LPU. And it changes the economics of everything we are building.

When NVIDIA acquired Groq for roughly $20 billion in December 2025 [1], the purchase looked like a talent grab to most observers. Groq had impressive demos but limited production footprint. Six months later, the Groq 3 LPX rack — 256 LPUs packed into 32 liquid-cooled trays — shipped as the inference backbone for the Vera Rubin platform [1]. The numbers are staggering: 315 petaflops of FP8 compute at rack scale, 128 GB of on-chip SRAM, and 640 TB/s of chip-to-chip bandwidth. But numbers without context are just marketing. The question that matters is: what problem does this architecture actually solve that GPUs cannot?

The Speed Humans Don't Need

Here is the tension. GPUs are phenomenal at training and increasingly good at inference. Grace Blackwell NVL72 racks serve hundreds of tokens per second with respectable efficiency. For a chatbot answering a customer question, for a coding assistant streaming suggestions, for a summarization pipeline processing documents — that throughput is more than adequate.

But the workloads arriving in 2026 look nothing like chatbots.

Consider a multi-agent system where a planning agent decomposes a task into subtasks, routes each to a specialist agent, collects their outputs, synthesizes a response, and iterates. That is not one inference call. That is dozens of sequential inference calls, each blocking the next. At 100 tokens per second, a ten-step agentic chain with 500-token intermediate outputs takes 50 seconds. At 1,500 tokens per second, it takes 3.3 seconds. The difference is not incremental improvement — it is the difference between a system that feels broken and one that feels instant.
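The arithmetic above is worth making explicit. A minimal sketch, using only the numbers from the text (ten blocking steps, 500 tokens each):

```python
# Back-of-the-envelope latency for a sequential agent chain.
# Each step blocks the next, so step latencies add up.

def chain_latency_s(steps: int, tokens_per_step: int, tokens_per_s: float) -> float:
    """Total wall-clock time when each step must finish before the next starts."""
    return steps * tokens_per_step / tokens_per_s

slow = chain_latency_s(10, 500, 100)    # GPU-class serving speed
fast = chain_latency_s(10, 500, 1500)   # LPU-class serving speed

print(f"{slow:.1f}s vs {fast:.1f}s")  # 50.0s vs 3.3s
```

The point generalizes: because the chain is sequential, any per-call speedup multiplies through the entire pipeline rather than shaving a constant.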

This is the design target Groq 3 was built around. Not human reading speed. Agent-to-agent communication speed [1]. And that reframing has architectural consequences that ripple through every layer of the chip.

Tensor-First: Rethinking the Compute Unit

Most modern accelerators are organized around general-purpose cores that can be configured for different workloads. The Groq 3 LPU inverts this. Its compute model is tensor-first — the fundamental unit of work is a 320-byte vector, and the chip's three core modules are purpose-built around that primitive [1].

The MXM (Matrix Multiply Module) handles dense matrix operations — the bulk of transformer computation. The VXM (Vector Execution Module) handles pointwise operations: activations, normalizations, the element-wise math that sits between matrix multiplies. The SXM (Stream Execution Module) handles data movement — shuffling tensors between modules and between chips.

This is not a GPU with tensor cores bolted on. It is a dataflow architecture where the entire chip is a tensor processing pipeline. There is no instruction fetch-decode-execute cycle in the traditional sense. The compiler statically schedules every operation, every data movement, every cycle. The hardware executes that schedule deterministically.

The implication for inference latency is profound. On a GPU, dynamic scheduling introduces jitter — variable latency caused by cache misses, memory bank conflicts, warp scheduling decisions, and thermal throttling. That jitter is negligible for training (you are amortizing over billions of operations) but devastating for inference, where you care about the tail latency of every single token. The Groq 3 LPU eliminates this entire class of variance [1]. Per-token latency is not just low — it is predictable. And predictability is what you need when you are chaining dozens of inference calls sequentially.
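To make the static-scheduling idea concrete, here is a toy sketch: a "compiler" emits a fixed cycle-by-cycle schedule, and the "hardware" replays it with no runtime decisions. The module names follow the article; everything else (the schedule format, the stub operations) is invented for illustration and bears no relation to the real toolchain.

```python
from typing import Callable

# (cycle, unit, operation) triples, fully determined before execution.
Schedule = list[tuple[int, str, Callable[[dict], None]]]

def compile_layer() -> Schedule:
    """Statically assign every op to a unit and a cycle at compile time."""
    return [
        (0, "SXM", lambda s: s.update(x=s["inp"])),                     # stream input in
        (1, "MXM", lambda s: s.update(y=[2 * v for v in s["x"]])),      # matmul (stub)
        (2, "VXM", lambda s: s.update(z=[max(v, 0) for v in s["y"]])),  # ReLU (stub)
    ]

def run(schedule: Schedule, state: dict) -> dict:
    # Deterministic replay: order and timing are fixed; there is nothing
    # analogous to cache misses or warp scheduling to perturb them.
    for cycle, unit, op in sorted(schedule):
        op(state)
    return state

out = run(compile_layer(), {"inp": [-1, 3]})
print(out["z"])  # [0, 6]
```

The contrast with a GPU is that `run` has no branches that depend on runtime state: the same schedule always takes the same number of cycles.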

SRAM-First: The Memory Hierarchy That Isn't

The second architectural bet is equally radical. The Groq 3 LPU has 500 MB of on-chip SRAM per chip, scaling to 128 GB across a full rack, with 40 PB/s of aggregate bandwidth [1]. There is no L1. No L2. No HBM in the traditional sense. The on-chip memory is flat — one large, uniformly addressable SRAM block that Groq calls the MEM Block.

To understand why this matters, think about what happens during a typical transformer forward pass on a GPU. Weights live in HBM. Activations bounce between HBM and SRAM caches. The memory controller makes dynamic decisions about what to cache and what to evict. Every cache miss stalls the pipeline. Every HBM access costs 10-50x the energy and latency of an SRAM access.

The MEM Block eliminates cache misses entirely because there is no cache — everything the chip needs for its assigned computation is already in SRAM, placed there by the compiler ahead of time [1]. The compiler knows the full execution schedule. It knows exactly which bytes are needed at which cycle. It pre-stages everything.

This is where the "but 500 MB is tiny" objection comes in. And it is a fair objection: a single Llama 3 70B model in FP8 needs roughly 70 GB. You cannot fit that in one chip's 500 MB, and the largest frontier models exceed even the 128 GB of aggregate rack SRAM. The answer to this objection is the third architectural innovation, and arguably the most important one.
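The objection rests on simple arithmetic, since FP8 stores one byte per parameter:

```python
# Weight-memory arithmetic behind the "500 MB is tiny" objection.
# FP8 stores one byte per parameter.

def weight_bytes(params: float, bytes_per_param: int = 1) -> float:
    return params * bytes_per_param

GB = 1e9
print(weight_bytes(70e9) / GB)        # 70.0 -> a 70B model needs ~70 GB in FP8
print(weight_bytes(70e9) > 0.5 * GB)  # True: far beyond one chip's 500 MB of SRAM
```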

Attention-FFN Disaggregation: Splitting the Transformer in Half

The Groq 3 LPX does not try to run an entire transformer forward pass on LPUs. Instead, NVIDIA introduced Attention-FFN Disaggregation (AFD) — a serving architecture that splits the transformer along its natural fault line [1].

Every transformer layer has two major blocks: attention (which reads from and writes to the KV cache, handling the model's memory of context) and the feed-forward network (which transforms each token's representation independently). These two blocks have radically different computational profiles.

Attention is memory-bound. It needs access to the full KV cache, which grows linearly with sequence length. For a 128K-context model, the KV cache alone can consume tens of gigabytes. Attention also has irregular memory access patterns — each token attends to different positions depending on the attention mask.
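The "tens of gigabytes" figure follows directly from the cache's dimensions. The configuration below is an assumption modeled on a Llama-3-70B-like architecture (80 layers, 8 KV heads, head dimension 128), not a published Groq or NVIDIA number:

```python
# KV-cache sizing for long contexts, in FP8 (1 byte per element).

def kv_cache_bytes(seq_len: int, n_layers: int = 80, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_elem: int = 1) -> int:
    # 2x for keys and values, per layer, per KV head, per position.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

gb = kv_cache_bytes(seq_len=128 * 1024) / 1e9
print(f"{gb:.1f} GB")  # ~21.5 GB for a single 128K-token sequence
```

And that is one sequence; a serving node batching many long-context requests multiplies this accordingly.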

Feed-forward layers are compute-bound. They are large matrix multiplications with predictable, regular access patterns. They are also where Mixture-of-Experts (MoE) architectures spend most of their time — routing tokens to different expert sub-networks.

AFD assigns each block to the hardware best suited for it. Vera Rubin GPUs handle attention — they have the HBM capacity for massive KV caches and the flexible memory hierarchy for irregular access patterns. Groq 3 LPUs handle feed-forward and MoE expert layers — they have the deterministic execution model and SRAM bandwidth for latency-sensitive, compute-dense, regular matrix operations [1].

NVIDIA Dynamo, the orchestration layer, manages the disaggregated serving pipeline: routing tokens between GPU attention nodes and LPU feed-forward nodes, balancing load, handling batching [1]. The result is that each hardware type does only what it is best at, and neither wastes cycles on workloads it was not designed for.
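Schematically, the disaggregated forward pass looks like the sketch below: per layer, attention runs on a GPU pool (with the KV cache in HBM) and the feed-forward block runs on an LPU pool. Every class and method name here is a hypothetical stand-in for illustration, not the Dynamo API.

```python
class GPUAttentionNode:
    def attention(self, layer: int, hidden: list[float]) -> list[float]:
        return hidden  # stub: would read/update the KV cache in HBM

class LPUFFNNode:
    def ffn(self, layer: int, hidden: list[float]) -> list[float]:
        return [2 * h for h in hidden]  # stub: dense/MoE matmuls in SRAM

def forward(hidden: list[float], n_layers: int,
            gpu: GPUAttentionNode, lpu: LPUFFNNode) -> list[float]:
    # The orchestrator shuttles activations between the two pools each layer,
    # so the interconnect cost is paid twice per layer.
    for layer in range(n_layers):
        hidden = gpu.attention(layer, hidden)
        hidden = lpu.ffn(layer, hidden)
    return hidden

print(forward([1.0], n_layers=3, gpu=GPUAttentionNode(), lpu=LPUFFNNode()))  # [8.0]
```

The loop makes the constraint visible: activations cross the fabric on every layer, which is why the interconnect design discussed next is load-bearing.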

This is not just clever engineering. It is a fundamental rethinking of what "running a model" means. We have spent years treating inference as a monolithic operation — give a chip the model, give it a prompt, get tokens out. AFD says: the model is a pipeline, and different stages of that pipeline should run on different silicon.

The Chip-to-Chip Fabric

Disaggregation only works if the interconnect between chips is fast enough that splitting the computation does not introduce more latency than it saves. This is where the Groq 3 LPU's interconnect design earns its keep.

Each LPU has 96 chip-to-chip links running at 112 Gbps each, providing 2.5 TB/s of bandwidth per chip and 640 TB/s at rack scale [1]. These links use a plesiochronous clocking scheme — the chips run at nearly identical frequencies but are not phase-locked, which simplifies the physical design while maintaining deterministic communication timing.
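The quoted figures are worth checking. The per-chip number works out if the 2.5 TB/s counts both directions of each link; that interpretation is an assumption on my part, not stated in the source:

```python
# Link-bandwidth arithmetic for the quoted interconnect figures.

links_per_chip = 96
gbps_per_link = 112
chips_per_rack = 256

unidir_tb_s = links_per_chip * gbps_per_link / 8 / 1000  # Gb/s -> TB/s
print(f"{unidir_tb_s:.2f} TB/s per chip, one direction")  # ~1.34
print(f"{2 * unidir_tb_s:.2f} TB/s full duplex")          # ~2.69, near the quoted 2.5
print(f"{chips_per_rack * 2.5:.0f} TB/s rack aggregate")  # 640, matching the source
```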

High-radix interconnects are not new. What is new is the combination of high radix with deterministic scheduling. Because the compiler knows the full execution schedule, it also knows exactly when each chip will send and receive data. There is no arbitration, no contention, no backpressure. Data arrives when it is expected to arrive. This is what makes it possible to distribute a feed-forward computation across multiple LPUs and still maintain single-digit-microsecond latency between steps.

For context, GPU-to-GPU communication over NVLink involves arbitration, flow control, and variable latency. It is fast — hundreds of GB/s — but not deterministic. When you are chaining 80+ layers of a transformer, each requiring a communication step, that per-step variance compounds.
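A toy simulation shows how that compounding works. The latency and jitter numbers below are illustrative only, chosen to make the effect visible, not measured values for either fabric:

```python
import random

def token_latency_us(steps: int = 80, base_us: float = 2.0,
                     jitter_us: float = 0.0, seed: int = 0) -> float:
    """Total hop latency across all layers for one token."""
    rng = random.Random(seed)
    # Each communication step pays the base hop latency plus a random jitter term.
    return sum(base_us + rng.uniform(0, jitter_us) for _ in range(steps))

det = token_latency_us(jitter_us=0.0)  # deterministic fabric: always the same
dyn = token_latency_us(jitter_us=2.0)  # arbitrated fabric: jitter accumulates
print(f"{det:.0f}us deterministic vs ~{dyn:.0f}us with jitter")
```

With 80 steps, even a couple of microseconds of per-hop variance inflates and spreads the per-token latency distribution, which is exactly the tail behavior sequential agent chains cannot tolerate.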

35x Per Megawatt: The Number That Changes the Business Case

The headline performance claim is 35x higher throughput per megawatt compared to Grace Blackwell NVL72 at 400 tokens per second [1]. Strip away the marketing, and this number reflects three things.

First, SRAM is dramatically more energy-efficient than HBM. A bit read from on-chip SRAM costs one to two orders of magnitude less energy than a bit read from HBM. When your entire working set fits in SRAM, you eliminate the single largest energy cost in inference.
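A rough per-token model makes the magnitude concrete. The pJ/byte figures below are ballpark assumptions consistent with the 10-50x HBM-vs-SRAM range cited earlier, not vendor measurements:

```python
# Illustrative per-token memory-energy model for weight streaming.

SRAM_PJ_PER_BYTE = 1.0   # assumed on-chip SRAM access cost
HBM_PJ_PER_BYTE = 30.0   # assumed HBM access cost

def memory_energy_j(bytes_moved: float, pj_per_byte: float) -> float:
    return bytes_moved * pj_per_byte * 1e-12

weights = 70e9  # read 70 GB of FP8 weights once per generated token
print(f"{memory_energy_j(weights, HBM_PJ_PER_BYTE):.2f} J/token from HBM")
print(f"{memory_energy_j(weights, SRAM_PJ_PER_BYTE):.2f} J/token from SRAM")
```

At datacenter scale, multiplying a per-token difference like this across billions of tokens per day is where the per-megawatt claim gets its leverage.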

Second, deterministic execution eliminates the energy wasted on speculation, branch prediction, cache coherency, and dynamic scheduling. These mechanisms exist because general-purpose processors do not know their workload in advance. The LPU does.

Third, liquid cooling at rack scale allows higher sustained clock rates and denser packing. The 32-tray, 1U-per-tray design is engineered for datacenter density — 256 chips in a single rack [1].

The per-megawatt metric matters because inference is becoming the dominant cost in AI infrastructure. Training a frontier model is a one-time (well, periodic) expense. Serving it to millions of users — and increasingly, to millions of agents — is an ongoing operational cost. NVIDIA projects $1 trillion in Vera Rubin orders through 2027 [1], and that projection is built on the assumption that inference demand will dwarf training demand as agentic workloads scale.

For engineering teams making infrastructure decisions, the 35x efficiency claim translates directly to cost. If you are running a multi-agent system that makes thousands of inference calls per user request, and you can serve those calls at 35x lower energy cost, your unit economics shift from "interesting research project" to "viable production system."

What This Means for Agent Architecture

The existence of hardware purpose-built for agent-to-agent speed has second-order effects on how we design agent systems.

Tighter feedback loops become viable. Today, most multi-agent architectures minimize the number of inference calls because each call is expensive and slow. Agents are designed to do as much as possible in a single pass. With 1,500+ tokens per second and predictable latency, you can afford architectures where agents have short, frequent exchanges — more like a conversation than a batch job. Debate-style architectures, where multiple agents critique and refine each other's outputs, become practical at interactive speeds.

Streaming agent protocols get interesting. When token generation is fast and deterministic, downstream agents can start processing partial outputs before the upstream agent finishes. You stop thinking about agent communication as request-response and start thinking about it as streaming pipelines. This is a qualitative shift in system design, not just a quantitative speedup.
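The request-response-to-streaming shift described above can be sketched with plain generators: the downstream agent consumes tokens as they arrive instead of waiting for the full upstream response. The agent logic is stubbed; only the hand-off pattern is the point.

```python
from typing import Iterator

def upstream_agent() -> Iterator[str]:
    for token in ["plan:", "fetch", "data,", "then", "summarize"]:
        yield token  # emitted as soon as it is generated

def downstream_agent(tokens: Iterator[str]) -> Iterator[str]:
    for token in tokens:
        # Begins work on partial output; no request-response barrier.
        yield token.upper()

print(" ".join(downstream_agent(upstream_agent())))
# PLAN: FETCH DATA, THEN SUMMARIZE
```

Nothing here requires special hardware, but the pattern only pays off when upstream generation is fast enough that downstream work overlaps meaningfully with it.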

MoE models become the default for agentic inference. AFD is purpose-built for Mixture-of-Experts. The LPU's regular compute patterns and deterministic scheduling are ideally suited for routing tokens to expert sub-networks. As MoE architectures continue to dominate the frontier (they already power most of the largest models in production), hardware that is optimized for sparse expert execution will have a compounding advantage.

The cost curve bends toward always-on agents. At 35x better efficiency per megawatt, the marginal cost of keeping an agent "thinking" drops dramatically. Architectures where agents run persistent background reasoning — monitoring, planning, pre-computing likely responses — become economically rational rather than extravagant.

The Compiler Is the Product

There is one aspect of the Groq 3 architecture that does not appear in any spec sheet but may matter more than any of the numbers: the compiler.

Deterministic execution means the compiler must produce a complete, cycle-accurate schedule for every operation and every data movement across every chip in the system. This is not a conventional compiler optimization problem. It is closer to place-and-route in chip design — a combinatorial optimization over millions of operations with hard timing constraints.

The quality of the compiler directly determines the performance of the hardware. A chip that can theoretically deliver 315 PFLOPS but whose compiler can only effectively schedule 60% of its compute units delivers 189 PFLOPS in practice. This is why the Groq acquisition was not just about the silicon — it was about the compiler team and their years of experience with deterministic scheduling on previous-generation LPUs.

For developers, this is mostly invisible. You send a model and a request to an API, and tokens come back fast. But it is worth understanding that the performance ceiling of this architecture is set by software, not hardware. As the compiler improves, the same silicon gets faster — something that is not true of conventionally scheduled architectures, which are already extracting most of their theoretical performance.

The Bet

The Groq 3 LPU is a bet on a specific future: one where the dominant AI workload is not a human typing a question and reading an answer, but a fleet of agents talking to each other at machine speed, making decisions, delegating tasks, and synthesizing results in tight loops that would be intolerably slow on hardware designed for human interaction.

It is a bet that inference efficiency matters more than training efficiency for the next phase of AI infrastructure. That deterministic execution beats dynamic scheduling when latency variance is the enemy. That splitting the transformer along its natural fault line and running each half on purpose-built silicon beats running the whole thing on one general-purpose chip.

These are not safe bets. GPUs are improving rapidly. Attention mechanisms are getting more efficient. New architectures may eliminate the attention-FFN split entirely. But as of March 2026, with multi-agent systems moving from research demos to production deployments, the Groq 3 LPU is the first piece of silicon that takes the agent-to-agent communication bottleneck seriously — and builds an entire architecture around eliminating it.

The chips are shipping. The racks are filling datacenters. And the agents are waiting for hardware that can keep up with them. That wait just ended.


References

[1] NVIDIA Developer Blog, "Inside NVIDIA Groq 3 LPX: The Low-Latency Inference Accelerator."