Hardware/Model Co-Design in Modern AI

Beyond "why", let's talk about speculative decoding and the KV cache

Dec 02, 2025

If you’re building AI systems today, it’s tempting to reach for whatever model API is cheapest or grab whichever open-source model has the most GitHub stars. Maybe you’re fine-tuning LLaMA, running inference on whatever GPUs you can get access to, and calling it a day. That’s totally reasonable for getting stuff done right now.

But beneath the surface, the inference chip landscape is fragmenting in fascinating ways. Groq’s LPUs are achieving 241+ tokens per second through deterministic scheduling. Cerebras is running models on wafer-scale chips that eliminate inter-chip communication entirely. These aren’t just “faster GPUs.” They represent fundamentally different computational paradigms, and they’re going to change what kinds of model architectures actually work well in production.

This matters because the next generation of models won’t just be “bigger transformers.” The architectural innovations that define the next few years will be shaped by what these diverse chips can actually do efficiently. If you’re doing research on model architectures, thinking about what to build next, or just want to understand where this field is headed, the hardware landscape needs to be part of your mental model. The chips aren’t just the substrate we run models on anymore. They’re actively constraining and enabling what’s possible.

Image source: Nano Banana :)

Why Inference Is Different from Training

Training and inference have fundamentally different computational characteristics, and modern chip designs reflect this divergence. During training, we’re compute-bound. Performing backpropagation across massive batches benefits from the parallel processing power of thousands of GPU cores working simultaneously. During inference, particularly for autoregressive generation, we’re memory-bound. Each token requires loading model weights from memory, and the GPU cores often sit idle waiting for data.

This memory bottleneck is why identical FLOPS ratings can produce wildly different inference performance across chip architectures. The key metric shifts from raw computational throughput to memory bandwidth and how efficiently weights can be fed to compute units.
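A back-of-the-envelope calculation makes the memory-bound claim concrete. The sketch below assumes that at batch size 1 every generated token has to stream all model weights from memory once, so bandwidth divided by model size caps tokens per second. The 70B-parameter model, 16-bit weights, and 3.35 TB/s bandwidth (roughly an H100 SXM) are illustrative assumptions, not figures from this post.

```python
def max_decode_tokens_per_second(n_params: float,
                                 bytes_per_param: float,
                                 mem_bandwidth_bytes_per_s: float) -> float:
    """Upper bound on batch-1 decode speed when weight streaming dominates.

    Assumes every weight is read once per generated token and ignores
    KV cache reads, kernel overheads, and multi-chip sharding.
    """
    model_bytes = n_params * bytes_per_param
    return mem_bandwidth_bytes_per_s / model_bytes


# Illustrative assumptions: 70B parameters, 2 bytes each, ~3.35 TB/s bandwidth.
bound = max_decode_tokens_per_second(70e9, 2, 3.35e12)
print(f"at most ~{bound:.1f} tokens/s for a single sequence")
```

Under these assumptions, extra FLOPS do not raise the ceiling. Only more bandwidth, fewer bytes per weight, or more tokens produced per weight read can.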

Architectural Divergence: GPU, LPU, and Wafers

GPU: Flexible but Memory-Limited

GPUs like NVIDIA’s H100 remain the dominant choice, leveraging High Bandwidth Memory and mature ecosystems. An H100 pairs compute with HBM3, but even “high bandwidth” memory creates bottlenecks during sequential token generation. The architecture excels at parallel operations but wasn’t designed for the specific memory access patterns of autoregressive inference.

This explains why techniques like batching multiple requests work well on GPUs. They increase compute utilization by processing many sequences in parallel, hiding some of the memory latency. However, batching increases latency for individual requests, creating tension between throughput and responsiveness.
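A simple roofline-style model shows why batching helps. This is a rough sketch, assuming about 2 FLOPs per parameter per token and that the weights are read once per decode step regardless of batch size. The accelerator numbers (1e15 FLOP/s, 3.35 TB/s) and the function names are hypothetical.

```python
def decode_step_seconds(batch_size: int, n_params: float, bytes_per_param: float,
                        mem_bandwidth: float, peak_flops: float) -> float:
    """Time for one decode step under a two-term roofline model."""
    weight_read_time = n_params * bytes_per_param / mem_bandwidth  # same for any batch
    compute_time = 2 * n_params * batch_size / peak_flops          # grows with batch
    return max(weight_read_time, compute_time)


def critical_batch_size(bytes_per_param: float, mem_bandwidth: float,
                        peak_flops: float) -> float:
    """Batch size at which the step flips from memory-bound to compute-bound."""
    return peak_flops * bytes_per_param / (2 * mem_bandwidth)


# Hypothetical accelerator and model, for illustration only.
N_PARAMS, BYTES, BANDWIDTH, FLOPS = 70e9, 2, 3.35e12, 1e15
for batch in (1, 8, 64, 512):
    step = decode_step_seconds(batch, N_PARAMS, BYTES, BANDWIDTH, FLOPS)
    print(f"batch={batch:4d}  {step * 1e3:6.1f} ms/step  {batch / step:8.0f} tokens/s total")
print("critical batch ~", round(critical_batch_size(BYTES, BANDWIDTH, FLOPS)))
```

In this simplified model the step time stays flat until the critical batch size, so total throughput scales almost for free. The per-request latency cost mentioned above comes from effects the sketch leaves out, such as KV cache reads that grow with batch size and time spent waiting for a batch to fill.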

LPU: Deterministic and SRAM-Based

Groq’s Language Processing Unit represents a radical departure. Rather than relying on off-chip HBM, the LPU integrates hundreds of MB of SRAM directly on the chip, positioned near compute cores. This architecture enables what Groq terms “deterministic” computing—the compiler can predict exactly when data will arrive, eliminating the unpredictability that hampers GPU performance during sequential generation.

The LPU uses a software-first design where the compiler performs static scheduling before any silicon runs. This approach trades flexibility for predictability and speed. Groq’s architecture uses on-chip SRAM as the primary store for model weights rather than as a cache in front of slower memory, which cuts latency and enables efficient tensor parallelism across chips for fast, scalable inference. The result is impressive: independent benchmarking shows throughput exceeding 241 tokens per second, significantly faster than GPU-based alternatives for certain workloads.

The trade-off? Groq requires hundreds of chips clustered together to run large models, since each LPU has limited on-chip memory. The economics work because the per-token energy cost is substantially lower than GPUs, despite higher initial capital expenditure.
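The chip count follows directly from dividing model size by per-chip SRAM. The sketch below is a rough estimate only: the 230 MB of SRAM per chip is an assumed value in the "hundreds of MB" range, the 70B-parameter model is illustrative, and KV cache and activation memory are ignored.

```python
def chips_needed(n_params: int, bytes_per_param: int, sram_bytes_per_chip: int) -> int:
    """Minimum number of chips whose combined SRAM can hold the weights."""
    model_bytes = n_params * bytes_per_param
    return -(-model_bytes // sram_bytes_per_chip)  # integer ceiling division


# Assumed values for illustration: 70B parameters, 230 MB SRAM per chip.
SRAM_PER_CHIP = 230_000_000
print("8-bit weights: ", chips_needed(70_000_000_000, 1, SRAM_PER_CHIP), "chips")
print("16-bit weights:", chips_needed(70_000_000_000, 2, SRAM_PER_CHIP), "chips")
```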

Wafer-Scale: Eliminating Inter-Chip Communication

Cerebras takes yet another approach with its Wafer Scale Engine, which doesn’t cut a silicon wafer into individual chips but instead uses the entire wafer as a single processor. The WSE-3 contains 4 trillion transistors, 900,000 AI cores, 44GB of on-chip SRAM, and delivers 125 PetaFLOPS of AI performance.

By keeping everything on one wafer, Cerebras eliminates the networking overhead that plagues GPU clusters. The WSE-3 achieves 21 petabytes per second of memory bandwidth, and because compute cores and communication fabric sit on the same wafer-scale chip, data movement latency is drastically lower. For inference specifically, Cerebras achieves speeds above 1,800 output tokens per second on Llama 3.1 8B and above 446 tokens per second on Llama 3.1 70B.

The wafer-scale approach particularly benefits long-context models where massive amounts of data must move between compute and memory. Traditional GPU setups spend significant energy and time on inter-GPU communication, but Cerebras bypasses this entirely.

How Architecture Influences Model Design

These architectural differences should inform how you design and optimize models for deployment. The hardware isn’t a passive substrate. It actively shapes what performs well (and what doesn’t).

KV Cache: The Hidden Architectural Constraint

The key-value cache has become central to inference optimization, but optimal strategies vary dramatically by hardware. During autoregressive generation, we cache the keys and values from attention computations for previously generated tokens, avoiding redundant calculations. However, this cache grows linearly with sequence length.
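The linear growth is easy to quantify. The sketch below computes the cache size from the usual formula: two tensors (keys and values) per layer, each with one vector per KV head per token. The configuration (32 layers, 8 KV heads, head dimension 128, 16-bit values) is an assumption chosen to resemble Llama-3.1-8B.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int = 1,
                   bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for every layer and token."""
    values_per_token = 2 * n_layers * n_kv_heads * head_dim  # 2 = keys + values
    return values_per_token * seq_len * batch_size * bytes_per_value


# Assumed Llama-3.1-8B-like configuration, 16-bit cache entries.
for context in (8_192, 32_768, 131_072):
    size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=context)
    print(f"{context:>7} tokens -> {size / 2**30:5.1f} GiB per sequence")
```

Under these assumptions a single 128k-token sequence needs about as much memory for its cache as the 16-bit weights of the 8B model itself.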

On GPUs with limited HBM, KV cache size directly limits how many sequences can be processed concurrently and how long contexts can be. This has driven innovations in KV cache compression: techniques like TailorKV enable serving Llama-3.1-8B with 128k context within a single RTX 3090 GPU, reaching 82 ms per token during decoding. Other approaches include quantizing the cache to lower precision, selectively evicting less important tokens, or using attention sink patterns to retain only critical information.
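As one illustration of the eviction idea, here is a minimal sketch of an attention-sink retention policy: keep the first few tokens plus a sliding window of the most recent ones. This is not TailorKV's method, and the `n_sink` and `window` parameters are hypothetical names chosen for the example.

```python
from typing import List


def retained_positions(seq_len: int, n_sink: int, window: int) -> List[int]:
    """Token positions kept in the KV cache under a sink + recent-window policy."""
    if seq_len <= n_sink + window:
        return list(range(seq_len))           # nothing to evict yet
    sinks = list(range(n_sink))               # earliest tokens act as attention sinks
    recent = list(range(seq_len - window, seq_len))
    return sinks + recent


# With 4 sink tokens and a 1,020-token window the cache never exceeds 1,024 entries.
kept = retained_positions(seq_len=100_000, n_sink=4, window=1_020)
print(len(kept), kept[:5], kept[-2:])
```

The cache size becomes constant in sequence length, at the price of discarding everything between the sinks and the window.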

On architectures like Cerebras with massive on-chip memory, the calculus changes. The 44GB of SRAM on WSE-3 means KV cache size becomes less of a constraint, potentially allowing models to maintain full precision throughout longer contexts without compression.

The practical implication: if designing models for GPU deployment, architect for efficient KV cache usage from the start. Consider alternatives like Grouped Query Attention (GQA), which shares each key-value head across a group of query heads. With 32 query heads and 8 key-value heads, for example, the cache is 4x smaller than with vanilla multi-head self-attention, which also reduces the memory traffic per decoded token and therefore inference time.
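The sharing scheme in GQA comes down to an index mapping from query heads to KV heads. The sketch below shows that mapping and the resulting cache reduction; the function names are illustrative, and the 32/8 head split is an assumed example configuration.

```python
def kv_head_for_query_head(q_head: int, n_q_heads: int, n_kv_heads: int) -> int:
    """Index of the KV head that a given query head reads from under GQA."""
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be a multiple of n_kv_heads")
    group_size = n_q_heads // n_kv_heads      # query heads sharing one KV head
    return q_head // group_size


def kv_cache_reduction(n_q_heads: int, n_kv_heads: int) -> float:
    """How many times smaller the KV cache is than with one KV head per query head."""
    return n_q_heads / n_kv_heads


mapping = [kv_head_for_query_head(q, n_q_heads=32, n_kv_heads=8) for q in range(32)]
print(mapping[:8])                    # first two groups of query heads
print(kv_cache_reduction(32, 8))      # cache shrink factor
```

Setting `n_kv_heads` equal to `n_q_heads` recovers standard multi-head attention, and setting it to 1 gives multi-query attention.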

Check out this implementation guide to take a closer look: “Understanding and Coding the KV Cache in LLMs from Scratch” by Sebastian Raschka, PhD (Ahead of AI). It explains how KV caches work conceptually and in code with a from-scratch, human-readable implementation.

Speculative Decoding: Hardware-Dependent Acceleration

Speculative decoding has emerged as a powerful inference optimization, but its effectiveness varies significantly across chip architectures. The technique uses a smaller draft model to predict multiple tokens quickly, then verifies these predictions with the larger target model in a single forward pass. When predictions are correct, you generate multiple tokens for the cost of one verification step.
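The draft-then-verify loop can be sketched with toy models. This is a minimal greedy variant, assuming each "model" is just a deterministic function from a token list to the next token; real systems verify against probability distributions and score all drafted positions in one batched forward pass, which the inner loop here only emulates. All names are illustrative.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def greedy_decode(model: NextToken, prompt: List[int], n_new: int) -> List[int]:
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       n_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns (tokens, number of target passes)."""
    tokens = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(tokens) < goal:
        # 1. The cheap draft model proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # 2. One target pass checks all k positions plus one extra token.
        target_passes += 1
        accepted: List[int] = []
        for i in range(k + 1):
            expected = target(tokens + accepted)
            accepted.append(expected)          # always the target's own choice
            if i == k or proposal[i] != expected:
                break                          # first mismatch ends the round
        tokens.extend(accepted)
    return tokens[:goal], target_passes


# Toy models: the draft disagrees with the target whenever the last token is 4.
def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


spec_tokens, passes = speculative_decode(toy_target, toy_draft, [2], n_new=20, k=4)
print(spec_tokens == greedy_decode(toy_target, [2], 20), passes)
```

Because every accepted token is the one the target itself would have picked, the output matches plain greedy decoding exactly; the only thing that changes is how many target passes it takes.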

Because the target model verifies every drafted token, speculative decoding reduces latency while preserving output quality. The acceleration depends critically on the acceptance rate (how often the draft model’s predictions match what the target model would generate) and the relative speeds of draft and target models.
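The dependence on acceptance rate and relative model speed can be made quantitative with the standard analysis from the speculative decoding literature. The sketch below assumes each drafted token is accepted independently with the same probability, which is a simplification; `draft_cost_ratio` is the time of one draft step divided by the time of one target step.

```python
def expected_tokens_per_verification(acceptance_rate: float, draft_len: int) -> float:
    """Expected tokens produced per target pass, assuming i.i.d. acceptance."""
    a = acceptance_rate
    if a >= 1.0:
        return float(draft_len + 1)            # all drafts accepted plus a bonus token
    return (1 - a ** (draft_len + 1)) / (1 - a)


def estimated_speedup(acceptance_rate: float, draft_len: int,
                      draft_cost_ratio: float) -> float:
    """Rough wall-clock speedup over plain autoregressive decoding."""
    tokens = expected_tokens_per_verification(acceptance_rate, draft_len)
    round_cost = draft_len * draft_cost_ratio + 1   # k draft steps + 1 target step
    return tokens / round_cost


for rate in (0.5, 0.7, 0.9):
    print(rate, round(estimated_speedup(rate, draft_len=4, draft_cost_ratio=0.05), 2))
```

The model also shows the failure mode: with a low acceptance rate or an expensive draft model, the estimate drops below 1 and speculation makes decoding slower.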

Different hardware architectures favor different speculative decoding strategies. On GPUs, the memory-bound nature of inference means the target model verification step doesn’t cost much more than single-token generation, making speculative decoding highly effective at low batch sizes. However, the draft and target models share VRAM, and both maintain separate KV caches, limiting scalability.

Groq’s deterministic architecture enables particularly efficient speculative decoding. The LPU can verify batches of speculative tokens efficiently using pipeline parallelism, allowing multiple tokens to be accepted per pipeline stage. The predictable timing and high memory bandwidth mean draft model generation and target model verification can be tightly coordinated.

Recent advances like EAGLE-3 eliminate the need for a separate draft model entirely, instead using lightweight prediction heads attached to the target model’s internal layers. This approach reduces memory pressure and simplifies deployment, though it requires model-specific training.

For researchers working on new architectures, consider: does your design facilitate multi-token prediction? Can internal representations be exposed for draft generation? These questions become architectural concerns, not just algorithmic ones.

Model Architecture and Hardware Co-Design

The interaction between model architecture and chip design is becoming increasingly important. Some architectural choices that seem theoretically equivalent perform very differently across hardware.

Autoregressive vs. parallel generation: Traditional transformer architectures generate tokens sequentially, which maps poorly to massively parallel hardware. Explorations in parallel generation (like diffusion models for discrete sequences or non-autoregressive approaches) could leverage parallel hardware more effectively, but require fundamental rethinking of model design.

Attention patterns and memory access: The attention mechanism’s memory access patterns significantly impact performance. Sparse attention patterns that reduce the quadratic complexity also change memory access patterns in ways that can help or hurt depending on hardware. Flash Attention and its variants optimize attention for specific memory hierarchies. These optimizations are deeply tied to GPU cache architecture and won’t necessarily transfer to other chips (though that’s another conversation!).
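The numerical trick that makes block-wise attention kernels possible is the online softmax: scores are processed one block at a time while a running maximum and running sum are rescaled as needed. The sketch below shows it in plain Python for a single query with scalar values. It illustrates the idea only and is not the Flash Attention kernel itself.

```python
import math
from typing import List


def attention_full(scores: List[float], values: List[float]) -> float:
    """Reference: softmax over all scores at once, then a weighted sum."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def attention_blockwise(scores: List[float], values: List[float],
                        block_size: int) -> float:
    """Same result, but only one block of scores is 'in fast memory' at a time."""
    running_max = float("-inf")
    running_sum = 0.0   # softmax denominator, relative to running_max
    running_out = 0.0   # unnormalized output, relative to running_max
    for start in range(0, len(scores), block_size):
        s_blk = scores[start:start + block_size]
        v_blk = values[start:start + block_size]
        new_max = max(running_max, max(s_blk))
        rescale = math.exp(running_max - new_max)   # 0.0 on the first block
        running_sum *= rescale
        running_out *= rescale
        for s, v in zip(s_blk, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            running_out += w * v
        running_max = new_max
    return running_out / running_sum


scores = [0.3, -1.2, 2.5, 0.0, 4.1]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
print(attention_full(scores, values), attention_blockwise(scores, values, block_size=2))
```

The best block size depends on how much fast memory sits next to the compute units, which is why such kernels are tuned per memory hierarchy.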

Quantization and numeric precision: Different chips support different numeric formats. While GPUs increasingly standardize on FP16/BF16 for training and INT8 for inference, Groq uses TruePoint numerics, which reduce precision only where doing so does not reduce accuracy, keeping 100 bits of intermediate accumulation while storing weights and activations at lower precision. Designing quantization schemes that work well across multiple chip types requires understanding these numeric capabilities.
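For reference, here is a minimal sketch of generic symmetric per-tensor INT8 quantization. It is a textbook scheme and not a description of TruePoint or of any vendor's format; the weights are made-up values.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.003, 1.0]          # made-up example values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
worst_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, worst_error)
```

The rounding error per value is bounded by half the scale, so a single outlier weight widens the scale and costs every other value precision. Hardware that keeps accumulations in a wider format limits how much these per-value errors compound across a long dot product.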

Mixture of Experts (MoE): MoE architectures reduce computational cost by activating only subsets of parameters per token. However, their efficiency depends heavily on how hardware handles dynamic routing. GPUs with flexible scheduling can adapt, but more specialized chips may struggle with the unpredictable memory access patterns MoE creates. Conversely, if your hardware can efficiently swap expert weights in and out of fast memory, MoE becomes more attractive.
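The dynamic routing at issue can be sketched in a few lines. This is one common variant, assuming a softmax over only the top-k router logits; the logits below are made up and the function name is illustrative.

```python
import math
from typing import List, Tuple


def route_top_k(router_logits: List[float], k: int) -> List[Tuple[int, float]]:
    """Pick the k highest-scoring experts and normalize their gate weights."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    chosen = order[:k]
    top = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - top) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


# Two tokens with different (made-up) router logits over 4 experts.
token_a = route_top_k([0.1, 2.0, -1.0, 1.0], k=2)
token_b = route_top_k([1.5, -0.3, 2.2, 0.0], k=2)
print([expert for expert, _ in token_a])
print([expert for expert, _ in token_b])
```

Which expert weights are needed is only known after the router runs on each token, which is the property that makes MoE awkward for hardware that relies on a fully static schedule.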

Practical Implications for Model Development

If you’re developing models with deployment in mind, these hardware considerations should influence architectural decisions early:

Design for your target deployment hardware: A model optimized for H100 clusters may not perform optimally on Groq’s LPUs or Cerebras systems. Consider the memory hierarchy, numeric precision support, and parallelism characteristics of your target platform.

Make KV cache optimization a first-class concern: Don’t treat KV cache as an afterthought. Architectural choices like GQA, attention sink patterns, or carefully designed context window strategies should be evaluated during model development, not just at deployment.

Consider speculative decoding from the start: If your deployment will use speculative decoding, design models that facilitate it. This might mean ensuring consistent tokenization across model sizes, architecting for easy extraction of intermediate representations, or training model families with compatible distributions.

Think beyond transformers: While transformers dominate current AI, their architectural assumptions (e.g. sequential generation, quadratic attention complexity) create fundamental tensions with hardware. State space models like Mamba offer different computational characteristics that may map better to certain hardware architectures. Hybrid approaches that combine transformer-like and alternative mechanisms might better exploit hardware capabilities, at least right now.

Profile early and often: Don’t assume performance characteristics. Profile models on target hardware early in development. The bottleneck might not be where you expect. Memory bandwidth often dominates over compute, and seemingly minor architectural changes can have outsized effects.
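For the profiling recommendation, a minimal timing harness might look like the following. It is a sketch, assuming `generate_one_token` is a hypothetical callable that blocks until one decode step has finished; the stand-in workload is arbitrary CPU work, not a model.

```python
import time
from typing import Callable, Dict


def profile_decode(generate_one_token: Callable[[], object],
                   n_tokens: int, n_warmup: int = 3) -> Dict[str, float]:
    """Measure per-token latency and throughput of a blocking decode step."""
    for _ in range(n_warmup):                 # warm caches, lazy init, JIT, etc.
        generate_one_token()
    latencies = []
    for _ in range(n_tokens):
        start = time.perf_counter()
        generate_one_token()
        latencies.append(time.perf_counter() - start)
    latencies.sort()
    p95_index = min(len(latencies) - 1, int(0.95 * len(latencies)))
    return {
        "tokens_per_second": n_tokens / sum(latencies),
        "median_ms": latencies[len(latencies) // 2] * 1e3,
        "p95_ms": latencies[p95_index] * 1e3,
    }


# Stand-in workload for illustration; replace with a real decode step.
stats = profile_decode(lambda: sum(range(10_000)), n_tokens=50)
print(stats)
```

On accelerators that execute asynchronously, the callable has to wait for the device to finish before returning, otherwise the harness times only the kernel launch. Reporting a tail percentile alongside the median helps expose stalls that an average hides.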

The Evolving Landscape

The inference chip market is evolving rapidly, with new entrants and approaches emerging regularly. IBM and Groq have partnered to integrate GroqCloud into IBM watsonx Orchestrate, combining Groq’s inference speed with IBM’s agentic AI capabilities. Cerebras continues expanding its data center footprint to increase inference capacity. Traditional GPU vendors are optimizing for inference workloads.

This competition is healthy for the field. As hardware diversifies, we’ll see model architectures evolve to exploit different computational paradigms. The tight coupling between algorithm and hardware that characterized earlier computing eras is returning, though at a much higher level of abstraction.

For researchers and practitioners, the key insight is this: hardware isn’t just a deployment concern. The chip architecture you target should inform model design decisions from the beginning. As specialized inference chips proliferate, understanding their strengths and limitations becomes essential for building practical, efficient AI systems.

As always, all opinions are my own and do not reflect those of any employer or funding agency.

Directed Research
