Artificial Intelligence · March 28, 2026

Cerebras CS-3 Lands on AWS: Inside the Fastest Cloud AI Inference

AI inference – the process of actually running a trained model to generate useful output – has quietly become the most expensive and performance-critical bottleneck in modern AI deployments. Training a frontier model grabs headlines, but serving that model to millions of users at interactive speeds is where the real engineering challenge lives. On March 13, 2026, AWS and Cerebras Systems announced a multi-year collaboration that directly attacks this problem: deploying Cerebras CS-3 wafer-scale systems inside AWS data centers, paired with AWS Trainium chips, and accessible through Amazon Bedrock.

The core promise is an order of magnitude faster inference for generative AI workloads. Rather than forcing a single type of processor to handle every phase of inference, the architecture splits the work across two radically different pieces of silicon – each optimized for the specific computational demands it faces. It is the first time a major hyperscaler has hosted Cerebras hardware in its own infrastructure, and it signals a fundamental shift in how cloud-scale AI inference may be architected going forward.

Why Inference Needs a New Architecture

Modern large language model inference consists of two distinct phases. The first, called prefill, processes the entire input prompt in parallel. It is compute-intensive, requires substantial memory capacity, and benefits from massive parallelism. The second phase, decode, generates output tokens one at a time in sequence. Decode is computationally light but brutally memory-bandwidth-bound, since each new token depends on the one before it.
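The two-phase split can be sketched in a few lines of toy Python. Everything here (`toy_forward`, the dummy token rule) is an illustrative stand-in, not any real framework's API; the point is only the shape of the computation — one parallel pass over the prompt, then a strictly sequential generation loop.

```python
# Toy sketch of the two inference phases. toy_forward stands in for a
# transformer forward pass; the arithmetic is a dummy, not a real model.

def toy_forward(tokens, cache):
    """Stand-in forward pass: returns a next-token id and an updated KV cache."""
    cache = cache + tokens          # pretend the KV cache grows with input
    next_token = sum(tokens) % 100  # dummy next-token prediction
    return next_token, cache

def prefill(prompt_tokens):
    # Prefill: the whole prompt is processed in one parallel pass.
    # Compute-heavy, but it happens only once per request.
    return toy_forward(prompt_tokens, cache=[])

def decode(first_token, cache, max_new_tokens):
    # Decode: tokens come out strictly one at a time, and each step
    # re-reads the model weights -- this is the bandwidth-bound loop.
    out = [first_token]
    for _ in range(max_new_tokens - 1):
        tok, cache = toy_forward([out[-1]], cache)
        out.append(tok)
    return out

first, cache = prefill([5, 17, 3])
tokens = decode(first, cache, max_new_tokens=4)
print(tokens)  # four generated token ids
```

Note that nothing in the decode loop can be parallelized across steps: token *n* is an input to token *n+1*, which is exactly why decode speed is gated by how fast weights can be re-read each step.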

Here is the problem: these two phases have opposite hardware requirements. A chip optimized for the parallel, compute-heavy prefill phase is poorly suited for the sequential, bandwidth-hungry decode phase – and vice versa. Traditional GPU-centric deployments force a single accelerator to handle both, resulting in stranded capacity and suboptimal performance at each stage. As David Brown, AWS Vice President of Compute and ML Services, put it: “by splitting the inference workload across Trainium and CS-3, and connecting them with Amazon’s Elastic Fabric Adapter, each system does what it’s best at.”

This matters more now than ever. Reasoning models that “think” through problems by generating long chains of tokens have become a majority of inference compute. Agentic AI applications – coding assistants, multi-step research tools, conversational agents – produce decode-heavy token profiles where output generation speed directly constrains user productivity.

How Disaggregated Inference Works

The AWS-Cerebras solution implements what both companies call “inference disaggregation.” The input prompt flows first to AWS Trainium3-powered servers, which handle the prefill phase. Once prefill is complete, the intermediate state transfers over high-speed RDMA networking via AWS Elastic Fabric Adapter to Cerebras CS-3 systems, which take over for the decode phase and generate output tokens at extreme speed.
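The control flow described above can be sketched as follows. The class and function names (`PrefillBackend`, `DecodeBackend`, `transfer`) are hypothetical stand-ins for Trainium, the EFA RDMA hop, and the CS-3 — this is a shape-of-the-pipeline sketch, not AWS's implementation.

```python
# Minimal sketch of disaggregated inference: prefill on one backend,
# a state handoff, decode on another. All names are illustrative
# stand-ins for Trainium3, EFA RDMA, and the CS-3.

from dataclasses import dataclass

@dataclass
class KVState:
    """Intermediate state handed off after prefill (conceptually, the KV cache)."""
    tokens: list

class PrefillBackend:
    # Stand-in for a Trainium3 UltraServer: parallel prompt processing.
    def prefill(self, prompt_tokens):
        return KVState(tokens=list(prompt_tokens))

class DecodeBackend:
    # Stand-in for a CS-3: fast sequential token generation.
    def decode(self, state, max_new_tokens):
        out, last = [], state.tokens[-1]
        for _ in range(max_new_tokens):
            last = (last * 31 + 7) % 1000  # dummy next-token rule
            out.append(last)
        return out

def transfer(state):
    # Stand-in for the RDMA hop over Elastic Fabric Adapter. In the real
    # system this moves the KV cache between machines; here it is a no-op.
    return state

prefiller, decoder = PrefillBackend(), DecodeBackend()
state = transfer(prefiller.prefill([12, 40, 9]))
result = decoder.decode(state, max_new_tokens=3)
print(result)
```

The one structural requirement the sketch makes visible: the handoff happens exactly once per request, after prefill completes, so the interconnect cost is paid once rather than per token.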

| Inference Phase | Hardware | Key Specs | Computational Profile |
| --- | --- | --- | --- |
| Prefill (input processing) | AWS Trainium3 UltraServer | 144 GB HBM, 2.5 PFLOPS compute | Parallel, compute-intensive, memory-capacity-heavy |
| Decode (token generation) | Cerebras CS-3 (WSE-3) | 900,000 AI cores, 44 GB on-chip SRAM, 21 PB/s bandwidth | Sequential, bandwidth-bound, computationally light |
| Interconnect | AWS Elastic Fabric Adapter (EFA) | High-speed RDMA | Low-latency transfer between prefill and decode stages |

The entire stack runs under the AWS Nitro System, ensuring that CS-3 systems and Trainium instances operate with the same security, isolation, and operational consistency that AWS customers expect. This is not a bolt-on or a separate environment – it is native AWS infrastructure.

The Wafer-Scale Advantage for Decode

The Cerebras CS-3 is powered by the Wafer-Scale Engine 3 (WSE-3), a processor roughly the size of a dinner plate containing 4 trillion transistors and 900,000 AI-optimized cores. Its defining characteristic for inference is 44 GB of on-chip SRAM delivering 21 petabytes per second of memory bandwidth – thousands of times greater than the fastest GPU.

Why does this matter for decode? Because generating each output token requires reading model weights from memory. On a GPU, those weights sit in HBM (high-bandwidth memory), which is fast but nowhere near SRAM speeds. The CS-3 stores all model weights directly on-chip, eliminating the off-chip memory bottleneck entirely. For sequential decode operations where each token must wait for the previous one, this bandwidth advantage translates directly into faster token generation.
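The bandwidth argument can be made concrete with back-of-envelope roofline arithmetic. The 21 PB/s figure is the CS-3 spec quoted in this article; the 70B-parameter model size and the ~3.3 TB/s GPU HBM figure are illustrative assumptions, not from the announcement.

```python
# Decode roofline: each generated token requires reading every weight
# once, so tokens/sec <= memory_bandwidth / model_bytes.
# The model size and HBM bandwidth below are illustrative assumptions;
# the 21 PB/s number is the WSE-3 spec cited in the article.

PARAMS = 70e9                            # assumed model size (parameters)
BYTES_PER_PARAM = 2                      # 16-bit weights
model_bytes = PARAMS * BYTES_PER_PARAM   # 140 GB of weights

hbm_bw = 3.3e12    # ~3.3 TB/s, representative modern GPU HBM (assumed)
sram_bw = 21e15    # 21 PB/s, WSE-3 on-chip SRAM

gpu_ceiling = hbm_bw / model_bytes   # weight-read-limited tokens/sec
wse_ceiling = sram_bw / model_bytes

print(f"GPU decode ceiling:   {gpu_ceiling:.0f} tok/s")
print(f"WSE-3 decode ceiling: {wse_ceiling:.0f} tok/s")
```

One caveat the arithmetic exposes: 140 GB of weights exceeds the 44 GB of SRAM on a single wafer, so in practice a model this size would be sharded across wafers or use external memory. Treat the numbers as bandwidth-only ceilings, not delivered throughput.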

The WSE-3 is 56 times larger than the largest GPU. That scale means a single chip replaces what would otherwise require clusters of hundreds of GPUs stitched together with complex interconnects, each hop introducing latency and coordination overhead. The CS-3 itself is a water-cooled appliance roughly the size of a mini-fridge, incorporating the WSE-3 along with external memory and networking.

Trainium3 Handles the Heavy Lifting Up Front

AWS Trainium3 brings complementary strengths to the prefill phase. With 144 GB of HBM capacity and 2.5 PFLOPS of compute throughput, it is purpose-built for the parallel, memory-capacity-intensive work of processing large input prompts. Two of the world’s leading AI labs – Anthropic and OpenAI – are committed to Trainium. Anthropic has named AWS its primary training partner, while OpenAI will consume 2 gigawatts of Trainium capacity through AWS infrastructure for frontier models and its Stateful Runtime Environment.

By offloading prefill entirely to Trainium, the CS-3 can dedicate its full compute and memory resources to decode. This eliminates a historical limitation of standalone Cerebras deployments, where keeping the wafer-scale engine continuously fed with data required substantial supporting memory systems. The disaggregated approach resolves that stranded capacity problem, with Cerebras indicating the system can achieve 5x higher throughput by offloading prefill to Trainium compared to handling both phases on one processor type.
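The stranded-capacity argument reduces to simple pipeline arithmetic. The per-request phase times below are illustrative assumptions; the point is the structural gain from dedicating a stage to each phase.

```python
# Why disaggregation raises throughput: one device running both phases
# back-to-back yields 1/(P+D) requests/sec, while dedicated prefill and
# decode stages working as a pipeline reach 1/max(P,D) at steady state.
# The phase times here are illustrative assumptions.

P = 0.8   # seconds of prefill per request (assumed)
D = 1.0   # seconds of decode per request (assumed)

serial = 1 / (P + D)        # one device does both phases in sequence
pipelined = 1 / max(P, D)   # each stage serves a different request
speedup = pipelined / serial

print(f"serial: {serial:.2f} req/s, pipelined: {pipelined:.2f} req/s")
print(f"speedup: {speedup:.2f}x")
```

Pipelining alone cannot explain a 5x figure; the rest of the claimed gain would come from each phase running on hardware suited to it, which this sketch does not model.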

Performance Claims and What We Know So Far

The headline claim is inference that is “an order of magnitude faster” than what is available today. Beyond that headline figure, however, specifics remain thin on the ground:

Independent benchmarks have not yet been published. AWS has not disclosed pricing for the Cerebras-enhanced inference tier. These are critical unknowns – enterprise customers evaluating a switch from Nvidia GPU-based deployments will need hard performance data and clear cost-per-token economics before committing. The service is expected to launch within a couple of months of the March 2026 announcement, with full support for leading open-source LLMs and Amazon Nova models coming later in 2026.

Who Is Already Using Cerebras – and Why It Matters

OpenAI, Cognition, and Mistral already use Cerebras hardware for their most demanding inference workloads, particularly agentic coding applications where developer productivity is directly constrained by how fast the model can generate output. These are not toy benchmarks – these are production deployments at some of the most performance-sensitive AI labs in the world.

The AWS partnership extends this capability from purpose-built on-premises installations to hyperscale cloud. Any enterprise with an AWS account will be able to access Cerebras-accelerated inference through Amazon Bedrock, using existing procurement, billing, and governance workflows. Cerebras is available through the AWS Marketplace, meaning no separate vendor relationship is required.

For Cerebras, which has raised over $700 million but struggled to convert technical performance into broad market share against Nvidia’s dominance, the AWS distribution channel is transformative. It instantly makes wafer-scale inference accessible to millions of developers and thousands of enterprises worldwide.

Strategic Implications for the AI Chip Landscape

This collaboration challenges the GPU monoculture that has defined AI infrastructure for the past decade. Rather than scaling up clusters of identical GPUs, AWS is assembling a portfolio of specialized silicon – Trainium for compute-heavy parallel work, Cerebras for bandwidth-bound sequential work, Graviton CPUs for orchestration and control-plane services – connected by Nitro networking into a coherent system.

The competitive dynamics are shifting rapidly. Microsoft Azure has leveraged its OpenAI partnership into rapid AI revenue growth. Google Cloud is pushing its own TPU infrastructure. Groq is winning inference customers with specialized chips. Against this backdrop, AWS needed a differentiated inference story – and disaggregated Trainium-plus-Cerebras architecture provides exactly that.

Analyst perspectives are cautiously optimistic. The Futurum Group views this as the strongest confirmation that inference architecture will enter a phase of deliberate disaggregation in 2026, with workload-specific silicon replacing monolithic GPU deployments for latency-sensitive tasks. Coding agents are seen as the initial growth wedge, given their prompt-heavy prefill demands and decode-intensive output generation. However, questions remain about whether this architectural model can sustain its advantages as model architectures evolve and competing hyperscalers respond.

Practical Guidance for Teams Evaluating This Technology

For organizations considering the AWS-Cerebras inference solution, several practical considerations emerge from the available details:

  1. Access through Bedrock, not raw instances. The CS-3 is not provisioned like an EC2 GPU instance. It is a managed service delivered through Amazon Bedrock, requiring no new instance types or separate APIs.
  2. Match workloads to the architecture. Disaggregated inference delivers its greatest gains for workloads with large input prompts and long output sequences – exactly the profile of agentic coding, multi-step reasoning, and conversational AI applications.
  3. Plan for both modes. Most production environments will benefit from both disaggregated inference (for stable, large workloads) and aggregated inference (for variable prefill-to-decode ratios). Bedrock is expected to enable dynamic routing between modes.
  4. Wait for benchmarks before migrating. No independent performance data or pricing has been published yet. Evaluate against your current GPU-based deployment costs once the service goes live.
  5. Monitor the model roadmap. AWS plans to support leading open-source LLMs and Amazon Nova on Cerebras hardware later in 2026. The initial launch may have a limited model selection.
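Point 3's dual-mode planning could be approximated client-side with a heuristic like the one below. The mode names and token thresholds are illustrative assumptions, not a documented Bedrock feature or API.

```python
# Hypothetical client-side routing heuristic for dual-mode planning:
# send large, stable prompt/output profiles to the disaggregated path
# and short or bursty requests to the aggregated path. The thresholds
# and mode names are assumptions, not a documented Bedrock API.

def choose_mode(prompt_tokens: int, expected_output_tokens: int) -> str:
    large_prefill = prompt_tokens >= 4096         # assumed cutoff
    long_decode = expected_output_tokens >= 1024  # assumed cutoff
    if large_prefill and long_decode:
        return "disaggregated"   # agentic coding, long reasoning chains
    return "aggregated"          # short, chatty, or variable requests

print(choose_mode(16000, 4000))  # long coding-agent request
print(choose_mode(300, 80))      # short chat turn
```

If Bedrock ends up routing between modes automatically, a heuristic like this becomes unnecessary; until then it is a reasonable way to segment traffic for evaluation.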

Conclusion: A Defining Moment for Cloud AI Infrastructure

The deployment of Cerebras CS-3 systems in AWS data centers represents more than a new product announcement. It is an architectural statement: the era of one-size-fits-all GPU inference is giving way to purpose-built silicon pairings that match hardware characteristics to workload demands. Trainium’s 2.5 PFLOPS and 144 GB HBM handle the parallel prefill phase. The CS-3’s 21 PB/s on-chip SRAM bandwidth dominates the sequential decode phase. EFA networking stitches them together under Nitro security.

The claimed order-of-magnitude performance improvement, if validated by independent benchmarks, could reshape inference economics for every enterprise running AI at scale. For now, the technology is validated by production use at OpenAI, Cognition, and Mistral. Broader adoption hinges on pricing, benchmark transparency, and the breadth of supported models when the service launches through Amazon Bedrock in the coming months. What is already clear is that the inference bottleneck – the single biggest constraint on real-world AI value – now has a fundamentally new solution architecture competing for dominance in the cloud.
