Every AI infrastructure conversation starts with the same object of worship: the GPU.

How many FLOPS? Which NVIDIA generation? How many racks? How many megawatts?

But in the real machine the GPU is often not the bottleneck. Memory is.

More precisely: how quickly data can move from where it lives to where computation needs it.

A modern GPU can perform trillions of operations per second. But if the right data is not at the right place at the right moment, the chip sits idle. Burning power. Doing nothing.

This is the memory wall.

And it becomes more important in a world of a billion AI agents.

An agentic workload looks structurally different. An AI agent reasons across a long task and calls tools such as search, code execution, browser automation, and database queries. Those tools run on CPUs, so more CPUs are needed, not just GPUs. And every CPU needs memory too.

And the market has noticed.

Tessara’s Memory Regime Score reads 88, at all-time highs

Our Memory Regime Score, a macro indicator that tracks how constrained the AI memory market is, reads 88, an all-time high.

The signal is showing up everywhere at once. DRAM contract prices nearly doubled in Q1 2026. NAND flash jumped over 50% in the same quarter.

The reason is that the big three memory manufacturers (Samsung, SK Hynix, and Micron) are shifting production capacity toward high-bandwidth memory and AI server applications. Conventional memory supply is getting squeezed across every other segment. Micron told investors it can meet only about 50% of key customer demand. The shortage is not expected to ease before 2028.

This is not a normal cyclical shortage. It is a structural reallocation. AI's appetite for memory, at every tier of the stack, is repricing the entire semiconductor memory market.

Micron crossed a $1 trillion market cap on May 26, capping a roughly 840% one-year run. SK hynix and Samsung have rerated alongside.

Micron is up over 850% in one year.

The trade is not over though. We laid out the full setup in our memory thesis, back when Micron was still at $500M+ market cap.

The underlying gap explains why. Compute throughput (FLOPS) has scaled roughly 8x from A100 to B200. Memory bandwidth has scaled about 4x. Capacity, just 2.4x. The gap between what the chip can do and what the memory can feed it grows with every generation. What is new is that the gap now shows up in contract negotiations, earnings calls, and component shortages worldwide.

Memory is the bottleneck. It is the cost driver. And increasingly, it is the thing that determines which models can be deployed and by whom.

Our research piece walks you through the memory hierarchy in AI and shows what sits where, what each layer solves, and why it all matters more than the chip itself.

Why Memory Became AI's Binding Constraint

The memory wall is not new. Compute has outpaced memory for decades. But in the past 5-7 years, AI made the gap existential.

  • Compute outran bandwidth: Across recent NVIDIA data center GPU generations, compute performance has increased much faster than memory bandwidth. Newer chips can perform far more operations per second, but feeding those compute units with data has become increasingly difficult. That is why high-bandwidth memory and advanced packaging now matter so much.

  • Models exploded in size: Models crossed a trillion parameters in 2025. OpenAI's GPT-5 has more than 1 trillion dense parameters. Multiple open-source models from DeepSeek, Moonshot, and Alibaba each exceed a trillion through mixture-of-experts architectures. Even the "small" frontier open-source models today sit at 70 to 400 billion parameters. At the top end, that is a roughly 1,200x increase from GPT-2 in 2019. Single-GPU memory grew only about 10x over the same period.

  • Context windows blew up: Gemini 3 Pro supports 2 million tokens, and Llama 4 Scout supports up to 10 million. Many frontier models now offer 1 million tokens as a standard feature. Every token added to the context creates a key-value, or KV, cache entry. This is the model’s working memory of what it has already read. For a 70B model, each token adds roughly 320 KB to the KV-cache. At 128K tokens, one conversation can consume about 40 GB of KV-cache.

  • Training and inference pressure memory in different ways: Training processes large batches of data at once. Because the GPU performs a great deal of computation on each batch, it stays busy enough that memory can often keep pace. Training still requires huge amounts of total memory. A 70B model can require roughly 1.12 TB for weights, gradients, and Adam optimizer states combined, at about 16 bytes per parameter in mixed precision. But in many training workloads, the primary limit is total compute, not how quickly memory can deliver data.

Inference is different. Tokens are generated one at a time. For each new token, the system has to read the model’s parameters from memory to produce the next output, then repeat that process again for the following token. In a 70B model, that can mean repeatedly reading around 140 GB of weights from memory just to generate the next word. As a result, the GPU often spends more time waiting for data than performing computation.
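A rough way to see this bound: if each generated token requires one full pass over the weights, memory bandwidth caps single-stream decode speed. The sketch below is a back-of-envelope estimate, not a benchmark. It assumes FP16 weights and a batch of one, and it ignores KV-cache reads, capacity limits, and multi-GPU overhead. The function name is illustrative, and the bandwidth figures are the ones quoted in this piece.

```python
TB = 1e12  # bytes

def decode_ceiling_tokens_per_s(weight_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Upper bound on tokens/s when every token rereads all weights once."""
    seconds_per_token = weight_bytes / bandwidth_bytes_per_s
    return 1.0 / seconds_per_token

# 70B parameters at 2 bytes each (FP16) is about 140 GB of weights.
weights_70b_fp16 = 70e9 * 2

# Assumption: bandwidth figures as quoted in this article. Capacity is ignored,
# so this is a per-stream ceiling, not a claim that the model fits on one chip.
for name, bandwidth in [("H100", 3.35 * TB), ("H200", 4.8 * TB), ("B200", 8.0 * TB)]:
    ceiling = decode_ceiling_tokens_per_s(weights_70b_fp16, bandwidth)
    print(f"{name}: at most ~{ceiling:.0f} tokens/s per stream")
```

Batching raises aggregate throughput because one pass over the weights can serve many requests at once, but the per-stream ceiling stays tied to bandwidth.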

This is the core tension. In training, the memory constraint is capacity. In inference, it is bandwidth. And as context windows grow, capacity becomes a problem at inference too, because the KV-cache keeps expanding.
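The KV-cache figures quoted earlier (about 320 KB per token, about 40 GB at 128K tokens) can be reproduced with simple arithmetic. The sketch below assumes a Llama-style 70B shape, which this piece does not specify: 80 layers, 8 key-value heads (grouped-query attention), a head dimension of 128, and FP16 values. Other 70B-class models will give different numbers.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes of KV-cache added per token: one key and one value vector
    per layer per KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value

# Assumed model shape (hypothetical, Llama-style 70B with grouped-query attention).
per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128)

context_tokens = 128 * 1024
total = per_token * context_tokens

print(f"{per_token / 1024:.0f} KiB per token")                          # 320 KiB per token
print(f"{total / 1024**3:.0f} GiB at {context_tokens} tokens")          # 40 GiB at 131072 tokens
```

The cache grows linearly with context length and with the number of concurrent conversations, which is why long-context serving runs into capacity limits even when the weights fit.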

That distinction matters because people often use the word “memory” as if it refers to a single bottleneck. It does not. In AI systems, memory pressure usually comes from three separate dimensions.

Capacity, Bandwidth, and Latency: The Bottleneck Is Not One Thing

To be precise:

  • Capacity is how much data the system can hold at once. Can the model fit in memory? Can the KV-cache fit alongside it?

  • Bandwidth is how quickly data can move between memory and compute. Can the system feed the GPU fast enough to avoid idle cycles?

  • Latency is how long it takes to retrieve data once requested. How quickly can the system respond when it needs data that is not already close at hand?

Most memory technologies improve one or two of these dimensions, not all three at once.

HBM solves bandwidth, but its access latency is similar to that of regular DRAM. Offloading weights to an SSD solves capacity, but even a cutting-edge SSD is significantly slower than HBM at moving data. SRAM offers very low latency, but it is too expensive and area-constrained to provide at large capacity.

AI systems do not rely on a single memory type because no single memory type can satisfy every requirement at once. Instead, they stack different tiers of memory, each optimized for a different part of the problem.

That is the right lens for understanding the modern memory stack.

The Memory Stack: How the Memory Hierarchy Is Organized

An AI system uses four tiers of memory, stacked by proximity to the processor.

At the top: tiny, blazing fast, expensive. At the bottom: massive, slow, cheap. No single memory technology is fast, cheap, and high-capacity at the same time. If it were, we would not need the hierarchy at all.

Some of these memories are volatile, meaning the data vanishes the instant you cut power. SRAM, DRAM, and HBM all fall into this category. They are working memory. Fast and temporary.

Others are non-volatile, which means data remains stored even when the system is turned off. NAND flash, which underpins SSDs, is the main example in AI systems. This is storage. Slow and permanent.

The hierarchy stacks them by proximity to the processor:

  • SRAM sits on the chip itself, nanometers from the compute cores.

  • HBM sits on the same package. Millimeters away, connected through silicon.

  • DRAM sits on the motherboard. Centimeters away, connected through copper traces on a circuit board.

  • NAND/SSDs sit off-board entirely, connected through a cable or bus.

Each step down trades speed for capacity. Each step up trades capacity for speed. The entire art of AI hardware design is deciding what data lives where, and moving it between tiers as efficiently as possible.

SRAM: The Fastest Memory You Will Never Have Enough Of

SRAM is the fastest memory in the system. It sits directly on the GPU die, closest to the compute cores, and can be accessed in under a nanosecond. It also delivers far more bandwidth than main memory. This is where the core math of AI workloads is fed and executed.

What SRAM solves is speed. What it does not solve is capacity.

You only get a tiny amount of it. An H100 has about 50 MB of on-chip SRAM. A B200 has about 126 MB. SRAM is extremely expensive in chip area, so every bit of SRAM added is space that cannot be used for more compute. SRAM is effectively $5,000+ per GB when you account for the chip real estate it consumes.

This is why so much AI software work exists in the first place. Techniques like FlashAttention, tiling, and operator fusion are all ways of using a small amount of SRAM as efficiently as possible: load a chunk of data, do as much work on it as you can, then fetch the next chunk from slower memory. Get it wrong, and most of the GPU sits idle.
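The "load a chunk, do as much work on it as you can" pattern is easiest to see in a blocked matrix multiply. The sketch below is plain Python, not GPU code and not FlashAttention itself. The small lists `a_tile` and `b_tile` stand in for SRAM, and `slow_loads` counts elements fetched from the larger, slower memory. All names are illustrative.

```python
def tiled_matmul(a, b, tile):
    """Multiply square matrices in tile x tile blocks.
    Returns (result, number of elements loaded from 'slow' memory)."""
    n = len(a)
    assert n % tile == 0, "tile must divide the matrix size"
    c = [[0.0] * n for _ in range(n)]
    slow_loads = 0
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                # Fetch one tile of A and one tile of B into "fast" memory.
                a_tile = [row[k0:k0 + tile] for row in a[i0:i0 + tile]]
                b_tile = [row[j0:j0 + tile] for row in b[k0:k0 + tile]]
                slow_loads += 2 * tile * tile
                # Reuse: each fetched element takes part in `tile` multiply-adds.
                for i in range(tile):
                    for j in range(tile):
                        acc = 0.0
                        for k in range(tile):
                            acc += a_tile[i][k] * b_tile[k][j]
                        c[i0 + i][j0 + j] += acc
    return c, slow_loads

n = 8
a = [[float(i + j) for j in range(n)] for i in range(n)]
b = [[float((i * j) % 5) for j in range(n)] for i in range(n)]

untiled, untiled_loads = tiled_matmul(a, b, tile=1)
tiled, tiled_loads = tiled_matmul(a, b, tile=4)
print(untiled == tiled, untiled_loads, tiled_loads)  # True 1024 256
```

Same arithmetic, a quarter of the slow-memory traffic. In this model the traffic is 2n³/tile, so the biggest tile that fits in fast memory wins, which is why SRAM capacity matters so much.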

SRAM solves the speed problem. Its limited size is what makes the rest of the memory hierarchy necessary.

DRAM: The Workhorse That Cannot Keep Up

DRAM lives on separate chips soldered onto the motherboard, physically separate from the GPU. The data travels centimeters, not nanometers.

Much more room. Much cheaper. But every delivery takes time.

The DRAM family has several variants:

  • DDR5, the standard memory used in servers and desktops

  • LPDDR5X, the lower-power version used in laptops, mobile devices, and increasingly server CPUs.

  • GDDR, the graphics-focused version used in consumer GPUs

HBM is also a form of DRAM, but with a very different physical design built to deliver far higher bandwidth.

What standard DRAM solves: Capacity at reasonable cost. Servers can carry terabytes of DDR5. Laptops can carry 128 GB of LPDDR5X. It is affordable, mature, and everywhere.

What standard DRAM cannot solve: feeding modern AI accelerators fast enough. A high-end DDR5 server delivers only a fraction of the bandwidth that top GPUs draw from their own memory. An H100 has about 3.35 TB/s of memory bandwidth, and a B200 roughly 8 TB/s. Standard DRAM falls well short.

That bandwidth ceiling is precisely the problem that HBM was invented to fix.

HBM: The Tier That Runs AI

HBM is still DRAM, but it is packaged very differently from conventional memory. Instead of sitting farther away on the motherboard, HBM is placed close to the GPU in the same package and linked through advanced silicon interconnects. That physical arrangement is what gives it such a large bandwidth advantage.

It’s a very simple idea, actually: shorten the distance and widen the data path.

HBM shortens the path by moving memory close to the processor. It widens the path by giving each memory stack a very large interface, commonly 1,024 bits per stack in HBM3 and HBM3E. HBM4 doubles that to 2,048 I/Os, pushing bandwidth even further. By comparison, a standard DDR5 memory channel is 64 bits wide. The silicon interposer is what makes those extremely wide, dense, short-range connections practical.

HBM also goes vertical. Multiple DRAM dies are stacked on top of one another and connected with through-silicon vias, or TSVs, which act like elevator shafts running through a skyscraper, letting every floor communicate directly. TSVs are challenging to manufacture. Current HBM products commonly use 8-high and 12-high stacks.

The result is a huge jump in bandwidth. A single HBM3E stack can deliver roughly 1 TB/s, while a B200-class GPU reaches up to 8 TB/s of total HBM bandwidth. That is how HBM pushes past the memory wall: by changing how DRAM is arranged and connected.
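The "widen the data path" idea reduces to one multiplication: peak bandwidth is the number of data pins times the rate of each pin. The sketch below assumes 6.4 Gb/s per pin for both parts (a DDR5-6400 channel and HBM3 at its peak rate). Those pin speeds are not stated in this piece, and shipping products often run below peak.

```python
def peak_bandwidth_gb_s(interface_bits: int, gbit_per_s_per_pin: float) -> float:
    """Peak bandwidth in GB/s: data pins times per-pin rate, divided by 8 bits per byte."""
    return interface_bits * gbit_per_s_per_pin / 8

# Assumption: 6.4 Gb/s per pin in both cases, so only the width differs.
ddr5_channel = peak_bandwidth_gb_s(64, 6.4)
hbm3_stack = peak_bandwidth_gb_s(1024, 6.4)

print(f"DDR5 channel: {ddr5_channel:.1f} GB/s")
print(f"HBM3 stack:   {hbm3_stack:.1f} GB/s")
print(f"Width advantage at equal pin speed: {hbm3_stack / ddr5_channel:.0f}x")

# Total GPU bandwidth is roughly stacks times per-stack bandwidth:
# eight stacks at ~1 TB/s each is consistent with the 8 TB/s figure above.
b200_class_total_tb_s = 8 * 1.0
print(f"B200-class total: ~{b200_class_total_tb_s:.0f} TB/s")
```

At equal pin speed the 16x gain comes purely from the 1,024-bit interface, which is only practical because the memory sits millimeters from the processor.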

Why AI Pulled HBM Into the Center

HBM existed before the AI boom. It was originally built for graphics and high-performance computing. But large language models turned it from a specialized memory technology into a critical part of modern AI systems.

The reason is simple: serving large models is often limited by how fast data can move from memory to the chip.

When a model generates text, it has to repeatedly pull its weights from memory to produce each new token. For a 70B model in FP16, that means working with roughly 140 GB of model weights again and again during inference. In many cases, the speed of output depends less on raw compute and more on how quickly memory can feed the processor.

Two main generations serve the current AI fleet:

  • HBM3 (2022) delivers roughly 819 GB/s per stack with 16–24 GB capacity. The H100 uses five stacks: 80 GB at 3.35 TB/s total.

  • HBM3E (2023) pushes per-stack bandwidth to ~1 TB/s. The H200 uses six stacks: 141 GB at 4.8 TB/s. The B200 uses eight stacks: 192 GB at 8 TB/s. Google’s Ironwood TPU is in the same range.

Each HBM stack is smaller than a penny. Total HBM bandwidth scales mainly with the number of stacks and the bandwidth of each stack, while stack height mainly affects capacity per stack.

Capacity matters alongside bandwidth. The progression from H100 at 80 GB, to H200 at 141 GB, to B200 at 192 GB makes it possible to run larger models on a single chip and support longer contexts before KV-cache pressure forces work across multiple accelerators.

What HBM solves: enough bandwidth to keep modern AI accelerators productively fed during inference.

What HBM does not solve: cost, and increasingly capacity. Even 192 GB is still not enough for the largest models or the longest contexts without splitting work across multiple chips.
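To make the capacity pressure concrete, the sketch below counts the minimum number of accelerators whose combined HBM can hold a model plus one long conversation. It uses figures quoted in this piece (140 GB of FP16 weights for a 70B model, about 40 GB of KV-cache at 128K tokens). It is a lower bound under simplifying assumptions: no activations, no framework overhead, and perfect sharding. Real deployments need more headroom.

```python
import math

def accelerators_needed(weights_gb: float, kv_cache_gb: float, hbm_gb: float) -> int:
    """Minimum chip count whose combined HBM holds weights plus KV-cache."""
    return math.ceil((weights_gb + kv_cache_gb) / hbm_gb)

weights_gb = 140   # 70B model in FP16, as quoted in this article
kv_cache_gb = 40   # one 128K-token conversation, as quoted in this article

for name, hbm_gb in [("H100", 80), ("H200", 141), ("B200", 192)]:
    count = accelerators_needed(weights_gb, kv_cache_gb, hbm_gb)
    print(f"{name} ({hbm_gb} GB): at least {count} chip(s)")
```

Under these assumptions the count falls from three H100s to two H200s to a single B200, and a second concurrent long conversation already pushes the B200 past its limit.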

Why HBM Is Expensive and Supply-Constrained

Just three companies supply virtually all of the world's HBM:

  • SK Hynix

  • Samsung

  • Micron

That is the entire global supply chain for the most critical component in AI hardware.

HBM is why AI GPUs cost what they do. Analyst teardown estimates place HBM at 40-50% of the H100's manufacturing cost. On the B200, HBM accounts for approximately 45% (around $2,900 of ~$6,400 total). Memory, not the GPU logic, is the single largest cost component. And that share is growing.

Why so expensive? Three reasons that compound on each other.

  • Manufacturing is brutal: Stacking eight DRAM layers with microscopic copper connections requires extreme precision. Yields compound multiplicatively: a 99% per-layer yield drops to about 92% across 8 layers. Every failed stack wastes the entire assembly. SK Hynix achieves roughly 80% final yield on HBM3E. That is considered very good.

  • Packaging is a bottleneck: HBM must be placed on the same package as the GPU through TSMC's advanced CoWoS (Chip-on-Wafer-on-Substrate) process. CoWoS capacity has grown to an estimated 65,000–80,000 wafers per month by the end of 2025, but NVIDIA alone consumes roughly 60% of total capacity. If CoWoS is full, it does not matter how many HBM stacks exist. They cannot be assembled into GPUs.

  • Demand far exceeds supply: The pricing premium HBM commands over conventional DRAM has compressed from a peak of 18.4x in June 2025 to roughly 2.6x in April 2026. On the surface that reads as easing, but it is the opposite. HBM prices have stayed elevated. DRAM has caught up. As Samsung, SK Hynix, and Micron pulled wafer capacity toward HBM, conventional DRAM supply tightened across every other segment and prices surged.

    From Tessara’s Memory Desk: HBM holds 2.64x premium over DRAM

    Both SK Hynix and Micron have confirmed their entire 2026 supply is sold out. Samsung and SK Hynix raised 2026 contract prices by roughly 20%. When people ask why an H100 costs $25,000+, a large part of the answer is: the memory on it is expensive, scarce, and getting more so.
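The yield compounding mentioned in the manufacturing point above is a one-line calculation. The sketch below assumes each layer fails independently, which is a simplification: real stack yield also depends on bonding, testing, and packaging steps that this model ignores. Function names are illustrative.

```python
def stack_yield(per_layer_yield: float, layers: int) -> float:
    """Probability that every layer in a stack is good, assuming independent defects."""
    return per_layer_yield ** layers

def required_layer_yield(target_stack_yield: float, layers: int) -> float:
    """Per-layer yield needed to reach a target stack yield under the same assumption."""
    return target_stack_yield ** (1.0 / layers)

for layers in (8, 12):
    at_99 = stack_yield(0.99, layers)
    needed = required_layer_yield(0.80, layers)
    print(f"{layers}-high: 99% per layer gives {at_99:.1%}; "
          f"an 80% stack yield needs {needed:.1%} per layer")
```

Taller stacks make the math less forgiving: the same per-layer yield produces fewer good stacks at 12-high than at 8-high.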

What Comes Next: HBM4

The JEDEC HBM4 specification, released April 2025, is the most significant generational leap in HBM's history.

The data interface doubles from 1,024 bits to 2,048 bits per stack. Per-stack bandwidth pushes past 2.8 TB/s. Capacity scales up to 64 GB per stack, meaning an 8-stack GPU could carry up to 512 GB. That’s roughly 6× the H100's capacity.
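These headline numbers can be unpacked with a little arithmetic. The sketch below derives the per-pin data rate implied by 2.8 TB/s over a 2,048-bit interface, then totals capacity and bandwidth for a hypothetical 8-stack GPU. The 8-stack configuration is illustrative, not a description of any announced product.

```python
def implied_pin_rate_gbit_s(stack_bandwidth_tb_s: float, interface_bits: int) -> float:
    """Per-pin data rate (Gb/s) implied by a per-stack bandwidth and interface width."""
    bits_per_second = stack_bandwidth_tb_s * 1e12 * 8
    return bits_per_second / interface_bits / 1e9

hbm4_pin_rate = implied_pin_rate_gbit_s(2.8, 2048)
print(f"Implied HBM4 pin rate: ~{hbm4_pin_rate:.1f} Gb/s")

# Hypothetical 8-stack GPU using the per-stack maxima quoted above.
stacks = 8
total_capacity_gb = stacks * 64
total_bandwidth_tb_s = stacks * 2.8

print(f"Capacity:  {total_capacity_gb} GB ({total_capacity_gb / 80:.1f}x an 80 GB H100)")
print(f"Bandwidth: {total_bandwidth_tb_s:.1f} TB/s ({total_bandwidth_tb_s / 8:.1f}x a B200's 8 TB/s)")
```

Doubling the interface width does half the work. The rest of the gain has to come from faster pins, which is where signal integrity and power start to bite.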

But the most architecturally interesting change is the customizable logic base die. In previous generations, the base die at the bottom of each HBM stack was standardized.

In HBM4, chip designers like NVIDIA and AMD can build their own custom base die. This lets them implement custom memory controllers or near-memory compute functions directly inside the memory stack.

This opens the door to processing data at the memory rather than shipping it across the package to a separate chip. The boundary between memory and compute begins to blur.

SK Hynix began mass production in the second half of 2025. Samsung targets 2026. NVIDIA has confirmed that its Rubin platform uses HBM4 and says the new memory system nearly triples bandwidth versus Blackwell.

NAND Flash / SSDs: The Cold Tier With a Growing Role

This is the most physically distant memory from the compute cores, sitting on separate NVMe drives, connected to the system through a PCIe bus or over a network. Entirely off-board.

Enterprise NVMe SSDs cost roughly $0.05–$0.15 per GB. That’s about 100× cheaper than HBM. But they are also roughly 500–1,000× slower.
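The cost comparison can be sanity-checked against numbers that appear elsewhere in this piece. The sketch below derives an implied HBM price per GB from the B200 estimate (around $2,900 for 192 GB) and compares it with the SSD range and the $5,000+ per GB SRAM figure. These are rough analyst-style estimates, not list prices.

```python
def usd_per_gb(total_cost_usd: float, capacity_gb: float) -> float:
    """Cost per GB from a total cost and a capacity."""
    return total_cost_usd / capacity_gb

hbm = usd_per_gb(2900, 192)       # B200 HBM estimate quoted in this article
ssd_low, ssd_high = 0.05, 0.15    # enterprise NVMe range quoted in this article
sram = 5000.0                     # effective on-chip SRAM cost quoted in this article

print(f"HBM:  ~${hbm:.2f} per GB")
print(f"HBM vs SSD:  {hbm / ssd_high:.0f}x to {hbm / ssd_low:.0f}x more expensive")
print(f"SRAM vs HBM: ~{sram / hbm:.0f}x more expensive")
```

Each step up the hierarchy costs roughly two orders of magnitude more per GB, which is why every tier stays as small as the workload allows.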

Unlike SRAM, DRAM, and HBM, NAND flash is non-volatile. It keeps data even when power is off, which is why it serves as storage rather than working memory. It works by trapping electrons inside transistor cells. The trapped charge represents a 0 or a 1, and it persists without electricity.

What SSDs solve: cheap, persistent capacity. They are used for model checkpoints, dataset storage, and weight repositories.

What SSDs cannot solve: active inference bandwidth. Moving model weights from SSD is far too slow for real-time serving.

Their role is starting to expand. Systems like FlexGen showed that SSDs can be used as overflow memory by shuttling weights between SSD, CPU memory, and GPU memory. That makes it possible to run models that do not fit in GPU memory, but at a major speed penalty: it achieved about 1 token per second.

So SSD offloading is a capacity workaround, not a bandwidth solution. For fast inference, the working set still needs to live in HBM.
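A quick estimate shows how large the gap is. The sketch below computes how long one full pass over 140 GB of weights takes from HBM and from an SSD. It assumes the "500 to 1,000 times slower" range applies relative to an H100's 3.35 TB/s, which is an interpretation of the figures in this piece rather than a measured drive speed.

```python
def seconds_per_weight_pass(weights_gb: float, bandwidth_gb_s: float) -> float:
    """Time to stream all weights once at a given sustained bandwidth."""
    return weights_gb / bandwidth_gb_s

weights_gb = 140    # 70B model in FP16
hbm_gb_s = 3350     # H100 HBM bandwidth, as quoted in this article

print(f"HBM: {seconds_per_weight_pass(weights_gb, hbm_gb_s) * 1000:.0f} ms per pass")

# Assumption: SSD bandwidth is HBM bandwidth divided by the quoted slowdown.
for slowdown in (500, 1000):
    ssd_gb_s = hbm_gb_s / slowdown
    seconds = seconds_per_weight_pass(weights_gb, ssd_gb_s)
    print(f"SSD at {ssd_gb_s:.2f} GB/s: {seconds:.0f} s per pass")
```

Tens of seconds per pass is unusable for a single interactive stream. Offloading systems make it tolerable by processing many sequences per pass, trading latency for throughput.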

The Investment Map

So far, this piece has been about physics. How data moves through silicon, which memory sits where, and why each tier exists. But every technical constraint we just walked through has a dollar sign attached to it.

HBM is scarce. Three companies make it. Packaging capacity is booked out. These are market structures with clear winners, clear losers, and a set of emerging bets on what comes after the current architecture.

Here is how the memory hierarchy maps to capital.

Who benefits from the shortage

The clearest winners are SK hynix, Samsung, and Micron, the three companies that control virtually all HBM supply. All three have confirmed their 2025 and 2026 supply is sold out. That means much of their future production capacity has already been reserved in advance by customers such as NVIDIA and hyperscalers. The chips have not all been manufactured yet. What is sold out are the production slots. That gives suppliers stronger pricing power and puts the largest buyers first in line.

Market share varies depending on the period and methodology, but the hierarchy is clear. SK hynix remains the leader, Samsung is trying to regain ground, and Micron has become a credible third supplier. Counterpoint Research puts Q3 2025 HBM share at roughly SK hynix 53%, Samsung 35%, Micron 11%.

Micron is investing heavily to expand memory and HBM-related capacity. In January 2026, Reuters reported a $24 billion memory chipmaking plant in Singapore and a separate $7 billion HBM packaging facility there that is expected to start contributing supply in 2027.

TSMC benefits from a separate choke point. HBM only matters once it is packaged alongside the GPU through advanced packaging, especially CoWoS, where TSMC remains the dominant player at scale. That means the shortage is really two shortages layered together: memory supply and packaging capacity. Even if HBM output rises, supply does not truly ease unless packaging expands with it.

That is why the real question is not whether supply will grow. It will. It is whether supply is growing faster than demand, and whether the ramp is translating into real shipped systems. To answer that, we are closely tracking four things on Tessara:

  1. Customer demand

  2. HBM output

  3. Packaging capacity, and

  4. Manufacturing yield.

HBM constraint tracker on Tessara

If suppliers add capacity but customers are still locking supply years in advance, the shortage has not really eased. If HBM output rises but CoWoS packaging stays constrained, the bottleneck has simply moved downstream.

In practice, the market is easing only when several things happen at once: hyperscalers stop scrambling to reserve future supply, suppliers stop saying output is fully committed, packaging lead times begin to compress, and new generations like HBM4 ramp without yield problems. Until then, “capacity expansion” should be treated cautiously. What matters is not announced supply, but usable supply that can actually be packaged, shipped, and deployed.

Who gets squeezed

All of us!

As memory makers shift more capacity toward high-margin AI memory like HBM, supply tightens for the conventional DRAM and NAND used in PCs, smartphones, and other consumer devices. That pushes up memory prices and puts pressure on non-AI buyers first.

The most visible effect is higher device costs, especially in lower-end electronics where memory is a larger share of the bill of materials. Some smartphone makers are already responding by cutting production or lowering specifications. Expect our iPhones to get more expensive soon.

For PC vendors, memory’s share of total bill of materials is rising sharply. It typically accounts for about 15% to 18% of PC materials cost, but could rise to as much as 35% to 40% as memory prices surge. Vendors including Lenovo and HP have already warned that rising memory costs are pressuring shipments and forcing pricing adjustments, while IDC expects the PC market to contract by at least 4.9% in 2026.
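Those share figures imply a specific price move. The sketch below works backward from the bill-of-materials shares to the memory price multiple that would produce them, assuming every other component's cost stays flat. That assumption is ours, not something the cited vendors have stated, and pairing the low ends and high ends of the two ranges is also an illustrative choice.

```python
def implied_price_multiplier(old_share: float, new_share: float) -> float:
    """Memory price multiple that moves its BOM share from old_share to new_share,
    assuming all other component costs stay flat."""
    return new_share * (1 - old_share) / (old_share * (1 - new_share))

def bom_growth(old_share: float, multiplier: float) -> float:
    """Total BOM cost relative to before, after memory prices change by `multiplier`."""
    return 1 + old_share * (multiplier - 1)

for old, new in [(0.15, 0.35), (0.18, 0.40)]:
    m = implied_price_multiplier(old, new)
    growth = bom_growth(old, m) - 1
    print(f"{old:.0%} -> {new:.0%}: memory ~{m:.1f}x pricier, total BOM up ~{growth:.0%}")
```

Under that assumption, both ends of the range point to memory costing roughly three times as much, enough to lift total materials cost by about a third.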

Cloud providers sit in a more complex position. They benefit from rising AI demand, but they also bear the higher cost of the memory and infrastructure required to serve it.

The Next Wave

SK hynix appears to have the early HBM4 lead, which is not surprising since it jointly developed the HBM standard with AMD back in 2014. The most important architectural change is the new customizable base die, which gives accelerator designers more control over how memory is managed and integrated with the compute system. That does not mean memory and compute have fully merged, but it does mean HBM is becoming less of a commodity component and more of a semi-custom part of the AI stack.

Samsung and Micron are not far behind. Samsung has already started shipping HBM4 to customers, and Micron is pushing hard on next-generation memory and packaging. The race is no longer just about adding capacity. It is about ramping fast enough, with strong enough yields, to win qualification slots on the next wave of AI accelerators.

Micron’s partnership with Applied Materials is a good example of where the industry is going. The two are co-developing the next generation of DRAM, HBM, and NAND. The focus is higher performance and energy efficiency for AI-specific memory. This is a signal that the memory manufacturers are not just scaling capacity but actively retooling their process technology for AI workloads.

Further out, the next set of advances will likely come from deeper packaging and interconnect changes: higher stack counts, eventual hybrid bonding, and tighter links between memory and compute. The current path of more stacks, wider interfaces, and better packaging still has more room to run. But the industry is also laying the groundwork for a more fundamental redesign of how AI systems move data.

Where the Memory Wall Goes From Here

We started this piece with a simple observation: the GPU is not the bottleneck. Memory is.

But now you can see that "memory" was always too vague a word. The real picture is a hierarchy, four tiers of silicon and storage, each solving a different vertex of the tradeoff triangle.

Every token your LLM generates is gated not by how fast the chip can multiply, but by how fast it can read. Every dollar spent on an AI accelerator is increasingly a dollar spent on memory. Every architectural decision, from how many GPUs to deploy to how much context to support to which models can be served, traces back to the hierarchy we just walked.

The companies that solve memory movement will define the next era of AI performance. The ones that treat memory as just "how many GBs" will build faster chips that spend most of their time doing nothing.

Cheers,

Teng Yan & Arvind

Tessara is the live supply-chain map of the AI build, for investors. We track what's binding in the supply chain and what it means for what you own. 300+ companies across compute, memory, foundry, networking, and power.

This article is for informational and research purposes only. It is not financial advice, investment advice, or a recommendation to buy or sell any security. Tessara Research does not publish price targets. The views expressed here reflect our analysis at the time of publication and may change as new evidence arrives. Readers should do their own research and consult a qualified financial adviser before making investment decisions.
