New Power, Memory, Interconnect, and Thermal Architectures for AI Infrastructure at Scale
The trajectory of new industries can be uncertain, with some experiencing a brief moment of glory, others struggling to stay relevant, and a select few continuing to innovate and evolve. The AI industry is now going through another evolution focused on inference. Over the last few years, AI data centers have developed novel architectures to achieve the compute performance required to train Large Language Models (LLMs). While those architectures were effective for training LLMs, they fall short for AI inference, where performance depends on data movement, energy efficiency, and interconnect bandwidth.
As the industry shifts from training to inference, analysts project inference will comprise 85% of all enterprise AI workloads within three years. This shift is already highlighting system-level constraints in current AI infrastructure: finite global power capacity, ineffective memory management for inference, unsustainable heat flux levels at scale, and interconnects that cannot support rack-scale computing.
These constraints manifest as four distinct yet interconnected bottlenecks: the power, memory, thermal, and copper walls. Scaling AI infrastructure efficiently and sustainably requires addressing all four simultaneously. This article defines each constraint, outlines the architectural responses emerging to address it, and explains why system-level co-design is the most effective approach.
The Power Wall: Token Economics and Grid-to-Core Efficiency
Power has become the most limited resource in AI data center operations. The United States operates approximately 1,250 gigawatts of total generation capacity, yet meeting the combined demands of AI inference and training will require approximately another 400 gigawatts to be added within three years. This gap cannot be closed through grid expansion alone.
In response, hyperscalers are pursuing “bring your own power” as a core strategy, sourcing generation capacity independently of the grid. xAI, for example, deploys on-site gas and diesel generators to keep its data centers running outside grid constraints, signaling a structural shift in how AI infrastructure planners approach energy procurement. Even with sufficient supply, however, efficiency remains the fundamental constraint and the defining metric for AI data centers.
To address the power wall, data center operators must increase tokens/watt efficiency, ensuring AI inference remains sustainable and economically viable. The cost of generating each token reflects both compute efficiency and the underlying power delivery architecture. Reducing cost per token requires improvements in accelerator design and system-level power delivery and management to minimize losses.
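The link between tokens/watt and cost per token can be sketched with simple arithmetic. The snippet below is a minimal illustration, not a vendor costing model: the function names, the throughput, the power draw, the delivery efficiencies, and the electricity price are all hypothetical values chosen for the example. Note that tokens per second divided by watts is equivalent to tokens per joule.

```python
def tokens_per_joule(tokens_per_second: float, accelerator_watts: float) -> float:
    """Tokens generated per joule at the accelerator (tokens/s divided by W)."""
    return tokens_per_second / accelerator_watts


def energy_cost_per_million_tokens(
    tokens_per_second: float,
    accelerator_watts: float,
    delivery_efficiency: float,
    price_per_kwh: float,
) -> float:
    """Electricity cost of one million tokens, including power delivery losses."""
    # Power drawn upstream exceeds the power that actually reaches the load.
    wall_watts = accelerator_watts / delivery_efficiency
    joules_per_token = wall_watts / tokens_per_second
    kwh_per_million_tokens = joules_per_token * 1_000_000 / 3_600_000  # 1 kWh = 3.6 MJ
    return kwh_per_million_tokens * price_per_kwh


# Hypothetical accelerator: 1,000 tokens/s at 500 W, electricity at 0.10 per kWh.
baseline = energy_cost_per_million_tokens(1000, 500, 0.80, 0.10)  # 80% delivery efficiency
improved = energy_cost_per_million_tokens(1000, 500, 0.90, 0.10)  # 90% delivery efficiency

print(f"tokens per joule at the load: {tokens_per_joule(1000, 500):.2f}")
print(f"energy cost per 1M tokens, baseline: {baseline:.4f}")
print(f"energy cost per 1M tokens, improved: {improved:.4f}")
```

With the accelerator unchanged, the cost difference comes entirely from the delivery path, which is why the article treats accelerator design and power delivery as joint levers.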
How quickly power delivery responds to dynamic workloads determines whether efficiency targets are met. Inference workloads generate rapid, bursty demand as queries arrive and models activate different compute paths. This imposes strict requirements on power delivery networks, which must respond quickly while maintaining stable voltage levels. Addressing these requirements demands architectural changes across the entire delivery chain, from the facility to the processor.
At the facility level, high-voltage distribution, including emerging 800V architectures, reduces conversion losses. Solid-state transformers (SSTs) eliminate low-frequency conversion stages, feed DC microgrids directly, and reduce conversion steps between the medium-voltage grid and the processor, improving overall system efficiency.
Closer to the processor, power delivery architectures are evolving from grid to core, advancing in stages to improve end-to-end efficiency. Discrete voltage regulator modules (VRMs) move regulation closer to the load, while modular integrated voltage regulators migrate to the substrate, shortening the delivery path. The final stage embeds regulation directly in silicon, achieving point-of-load delivery at the processor die.
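Why removing conversion steps matters can be seen by multiplying stage efficiencies along the grid-to-core chain. The sketch below assumes each stage is described by a single fixed efficiency, which is a simplification since real converters vary with load. The per-stage figures and the two example chains are hypothetical and do not describe any specific product.

```python
from math import prod


def chain_efficiency(stage_efficiencies):
    """End-to-end efficiency of conversion stages connected in series."""
    return prod(stage_efficiencies)


def upstream_power(load_watts, stage_efficiencies):
    """Power that must enter the chain to deliver load_watts at the far end."""
    return load_watts / chain_efficiency(stage_efficiencies)


# Hypothetical per-stage efficiencies, for illustration only.
legacy_chain = [0.98, 0.96, 0.97, 0.95, 0.90]  # five series conversion steps
shorter_chain = [0.98, 0.975, 0.92]            # fewer steps between grid and core

load_watts = 1000.0  # power consumed at the processor
for name, chain in [("legacy", legacy_chain), ("shorter", shorter_chain)]:
    drawn = upstream_power(load_watts, chain)
    print(f"{name}: efficiency {chain_efficiency(chain):.3f}, "
          f"draws {drawn:.0f} W, loses {drawn - load_watts:.0f} W")
```

Because the efficiencies multiply, even stages that look efficient individually compound into a sizable end-to-end loss.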
Distance matters: each additional millimeter between a voltage regulator and its processor introduces losses that can scale to hundreds of watts at the data center level. Advanced digital controllers enable fast transient response, phase management, and adaptive regulation across these dense, high-current delivery paths.
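The distance effect follows from resistive conduction loss, P = I²R, where path resistance grows with length. The sketch below assumes a uniform resistance per millimeter. The current, the resistance figure, and the processor count are hypothetical values for illustration, not measurements.

```python
def conduction_loss_watts(current_amps, resistance_ohms_per_mm, path_length_mm):
    """Resistive loss P = I^2 * R for a path with uniform resistance per mm."""
    resistance_ohms = resistance_ohms_per_mm * path_length_mm
    return current_amps ** 2 * resistance_ohms


# Hypothetical values: a 1,000 A core rail and 0.5 micro-ohm per mm of path.
core_current_amps = 1000.0
ohms_per_mm = 0.5e-6
extra_mm = 1.0
processor_count = 1000  # hypothetical fleet size

per_processor = conduction_loss_watts(core_current_amps, ohms_per_mm, extra_mm)
fleet_total = per_processor * processor_count
print(f"extra loss per processor: {per_processor:.2f} W")
print(f"extra loss across {processor_count} processors: {fleet_total:.0f} W")
```

Loss scales with the square of current, so low-voltage, high-current core rails are where shortening the path pays off most.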
The Memory Wall: SRAM-Centric Architectures Redefine Inference
Compute performance continues to scale, yet memory bandwidth has not kept pace. Industry benchmarks show compute performance growing roughly 3X every two years, while memory bandwidth has increased by just 1.6X, creating a widening gap that leaves processors waiting for data. In inference workloads, where execution requires frequent access to model weights and intermediate data, this imbalance directly limits throughput.
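How fast the gap widens can be estimated by compounding the two growth rates. The sketch below assumes the 1.6X bandwidth figure applies over the same two-year period as the 3X compute figure, and that both rates hold steady. That reading is an assumption, and the function names are illustrative.

```python
def growth_factor(rate_per_period, years, period_years=2.0):
    """Cumulative growth after `years`, given a multiplier per period."""
    return rate_per_period ** (years / period_years)


def compute_to_bandwidth_gap(years, compute_rate=3.0, bandwidth_rate=1.6, period_years=2.0):
    """Ratio of compute growth to memory bandwidth growth after `years`."""
    compute = growth_factor(compute_rate, years, period_years)
    bandwidth = growth_factor(bandwidth_rate, years, period_years)
    return compute / bandwidth


for years in (2, 4, 6, 8):
    print(f"after {years} years: gap factor {compute_to_bandwidth_gap(years):.2f}")
```

Under these assumptions, the ratio of available compute to available bandwidth nearly doubles every two years, so a processor that was balanced at the start spends a growing share of its time waiting for data.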
Training systems rely on high-bandwidth memory to maintain large, parallel compute arrays. Inference workloads behave differently. They execute sequentially, with lower arithmetic intensity and higher sensitivity to memory access latency. Performance depends less on peak compute throughput and more on efficient data movement.
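The effect of low arithmetic intensity can be illustrated with the standard roofline model, in which attainable throughput is the lesser of peak compute and memory bandwidth multiplied by arithmetic intensity. The hardware figures below are hypothetical. The decode bound assumes every weight byte is read once per generated token, a simplification that ignores batching and caching.

```python
def attainable_flops(peak_flops, bandwidth_bytes_per_s, arithmetic_intensity):
    """Roofline bound: min(peak compute, bandwidth * FLOPs per byte moved)."""
    return min(peak_flops, bandwidth_bytes_per_s * arithmetic_intensity)


def is_memory_bound(peak_flops, bandwidth_bytes_per_s, arithmetic_intensity):
    """True when memory bandwidth, not compute, limits throughput."""
    return bandwidth_bytes_per_s * arithmetic_intensity < peak_flops


def decode_tokens_per_second_bound(bandwidth_bytes_per_s, model_bytes):
    """Upper bound on tokens/s if all weights are streamed once per token."""
    return bandwidth_bytes_per_s / model_bytes


# Hypothetical accelerator: 1e15 FLOP/s peak compute, 3e12 bytes/s memory bandwidth.
peak, bandwidth = 1e15, 3e12

# Assumed intensities: about 1 FLOP/byte for single-stream decode, far higher for training.
for label, intensity in [("decode", 1.0), ("training", 1000.0)]:
    achieved = attainable_flops(peak, bandwidth, intensity)
    print(f"{label}: {achieved:.1e} FLOP/s attainable, "
          f"memory bound = {is_memory_bound(peak, bandwidth, intensity)}")

# Hypothetical model with 140e9 bytes of weights.
print(f"decode bound: {decode_tokens_per_second_bound(bandwidth, 140e9):.1f} tokens/s")
```

In the memory-bound regime, adding compute changes nothing. Only higher effective bandwidth or less data movement raises throughput, which is the motivation for placing memory closer to compute.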
This shift is driving architectures toward SRAM-centric designs that place memory closer to compute. On-chip and near-chip SRAM deliver lower access latency and higher effective bandwidth than off-chip DRAM. Reducing reliance on external memory limits data movement across high-latency, power-intensive interfaces.
Inference accelerators increasingly implement this approach by storing weights and activations locally. This improves response time and increases throughput by minimizing memory access delays.
Some emerging designs extend this model by tightly coupling memory and compute within the same package or across high-bandwidth, low-latency interconnects. These architectures reduce data movement, improve execution predictability, and avoid many inefficiencies associated with traditional memory hierarchies.
Companies like Cerebras and d-Matrix demonstrate significant tokens/watt improvements by implementing these architectures. Recent NVIDIA announcements indicate the same approach will drive their next generation of inference devices.
The Thermal Wall: Heat as a Critical Infrastructure Constraint
As AI rack power density scales from tens of kilowatts to over 100 kW, heat dissipation and removal have become fundamental infrastructure constraints. At projected densities of 600 kW to 1 MW per rack, conventional air cooling can no longer sustain heat flux levels. In response, data center operators are shifting to liquid cooling architectures such as direct liquid cooling and immersion, which support higher rack densities. Castrol, a company once focused on automotive and industrial products, now offers liquid cooling products recognized under the Open Compute Project Foundation (OCP) Inspired program.
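The rack densities cited above can be put in perspective with the steady-state heat balance Q = ṁ·c_p·ΔT, which gives the coolant flow needed to carry away a given heat load. The sketch below uses approximate room-temperature properties for water and air and a hypothetical 10 K coolant temperature rise. Real deployments use a range of coolants and temperature budgets, so the output is an order-of-magnitude comparison only.

```python
def mass_flow_kg_per_s(heat_watts, specific_heat_j_per_kg_k, delta_t_kelvin):
    """Coolant mass flow needed to absorb heat_watts with a delta_t_kelvin rise."""
    return heat_watts / (specific_heat_j_per_kg_k * delta_t_kelvin)


def volume_flow_m3_per_s(heat_watts, specific_heat_j_per_kg_k, density_kg_per_m3, delta_t_kelvin):
    """Same requirement expressed as volumetric flow."""
    return mass_flow_kg_per_s(heat_watts, specific_heat_j_per_kg_k, delta_t_kelvin) / density_kg_per_m3


# Approximate fluid properties near room temperature.
WATER = {"specific_heat": 4186.0, "density": 997.0}  # J/(kg*K), kg/m^3
AIR = {"specific_heat": 1005.0, "density": 1.2}      # J/(kg*K), kg/m^3

rack_watts = 100_000.0  # a 100 kW rack, as cited in the text
delta_t = 10.0          # hypothetical coolant temperature rise in kelvin

for name, fluid in [("water", WATER), ("air", AIR)]:
    flow = volume_flow_m3_per_s(rack_watts, fluid["specific_heat"], fluid["density"], delta_t)
    print(f"{name}: {flow:.4f} m^3/s to remove {rack_watts / 1000:.0f} kW")
```

For the same temperature rise, air requires volumetric flow thousands of times higher than water, which is the practical reason air cooling runs out of headroom as rack density climbs.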
A third class of solid-state cooling devices addresses heat removal at the chip level. Frore Systems’ AirJet, a MEMS-based active cooling chip, uses ultrasonic vibrating membranes to generate high-velocity pulsating air jets across a processor surface, dissipating heat within a 2.8 mm profile while consuming approximately 1 W of power.
At current thermal capacities, these devices target CPU-class and mobile workloads rather than GPU-scale power densities. The category is advancing toward data center applications, where MEMS-based manufacturing expertise could become a key differentiator. These micro-cooling devices can also cool adjacent components near the GPU and CPU, such as optical transceivers and memory devices, now that the system fans previously used for this purpose are being reduced or eliminated.
The Copper Wall: Optical Interconnects Enable AI Scale
While memory and power define local performance, interconnects determine system scale. As AI clusters expand from single racks to multi-rack and building-scale deployments, traditional copper interconnects encounter limits in bandwidth, reach, and signal integrity.
Maintaining performance at higher data rates requires additional power and tighter signal conditioning. These limitations define the copper wall and prevent AI fabrics from scaling beyond conventional electrical interconnects.
Optical links deliver higher bandwidth over longer distances with improved signal integrity and lower latency at scale, enabling disaggregation of compute resources across racks without compromising communication performance. The transition from copper to linear pluggable optics is already underway in scale-up fabrics, with co-packaged optical solutions on the near-term roadmap, reducing power consumption and latency by eliminating power-intensive signal processing stages and shortening electrical paths.
The gains are quantifiable. Google’s Jupiter network, which incorporates MEMS-based optical circuit switching (OCS) and software-defined networking, achieved a 41% reduction in power and a 30% reduction in capital expenditure relative to its prior Clos fabric architecture.
OCS enables dynamic topology reconfiguration by adjusting logical connectivity through software rather than physical rewiring, delivering up to 3× faster reconfiguration compared to patch-panel-based approaches. These principles now drive emerging AI cluster interconnect designs, where software-defined optical fabrics provide per-link telemetry and demand-aware routing at scale.
Breaking the Walls Requires Innovation Across the System
The memory, power, thermal, and copper walls define the performance envelope of inference workloads in AI data centers. SRAM-centric architectures reduce data movement yet require tightly integrated power delivery to support high-density, low-latency compute.
Fast, localized regulation maintains stability under dynamic workloads, and thermal management determines whether power density remains sustainable at the rack level. Optical interconnects enable system-level scaling while increasing demands on memory bandwidth and power efficiency across the fabric.
Improving performance and total cost of ownership (TCO) requires addressing all four bottlenecks together. Their interdependence is driving a shift toward system-level co-design, where accelerator architectures, memory hierarchies, power delivery, thermal management, and interconnects are developed as co-optimized silicon, packaging, and firmware stacks.
This shift is reflected in architectural trends such as deterministic execution models that reduce variability in compute timing, memory-forward designs prioritizing data locality and bandwidth efficiency, and software-defined optical fabrics replacing static topologies with demand-aware routing and per-link telemetry.
Future platforms will combine CPUs, GPUs, and multiple inference accelerators within a single system, with workloads dynamically routed based on query complexity, model structure, and latency requirements. Training-oriented tasks remain on general-purpose or high-throughput processors, while inference-specific accelerators handle targeted workloads.
These architectural shifts extend beyond the data center. The infrastructure built today will form the foundation for future edge deployment, where memory, power, thermal, and interconnect constraints apply under tighter thermal budgets, stricter power envelopes, and without the redundancy of centralized facilities. How effectively the industry addresses these four walls will determine the scale, efficiency, and reach of next-generation AI systems.
Infineon enables AI data center efficiency across the power stack, with a grid-to-core portfolio spanning Si, GaN, and SiC semiconductors, digital multiphase controllers, IBC solutions, and solid-state transformers aligned with the transition to 800 V architectures. Learn more at infineon.com/ai-data-center.