
As high-performance computing (HPC) workloads become increasingly complex, generative artificial intelligence (AI) is being progressively integrated into modern systems, thereby driving the demand for advanced memory solutions. To meet these evolving requirements, the industry is developing next-generation memory architectures that maximize bandwidth, minimize latency, and enhance power efficiency.
Technology advances in DRAM, LPDDR, and specialized memory solutions are redefining computing performance, with AI-optimized memory playing a pivotal role in driving efficiency and scalability. This article examines the latest breakthroughs in memory technology and the growing impact of AI applications on memory designs.
Advanced memory architectures
Memory technology is advancing to meet the stringent performance requirements of AI, AIoT, and 5G systems. The industry is witnessing a paradigm shift with the widespread adoption of DDR5 and HBM3E, offering higher bandwidth and improved energy efficiency.
DDR5, with a per-pin data rate of up to 6.4 Gbps, delivers 51.2 GB/s per module, nearly doubling DDR4’s performance while reducing the supply voltage from 1.2 V to 1.1 V for improved power efficiency. HBM3E extends bandwidth scaling beyond 1.2 TB/s per stack, making it a compelling solution for data-intensive AI training workloads. However, its power requirements make it impractical for mobile and edge deployments.
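As a quick sanity check, peak bandwidth is simply the per-pin data rate multiplied by the interface width. The short Python sketch below reproduces the 51.2-GB/s DDR5 figure; the HBM3E line assumes a 1,024-bit interface at roughly 9.6 Gbps per pin, an illustrative assumption rather than a figure from this article.

```python
# Peak bandwidth (GB/s) = per-pin rate (Gbps) * bus width (bits) / 8 bits per byte
def peak_bandwidth_gbs(gbps_per_pin: float, bus_width_bits: int) -> float:
    return gbps_per_pin * bus_width_bits / 8

print(peak_bandwidth_gbs(6.4, 64))    # DDR5 module, 64-bit bus: 51.2 GB/s
print(peak_bandwidth_gbs(9.6, 1024))  # HBM3E stack (assumed 1024-bit, 9.6 Gbps/pin): ~1.2 TB/s
```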

Figure 1 The above diagram chronicles memory scaling from MCU-based embedded systems to AI accelerators serving high-end applications. Source: Winbond
With LPDDR6 projected to exceed 150 GB/s by 2026, low-power DRAM is evolving toward higher throughput and energy efficiency, addressing the challenges of AI smartphones and embedded AI accelerators. Winbond is developing small-capacity DDR5 and LPDDR4 solutions optimized for power-sensitive applications alongside its CUBE memory platform, which achieves over 1 TB/s of bandwidth with significantly reduced thermal dissipation.
CUBE capacity is anticipated to scale to 8 GB per set or higher; a 4-high wafer-on-wafer (WoW) stack built on a single reticle can reach more than 70 GB of density and 40 TB/s of bandwidth. This positions CUBE as a viable alternative to traditional DRAM architectures for AI-driven edge computing.
In addition, the CUBE sub-series, known as CUBE-Lite, offers bandwidth ranging from 8 to 16 GB/s (equivalent to LPDDR4x x16/x32) while operating at only 30% of the power consumption of LPDDR4x. Without requiring an LPDDR4 PHY, system-on-chips (SoCs) only need to integrate the CUBE-Lite controller to achieve bandwidth comparable to full-speed LPDDR4x. This not only eliminates the high cost of PHY licensing but also allows the use of mature process nodes such as 28 nm or even 40 nm while reaching performance levels comparable to a 12-nm node.
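For context, the LPDDR4x-equivalent figures quoted above can be derived the same way; the sketch below assumes an LPDDR4X-4266 speed grade, which is an assumption for illustration.

```python
# LPDDR4X bandwidth per channel width; 4266 MT/s is an assumed speed grade.
MT_PER_S = 4266  # mega-transfers per second per pin

for width_bits in (16, 32):  # x16 and x32 channel widths
    gb_per_s = MT_PER_S * width_bits / 8 / 1000
    print(f"LPDDR4X x{width_bits}: ~{gb_per_s:.1f} GB/s")
# x16: ~8.5 GB/s, x32: ~17.1 GB/s -- roughly the 8-16 GB/s band CUBE-Lite targets
```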
The CUBE-Lite architecture is particularly suitable for AI SoCs and AI MCUs with integrated NPUs, enabling battery-powered TinyML edge devices. Combined with a Micro Linux operating system and on-device AI model execution, it suits low-power AI image signal processor (ISP) scenarios such as IP cameras, AI glasses, and wearable devices, achieving both system power optimization and chip area reduction. Because such SoCs omit the LPDDR4 PHY and integrate only the CUBE-Lite controller, they also achieve smaller die sizes and improved system power efficiency.

Figure 2 The above diagram chronicles the evolution of memory bandwidth with DRAM power usage. Source: Winbond
Memory bottlenecks in generative AI deployment
The exponential growth of generative AI models has created unprecedented constraints on memory bandwidth and latency. AI workloads, particularly those relying on transformer-based architectures, require extensive computational throughput and high-speed data retrieval.
For instance, deploying Llama 2 7B requires at least 7 GB of DRAM in INT8 mode, or 3.5 GB in INT4 mode, which highlights the limitations of conventional mobile memory capacities. Current AI smartphones using LPDDR5 (68 GB/s of bandwidth) face significant bottlenecks, necessitating a transition to LPDDR6. However, interim solutions are required to bridge the bandwidth gap until LPDDR6 is commercialized.
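Those DRAM figures follow directly from parameter count times bytes per weight, and the same bandwidth numbers cap token throughput, since each generated token must stream the full weight set from memory. The sketch below gives a first-order estimate under those assumptions; it deliberately ignores KV-cache traffic and compute limits.

```python
# First-order memory sizing and bandwidth-bound decode rate for a 7B-parameter model.
PARAMS = 7e9  # Llama 2 7B weight count

def weights_gb(bytes_per_param: float) -> float:
    """DRAM needed just to hold the weights, in GB."""
    return PARAMS * bytes_per_param / 1e9

def max_tokens_per_s(bandwidth_gbs: float, bytes_per_param: float) -> float:
    """Upper bound assuming every decoded token streams the full weight set once."""
    return bandwidth_gbs / weights_gb(bytes_per_param)

print(weights_gb(1.0))            # INT8 weights: 7.0 GB
print(weights_gb(0.5))            # INT4 weights: 3.5 GB
print(max_tokens_per_s(68, 0.5))  # LPDDR5 at 68 GB/s, INT4: ~19 tokens/s ceiling
```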
At the system level, AI edge applications in robotics, autonomous vehicles, and smart sensors impose additional constraints on power efficiency and heat dissipation. While JEDEC standards continue to evolve toward DDR6 and HBM4 to improve bandwidth utilization, custom memory architectures provide scalable, high-performance alternatives that align with AI SoC requirements.
Thermal management and energy efficiency constraints
Deploying large-scale AI models on end devices introduces significant thermal management and energy efficiency challenges. AI-driven workloads inherently consume substantial power, generating excessive heat that can degrade system stability and performance.
- On-device memory expansion: Mobile devices must integrate higher-capacity memory solutions to minimize reliance on cloud-based AI processing and reduce latency. Traditional DRAM scaling is approaching physical limits, necessitating hybrid architectures integrating high-bandwidth and low-power memory.
- HBM3E vs CUBE for AI SoCs: While HBM3E achieves high throughput, its power requirements exceed 30 W per stack, making it unsuitable for mobile and edge applications; a rough energy-per-bit comparison appears after this list. Here, memory solutions like CUBE can serve as an alternative last-level cache (LLC), reducing on-chip SRAM dependency while maintaining high-speed data access. The shift toward sub-7-nm logic processes exacerbates SRAM scaling limitations, emphasizing the need for new cache solutions.
- Thermal optimization strategies: As AI processing generates heat loads exceeding 15 W per chip, effective power distribution and dissipation mechanisms are critical. Custom DRAM solutions that optimize refresh cycles and employ TSV-based packaging techniques contribute to power-efficient AI execution in compact form factors.
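One rough way to compare these options is energy per bit moved: interface power is approximately bandwidth times picojoules per bit. The pJ/bit values in the sketch below are illustrative assumptions, not Winbond or JEDEC figures; they simply show why an HBM-class interface strains an edge thermal budget while a low-power TSV stack fits within it.

```python
# Interface power (W) ~= bits moved per second * energy per bit.
# Both pJ/bit values below are assumptions for illustration only.
def interface_power_w(bandwidth_gbs: float, pj_per_bit: float) -> float:
    bits_per_s = bandwidth_gbs * 1e9 * 8
    return bits_per_s * pj_per_bit * 1e-12

print(interface_power_w(1200, 3.9))  # HBM3E-class, ~3.9 pJ/bit (assumed): ~37 W per stack
print(interface_power_w(1000, 1.0))  # CUBE-class TSV stack, ~1 pJ/bit (assumed): ~8 W
```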
DDR5 and DDR6: Accelerating AI compute performance
The evolution of DDR5 and DDR6 represents a significant inflection point in AI system architecture, delivering enhanced memory bandwidth, lower latency, and greater scalability.
DDR5, with 8-bank group architecture and on-die error correction code (ECC), provides superior data integrity and efficiency, making it well-suited for AI-enhanced PCs and high-performance laptops. With an effective peak transfer rate of 51.2 GB/s per module, DDR5 enables real-time AI inference, seamless multitasking, and high-speed data processing.
DDR6, still in development, is expected to deliver bandwidth exceeding 200 GB/s per module and a 20% reduction in power consumption, along with optimized AI accelerator support, further pushing AI compute capabilities to new limits.

Figure 3 CUBE, an AI-optimized memory solution, leverages through-silicon via (TSV) interconnects to integrate high-bandwidth memory characteristics with a low-power profile. Source: Winbond
The convergence of AI-driven workloads, performance scaling constraints, and the need for power-efficient memory solutions is shaping the transformation of the memory market. Generative AI continues to accelerate the demand for low-latency, high-bandwidth memory architectures, leading to innovation across DRAM and custom memory solutions.
As AI models grow more complex, optimized, power-efficient memory architectures will become increasingly critical. Here, technological innovation will ensure the commercial realization of cutting-edge AI memory solutions, bridging the gap between high-performance computing and sustainable, scalable memory devices.
Jacky Tseng is deputy director of CMS CUBE product line at Winbond. Prior to joining Winbond in 2011, he served as a senior engineer at Hon-Hai.