Last-level cache has become a critical SoC design element

As AI workloads extend across nearly every technology sector, systems must move more data, use memory more efficiently, and respond more predictably than traditional design methodologies allow. These pressures are exposing limitations in conventional system-on-chip (SoC) architectures as compute becomes increasingly heterogeneous and traffic patterns become more complex.

Modern SoCs integrate CPUs, GPUs, NPUs, and specialized accelerators that must operate concurrently, placing unprecedented strain on memory hierarchies and interconnects. Keeping processing units fully utilized requires high-bandwidth, low-latency access to data, making the memory hierarchy as critical to overall system effectiveness as raw performance.

On-chip interconnects move data quickly and predictably, but once requests reach external memory, latency increases, and timing becomes less consistent. As more data accesses go off chip, the gap between compute throughput and data availability widens. In these conditions, processing engines stall while waiting for memory transactions to complete, creating data starvation.

 

The role of last-level cache

To mitigate this imbalance, SoC designers are increasingly turning to last-level cache (LLC). Positioned between external memory and internal subsystems, LLC stores frequently accessed data close to compute resources, allowing requests to be served with significantly lower latency.

Unlike static buffers, an LLC dynamically fetches and evicts cache lines based on runtime behavior without direct CPU intervention. When deployed effectively, this architectural layer delivers measurable benefits, including substantial reductions in external memory traffic and power consumption.
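The fetch-and-evict behavior described above can be sketched with a toy model. The class below is a deliberately simplified, fully associative cache with least-recently-used (LRU) replacement; real LLCs are set-associative and hardware-managed, and the class name, line size, and LRU policy here are illustrative assumptions, not a description of any particular product.

```python
from collections import OrderedDict

class ToyLLC:
    """Toy fully-associative cache model with LRU replacement.
    Illustrative only: real last-level caches are set-associative."""

    def __init__(self, num_lines, line_size=64):
        self.num_lines = num_lines
        self.line_size = line_size
        self.lines = OrderedDict()  # tag -> None, ordered oldest-first
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        """Return True on a hit, False on a miss (which fills the line)."""
        tag = addr // self.line_size
        if tag in self.lines:
            self.lines.move_to_end(tag)  # refresh recency on a hit
            self.hits += 1
            return True
        self.misses += 1
        if len(self.lines) >= self.num_lines:
            self.lines.popitem(last=False)  # evict least recently used
        self.lines[tag] = None  # fill the missing line
        return False
```

Feeding an address stream through `access()` shows how reuse within a 64-byte line turns repeated requests into hits, while capacity pressure forces evictions without any CPU involvement.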

Simply including an LLC does not guarantee improved performance. Configuring the cache correctly is a complex task that must account for workload characteristics, compute-unit behavior, and real-time constraints. Poorly chosen parameters can waste area without meaningful gains, while under-provisioned configurations may fail to alleviate memory bottlenecks.

Architects must carefully determine cache capacity, the number of cache instances, and internal banking structures to support sufficient parallelism. Partitioning strategies must also be defined to ensure that individual IP blocks receive the bandwidth and predictability they require. While some settings can be adjusted later through software, foundational decisions on cache size, banking, and associativity must be finalized early in the development cycle.
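These foundational parameters are not independent: capacity, associativity, and line size together fix the set count and the address-bit layout. A minimal sketch of that arithmetic, assuming a power-of-two line size and set count as is conventional in hardware caches:

```python
def cache_geometry(capacity_bytes, associativity, line_size=64):
    """Derive set count and address-bit widths for a set-associative cache.
    Assumes line_size is a power of two; rejects geometries that do not
    divide into a power-of-two number of sets."""
    lines = capacity_bytes // line_size
    sets = lines // associativity
    if sets * associativity * line_size != capacity_bytes or sets & (sets - 1):
        raise ValueError("capacity must equal associativity * line_size * power-of-two sets")
    return {
        "sets": sets,
        "offset_bits": line_size.bit_length() - 1,  # log2(line_size)
        "index_bits": sets.bit_length() - 1,        # log2(sets)
    }
```

For example, a 4 MB, 16-way cache with 64-byte lines resolves to 4,096 sets, a 6-bit line offset, and a 12-bit set index, which is exactly why size and associativity cannot be deferred to software: they determine the physical address decode.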

Figure: The role of last-level cache in successful designs. Source: Arteris

Factors influencing cache behavior

Banking configuration illustrates this trade-off clearly. Increasing the number of cache banks improves internal parallelism and throughput, but it also increases silicon area. Workloads with largely sequential access patterns may see limited benefit from aggressive banking.

In contrast, highly parallel workloads, especially those driven by AI accelerators or GPUs, require substantial internal concurrency to maintain utilization. Because these characteristics vary by application, banking decisions must be informed by realistic workload analysis during the architectural phase.
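The sensitivity of banking to access patterns can be made concrete with a small helper. Assuming a common interleaving scheme in which consecutive cache lines map to consecutive banks (an illustrative choice, not a fixed rule), the function below counts how many banks a burst of addresses actually touches:

```python
def banks_touched(addrs, num_banks, line_size=64):
    """Count distinct banks hit by a list of byte addresses, assuming
    consecutive cache lines are interleaved across consecutive banks."""
    return len({(addr // line_size) % num_banks for addr in addrs})
```

A unit-stride stream spreads naturally across all banks, while a pathological stride equal to `num_banks * line_size` serializes every access onto a single bank, so adding banks buys nothing for that workload. This is the kind of behavior that only realistic workload analysis exposes before the banking decision is locked in.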

Cache capacity is just as important. A cache that is too small struggles to achieve acceptable hit rates, pushing excessive traffic to external memory. Conversely, oversizing the cache often yields diminishing returns relative to the additional area consumed. The optimal balance depends on actual runtime behavior rather than theoretical assumptions.

In practice, acceptable hit rates vary widely. Some systems can tolerate moderate miss rates if latency and power reductions outweigh the cost, while real-time applications demand consistently high hit rates to maintain deterministic behavior.
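The trade-off between hit rate and miss cost is captured by the classic first-order average memory access time (AMAT) model. The latency figures below are illustrative placeholders, not measurements from any specific system:

```python
def amat(hit_time_ns, miss_rate, miss_penalty_ns):
    """Average memory access time, first-order model:
    AMAT = hit_time + miss_rate * miss_penalty."""
    return hit_time_ns + miss_rate * miss_penalty_ns
```

With an assumed 10 ns hit and 150 ns DRAM penalty, a 95% hit rate yields 17.5 ns on average, while an 80% hit rate yields 40 ns, which shows why a moderate miss rate may be acceptable in one system and disqualifying in a latency-critical one.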

This variability underscores why no single LLC configuration is universally optimal. Mobile devices may require only a few megabytes of cache to balance power efficiency and responsiveness, while servers and HPC platforms often deploy tens or hundreds of megabytes to reduce DRAM pressure. Despite these differences, successful designs rely on a common principle: cache parameters are derived from the workloads the system will actually execute.

Managing shared caches

Diversity in system demands further complicates how an LLC must be structured. Automotive chips built around concurrent vision processing and strict timing requirements operate under very different constraints than data-center platforms optimized for accelerator-heavy inference at scale. Even within a single chip, CPUs, accelerators, and I/O subsystems generate distinct access patterns with different latency sensitivities.

The LLC must accommodate all of them without allowing one workload to interfere with another’s real-time guarantees. This makes early understanding of system-level access behavior essential, since cache configuration otherwise becomes speculative at best.

Partitioning provides a powerful mechanism for preserving determinism in such environments. By allocating portions of cache capacity to specific clients, architects can prevent high-bandwidth workloads from starving latency-sensitive subsystems. This capability is particularly critical in environments that must meet strict timing guarantees. Partition sizes must be tuned carefully, as oversizing wastes area while undersizing risks violating latency requirements.
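One common partitioning scheme reserves a fixed number of ways per client, with the remainder left as a shared pool. The sketch below is a minimal allocation check under that assumed scheme; the client names and the way-based granularity are illustrative, and real controllers may partition by sets or capacity instead:

```python
def partition_ways(total_ways, demands):
    """Reserve cache ways per client from a shared set-associative cache.
    demands: {client_name: minimum guaranteed ways}. Leftover ways form a
    shared pool; raises if the guarantees exceed the associativity."""
    reserved = sum(demands.values())
    if reserved > total_ways:
        raise ValueError("partition demands exceed cache associativity")
    allocation = dict(demands)
    allocation["shared"] = total_ways - reserved
    return allocation
```

For a 16-way cache, guaranteeing 8 ways to an NPU and 4 to the CPU cluster leaves a 4-way shared pool, and the hard failure on over-subscription mirrors the real constraint that latency guarantees cannot be promised out of capacity that does not exist.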

Configuring a last-level cache is ultimately a multidimensional challenge shaped by workload demands, compute topology, latency requirements, and silicon constraints. Achieving the right balance between performance, determinism, power, and area depends on understanding how an SoC behaves under real operating conditions.

To address this, SoC teams increasingly rely on system-level simulation using realistic data flow profiles generated by multiple on-chip request sources. This approach allows teams to evaluate cache behavior before key architectural decisions are finalized. It helps identify bottlenecks, validate cache sizing, and determine when isolation mechanisms such as partitioning are required to preserve real-time guarantees.
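A trace-driven model gives a flavor of what such simulation reveals. The self-contained sketch below interleaves per-client address streams round-robin through one shared LRU cache and reports per-client hit rates; it is a toy stand-in for real system-level simulation, and the round-robin arbitration and LRU policy are simplifying assumptions.

```python
from collections import OrderedDict

def simulate(traces, num_lines, line_size=64):
    """traces: {client: list of byte addresses}. Interleaves the streams
    round-robin through one shared fully-associative LRU cache and
    returns {client: hit_rate}. Toy model for illustration only."""
    cache = OrderedDict()                      # tag -> None, oldest-first
    stats = {c: [0, 0] for c in traces}        # client -> [hits, accesses]
    iters = {c: iter(t) for c, t in traces.items()}
    while iters:
        for client in list(iters):
            try:
                addr = next(iters[client])
            except StopIteration:
                del iters[client]              # this client's trace is done
                continue
            tag = addr // line_size
            stats[client][1] += 1
            if tag in cache:
                cache.move_to_end(tag)         # LRU refresh on hit
                stats[client][0] += 1
            else:
                if len(cache) >= num_lines:
                    cache.popitem(last=False)  # evict least recently used
                cache[tag] = None
    return {c: hits / total for c, (hits, total) in stats.items()}
```

Even this crude model surfaces the interference the article describes: a streaming client with no reuse achieves a near-zero hit rate while steadily evicting lines that a reuse-heavy client depends on, which is precisely the signal that motivates partitioning.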

Arteris developed its CodaCache IP, a configurable last-level cache that sits between on-chip initiators and external memories of different types, such as DDR DRAM, HBM, and even NVM for execution-in-place (EIP) use cases. With CodaCache, architects can equip their SoC fabric with the optimal configuration for intelligent, scalable, and automated data management across a wide range of applications.

Andre Bonnardot is product marketing manager at Arteris.

Related Content

  • Understanding cache placement
  • Optimizing for instruction caches
  • How to Turbo Charge Your SoC’s CPU(s)
  • Bringing SOT-MRAM Tech Closer to Cache Memory
  • SoC design: When a network-on-chip meets cache coherency

The post Last-level cache has become a critical SoC design element appeared first on EDN.
