How to Design GPU Chips

Chapter 4: GPU Memory System Design

Graphics Processing Units (GPUs) have evolved into highly parallel, programmable accelerators capable of achieving high performance and energy efficiency on a wide range of applications. The memory system is a critical component of modern GPU architectures, as it must supply the massive number of concurrent threads with fast access to data. In this chapter, we will explore the key elements of GPU memory system design, including the first-level memory structures, on-chip interconnection network, memory partition units, and research directions for future GPU memory systems.

First-Level Memory Structures

The first-level memory structures in a GPU are responsible for providing fast access to frequently used data and reducing the number of accesses to the lower levels of the memory hierarchy. These structures typically include scratchpad memory, L1 data cache, and L1 texture cache.

Scratchpad Memory and L1 Data Cache

Scratchpad memory, also known as shared memory in NVIDIA's CUDA programming model or local memory in OpenCL, is a low-latency, software-managed memory space shared by all threads within a cooperative thread array (CTA) or workgroup. Scratchpad memory is typically implemented using a banked SRAM structure to enable parallel access by multiple threads.
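In CUDA, scratchpad memory is exposed through the __shared__ qualifier. The following minimal sketch (kernel and variable names are illustrative; it assumes a power-of-two block size of 256 threads) stages data in shared memory and reduces it cooperatively within one CTA:

    // Launch with 256 threads per block (power-of-two block size assumed).
    // Each block stages its inputs in scratchpad ("shared") memory and
    // reduces them cooperatively; names are illustrative.
    __global__ void blockSumKernel(const float* in, float* blockSums, int n) {
        __shared__ float tile[256];              // scratchpad, one slot per thread

        int tid = threadIdx.x;
        int gid = blockIdx.x * blockDim.x + tid;
        tile[tid] = (gid < n) ? in[gid] : 0.0f;  // stage global data on chip
        __syncthreads();                         // whole CTA sees the tile

        // Tree reduction performed entirely within the scratchpad.
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (tid < stride) tile[tid] += tile[tid + stride];
            __syncthreads();
        }
        if (tid == 0) blockSums[blockIdx.x] = tile[0];
    }

Because the tile lives in the on-chip scratchpad, the tree reduction never leaves the core, and __syncthreads() provides the CTA-wide ordering that the hardware does not guarantee implicitly.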

Figure 4.1 illustrates a unified L1 data cache and scratchpad memory organization, similar to the design used in NVIDIA's Fermi and Kepler architectures [Minkin et al., 2012].

                                    Address Crossbar
                                          |
                                          v
                        Data Array (Configurable as Scratchpad or Cache)
                                          |
                                          v  
                                    Data Crossbar
                                          |
                                          v
                                 Load/Store Unit

Figure 4.1: Unified L1 data cache and scratchpad memory organization.

The key components of this design are:

  1. Data Array: A highly banked SRAM structure that can be configured as either scratchpad memory or L1 data cache. Each bank is 32 bits wide and has its own decoder, allowing banks to be accessed independently.

  2. Address Crossbar: Distributes memory addresses from the load/store unit to the appropriate banks in the data array.

  3. Data Crossbar: Routes data from the banks to the load/store unit, which then writes the data to the register file.

  4. Load/Store Unit: Computes per-thread memory addresses, applies coalescing rules, and splits each warp's memory request into one or more coalesced memory transactions.

Scratchpad memory accesses bypass the tag lookup stage, since the scratchpad is directly addressed by software rather than tagged. When a bank conflict occurs, the access is split over multiple cycles: the conflict-free portion of the warp's request is serviced first, and the conflicting portion is replayed in subsequent cycles.
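Software can often avoid bank conflicts by adjusting the shared-memory layout. The classic transpose sketch below (assuming a 32x32 thread block and matrix dimensions that are multiples of 32; names are illustrative) pads each row of the shared tile by one word so that column-wise reads no longer map to a single bank:

    #define TILE 32

    // Launch with a 32x32 thread block; width is assumed to be a multiple of 32.
    // Without the "+ 1" padding, the column-wise read below would make all 32
    // threads of a warp access the same bank (stride of 32 words).
    __global__ void transposeTile(const float* in, float* out, int width) {
        __shared__ float tile[TILE][TILE + 1];   // one padding word per row

        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // coalesced load
        __syncthreads();

        x = blockIdx.y * TILE + threadIdx.x;                  // transposed block
        y = blockIdx.x * TILE + threadIdx.y;
        out[y * width + x] = tile[threadIdx.x][threadIdx.y];  // conflict-free read
    }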

The L1 data cache is used to store a subset of the global memory address space. Accesses to the L1 data cache involve a tag lookup to determine if the requested data is present. The cache block size is typically 128 bytes, which can be further divided into 32-byte sectors to match the minimum data size that can be read from graphics DRAM (e.g., GDDR5) in a single access.
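To make the coalescing and sectoring rules concrete, the host-side sketch below (the function name and the unit-stride example are illustrative) counts how many 128-byte lines and 32-byte sectors the 32 addresses issued by one warp touch:

    #include <cstdint>
    #include <cstdio>
    #include <set>

    // Host-side sketch: given the 32 byte-addresses issued by a warp, count how
    // many 128-byte cache lines and 32-byte sectors the coalescer must fetch.
    // Line and sector sizes match the illustrative values used in the text.
    void countTransactions(const uint64_t addr[32]) {
        std::set<uint64_t> lines, sectors;
        for (int i = 0; i < 32; ++i) {
            lines.insert(addr[i] / 128);     // 128-byte cache line index
            sectors.insert(addr[i] / 32);    // 32-byte sector index
        }
        std::printf("lines touched: %zu, sectors touched: %zu\n",
                    lines.size(), sectors.size());
    }

    int main() {
        uint64_t addr[32];
        for (int lane = 0; lane < 32; ++lane)
            addr[lane] = 0x1000 + 4 * lane;  // unit-stride float accesses
        countTransactions(addr);             // prints: 1 line, 4 sectors
        return 0;
    }

For the unit-stride float accesses in main, the warp's 128 bytes fall in a single cache line and exactly four 32-byte sectors; a strided or scattered pattern would touch more of each.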

L1 Texture Cache

Texture memory is a read-only memory space optimized for spatial locality and is commonly used in graphics workloads. The L1 texture cache is designed to exploit the 2D spatial locality present in texture accesses.

Figure 4.2 shows a typical L1 texture cache organization.

                          Texture Coordinates
                                  |
                                  v
                            Address Mapping
                                  |
                                  v
                             Tag Array
                                  |
                                  v
                             Data Array
                                  |
                                  v
                           Texture Filtering
                                  |
                                  v
                           Filtered Texels

Figure 4.2: L1 texture cache organization.

The main components of the L1 texture cache are:

  1. Address Mapping: Converts texture coordinates into cache addresses.

  2. Tag Array: Stores the tags for each cache line to determine if the requested data is present.

  3. Data Array: Stores the actual texture data.

  4. Texture Filtering: Performs interpolation and filtering operations on the fetched texture data to generate the final filtered texels.

The L1 texture cache typically employs a tile-based organization to exploit 2D spatial locality: texture data is stored in small tiles (e.g., 4x4 or 8x8 texels) laid out contiguously, so that texels adjacent in two dimensions fall into the same cache line and a typical filtered fetch touches only a few lines.
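In CUDA, this path is reached through texture objects, which also expose the hardware filtering stage shown in Figure 4.2. A minimal sketch, assuming a pitch-linear float image and omitting error checking (names are illustrative):

    #include <cuda_runtime.h>

    // Illustrative kernel: fetch through the texture path with bilinear filtering.
    __global__ void sampleKernel(cudaTextureObject_t tex, float* out,
                                 int outW, int outH) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= outW || y >= outH) return;
        // Unnormalized coordinates; +0.5f samples the texel center.
        out[y * outW + x] = tex2D<float>(tex, x + 0.5f, y + 0.5f);
    }

    // Host-side setup for a pitch-linear float texture (error checks omitted).
    cudaTextureObject_t makeTexture(float* devPtr, int w, int h, size_t pitch) {
        cudaResourceDesc res = {};
        res.resType                  = cudaResourceTypePitch2D;
        res.res.pitch2D.devPtr       = devPtr;
        res.res.pitch2D.width        = w;
        res.res.pitch2D.height       = h;
        res.res.pitch2D.pitchInBytes = pitch;
        res.res.pitch2D.desc         = cudaCreateChannelDesc<float>();

        cudaTextureDesc td  = {};
        td.addressMode[0]   = cudaAddressModeClamp;
        td.addressMode[1]   = cudaAddressModeClamp;
        td.filterMode       = cudaFilterModeLinear;   // hardware bilinear filtering
        td.readMode         = cudaReadModeElementType;
        td.normalizedCoords = 0;

        cudaTextureObject_t tex = 0;
        cudaCreateTextureObject(&tex, &res, &td, nullptr);
        return tex;
    }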

Unified Texture and Data Cache

Recent GPU architectures, such as NVIDIA's Maxwell and Pascal, have introduced a unified texture and data cache to improve cache utilization and reduce the overall cache footprint [Heinrich et al., 2017]. In this design, the L1 data cache and L1 texture cache are combined into a single physical cache, with the ability to dynamically allocate capacity between the two based on the workload's requirements.

Figure 4.3 illustrates a unified texture and data cache organization.

                                Memory Requests
                                       |
                                       v
                                  Cache Controller
                                 /             \
                                /               \
                               /                 \
                              v                   v
                      Data Cache Partition   Texture Cache Partition
                              |                   |
                              |                   |
                              v                   v
                           Data Array         Texture Data Array

Figure 4.3: Unified texture and data cache organization.

The main components of the unified cache design are:

  1. Cache Controller: Receives memory requests and determines whether they should be serviced by the data cache partition or the texture cache partition.

  2. Data Cache Partition: Handles accesses to the global memory space, similar to the standalone L1 data cache.

  3. Texture Cache Partition: Handles texture memory accesses, similar to the standalone L1 texture cache.

  4. Data Array: A shared data array that stores both global memory data and texture data.

The unified cache design allows for better utilization of the available cache capacity, as the partition sizes can be adjusted based on the workload's access patterns. This flexibility can lead to improved performance and energy efficiency compared to fixed-size, separate L1 caches.
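From CUDA, read-only global loads can be steered toward this texture/read-only path on such architectures, for example with const __restrict__ pointers or the __ldg() intrinsic (available on compute capability 3.5 and later). A small sketch with illustrative names:

    // Read-only loads of 'coeff' can be serviced through the texture/read-only
    // path on GPUs with a unified texture and L1 data cache (sm_35 and later).
    __global__ void scaleKernel(const float* __restrict__ coeff,
                                float* __restrict__ data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float c = __ldg(&coeff[i]);   // explicit read-only (texture-path) load
            data[i] *= c;
        }
    }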

On-Chip Interconnection Network

The on-chip interconnection network is responsible for connecting the GPU cores (also called streaming multiprocessors or compute units) to the memory partition units. The interconnect must provide high bandwidth and low latency to support the massive parallelism in GPU workloads.

Modern GPUs typically employ a crossbar or a mesh topology for the on-chip interconnect. A crossbar provides full connectivity between all cores and memory partitions, enabling high-bandwidth communication at the cost of increased area and power consumption. A mesh topology, on the other hand, offers a more scalable solution by connecting each core to its neighboring cores and memory partitions, forming a grid-like structure.

Figure 4.4 shows an example of a mesh interconnect in a GPU.

        Core --- Core --- Core --- Core
          |        |        |        |
        Core --- Core --- Core --- Core
          |        |        |        |
        Core --- Core --- Core --- Core
          |        |        |        |
        Mem      Mem      Mem      Mem
        Part.    Part.    Part.    Part.

Figure 4.4: Mesh interconnect in a GPU.

The mesh interconnect allows for efficient data transfer between cores and memory partitions while minimizing the area and power overhead. Advanced routing algorithms and flow control mechanisms are employed to ensure high performance and avoid congestion.
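A common choice on such meshes is deterministic dimension-order (XY) routing, which is simple and deadlock-free. The host-side sketch below (coordinate and port names are illustrative, not any vendor's implementation) shows the per-hop decision:

    enum class Port { Local, East, West, North, South };

    // Dimension-order (XY) routing on a 2D mesh: route in X first, then Y.
    // Deterministic and deadlock-free on a mesh, at the cost of no path diversity.
    Port routeXY(int curX, int curY, int dstX, int dstY) {
        if (dstX > curX) return Port::East;
        if (dstX < curX) return Port::West;
        if (dstY > curY) return Port::North;
        if (dstY < curY) return Port::South;
        return Port::Local;   // arrived: eject to the attached core or partition
    }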

Memory Partition Unit

The memory partition unit is responsible for handling memory requests from the GPU cores and managing the off-chip DRAM. Each memory partition typically includes an L2 cache, atomic operation support, and a memory access scheduler.

L2 Cache

The L2 cache is a shared cache that sits between the GPU cores and the off-chip DRAM. Its primary purpose is to reduce the number of accesses to the high-latency, energy-intensive DRAM by caching frequently accessed data.

GPU L2 caches are typically set-associative, write-back caches with a capacity of several megabytes and high bandwidth. The L2 is divided into slices, one per memory partition, with each slice caching the addresses served by that partition's DRAM channel; this enables parallel access and improves throughput.

Figure 4.5 illustrates the organization of an L2 cache in a GPU memory partition.

                            Memory Requests
                                   |
                                   v
                             L2 Cache Controller
                                   |
                                   v
                              Tag Array
                                   |
                                   v
                              Data Array
                                   |
                                   v
                             Memory Scheduler
                                   |
                                   v
                                 DRAM

Figure 4.5: L2 cache organization in a GPU memory partition.

The L2 cache controller receives memory requests from the GPU cores and checks the tag array to determine if the requested data is present in the cache. On a cache hit, the data is retrieved from the data array and sent back to the requesting core. On a cache miss, the request is forwarded to the memory scheduler, which then fetches the data from the DRAM.
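The sketch below illustrates this lookup for one L2 slice with assumed parameters (2 MB, 16-way set-associative, 128-byte lines); real slice sizes, address hashing, and valid/dirty state are omitted:

    #include <cstdint>

    // Illustrative address decomposition for one L2 slice: 2 MB, 16-way
    // set-associative, 128-byte lines => 2 MB / (128 B * 16) = 1024 sets.
    constexpr uint64_t kLineBytes = 128;
    constexpr uint64_t kWays      = 16;
    constexpr uint64_t kSets      = (2ull << 20) / (kLineBytes * kWays);

    struct L2Index { uint64_t tag; uint64_t set; uint64_t offset; };

    L2Index decode(uint64_t addr) {
        L2Index idx;
        idx.offset = addr % kLineBytes;
        idx.set    = (addr / kLineBytes) % kSets;
        idx.tag    = addr / (kLineBytes * kSets);
        return idx;
    }

    // Hit check against one set's tag array (valid bits omitted for brevity).
    bool isHit(const uint64_t tags[kWays], uint64_t tag) {
        for (uint64_t w = 0; w < kWays; ++w)
            if (tags[w] == tag) return true;
        return false;
    }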

Atomic Operations

Atomic operations are essential for synchronization and communication between threads in parallel workloads. GPUs support a variety of atomic operations, such as atomic add, min, max, and compare-and-swap, which guarantee atomicity when multiple threads access the same memory location simultaneously.

Atomic operations are typically implemented in the memory partition units to ensure low-latency and high-throughput execution. Dedicated hardware units, such as atomic operation units (AOUs), are employed to handle atomic requests efficiently.

Figure 4.6 shows an example of an atomic operation unit in a GPU memory partition.

                            Atomic Requests
                                   |
                                   v
                          Atomic Operation Unit
                                   |
                                   v
                            L2 Cache/DRAM

Figure 4.6: Atomic operation unit in a GPU memory partition.

The AOU receives atomic requests from the GPU cores and performs the requested operation on the target memory location. If the memory location is present in the L2 cache, the AOU directly updates the cache data. If the memory location is not cached, the AOU fetches the data from the DRAM, performs the atomic operation, and then writes the result back to the DRAM.
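From the programmer's perspective, these units are reached through atomic intrinsics. The sketch below (kernel and buffer names are illustrative) builds a 256-bin histogram with a privatized shared-memory copy per CTA, so that most atomics stay on-chip and only one atomicAdd per bin per CTA reaches the memory partition:

    #define BINS 256

    // Histogram with shared-memory privatization: per-CTA atomics hit the
    // on-chip scratchpad; only the final merge uses the L2/DRAM-side atomic units.
    __global__ void histKernel(const unsigned char* data, int n,
                               unsigned int* globalHist) {
        __shared__ unsigned int localHist[BINS];
        for (int b = threadIdx.x; b < BINS; b += blockDim.x) localHist[b] = 0;
        __syncthreads();

        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += gridDim.x * blockDim.x)
            atomicAdd(&localHist[data[i]], 1u);          // on-chip atomic
        __syncthreads();

        for (int b = threadIdx.x; b < BINS; b += blockDim.x)
            atomicAdd(&globalHist[b], localHist[b]);     // memory-partition atomic
    }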

Memory Access Scheduler

The memory access scheduler is responsible for managing the flow of memory requests to the off-chip DRAM. Its primary goal is to maximize DRAM bandwidth utilization while minimizing the latency of memory accesses.

GPU memory schedulers employ various scheduling algorithms and optimizations to achieve high performance. Some common techniques include:

  1. Out-of-order scheduling: Reordering memory requests to maximize row-buffer hits and minimize DRAM precharge and activation overhead (a simple row-hit-first scheduler is sketched after this list).

  2. Bank-level parallelism: Exploiting the parallelism available across multiple DRAM banks to enable concurrent access to different memory regions.

  3. Write-to-read turnaround optimization: Minimizing the latency penalty incurred when switching between write and read operations in DRAM.

  4. Address interleaving: Distributing memory accesses across different channels, ranks, and banks to maximize parallelism and avoid contention.
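The first technique is commonly realized as a first-ready, first-come-first-served (FR-FCFS) policy: pending requests that hit an open row are issued before row misses, with ties broken by age. A host-side sketch under simplified assumptions (a single queue per channel, one open row per bank, illustrative field names):

    #include <cstdint>
    #include <vector>

    struct Request { uint64_t row; int bank; uint64_t arrivalTime; };

    // FR-FCFS-style pick: prefer the oldest request that hits an open row;
    // if no request hits, fall back to the oldest request overall.
    int pickNext(const std::vector<Request>& queue,
                 const std::vector<uint64_t>& openRow) {
        int best = -1, oldest = -1;
        for (int i = 0; i < (int)queue.size(); ++i) {
            const Request& r = queue[i];
            if (oldest < 0 || r.arrivalTime < queue[oldest].arrivalTime) oldest = i;
            bool rowHit = (openRow[r.bank] == r.row);
            if (rowHit && (best < 0 || r.arrivalTime < queue[best].arrivalTime))
                best = i;
        }
        return (best >= 0) ? best : oldest;   // -1 only if the queue is empty
    }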

Figure 4.7 illustrates a high-level view of a memory access scheduler in a GPU memory partition.

                            Memory Requests
                                   |
                                   v
                           Memory Scheduler
                                   |
                                   v
                    Channel   Channel   Channel   Channel
                      |         |         |         |
                      v         v         v         v
                    Rank      Rank      Rank      Rank
                      |         |         |         |
                      v         v         v         v  
                    Bank      Bank      Bank      Bank

Figure 4.7: Memory access scheduler in a GPU memory partition.

The memory scheduler receives memory requests from the L2 cache and the atomic operation units and decides when and in what order to issue these requests to the DRAM. By carefully scheduling memory accesses, the scheduler can significantly improve DRAM bandwidth utilization and reduce the average memory access latency.

Research Directions for GPU Memory Systems

As GPU architectures continue to evolve and the demands of parallel workloads grow, there are several research directions aimed at improving the performance and efficiency of GPU memory systems. Some of the key research areas include:

Memory Access Scheduling and Interconnection Network Design

As the number of cores and memory partitions in GPUs continues to increase, the design of the memory access scheduler and interconnection network becomes crucial for achieving high performance. Research in this area focuses on developing novel scheduling algorithms and interconnect topologies that can efficiently handle the massive parallelism and complex memory access patterns of GPU workloads.

For example, Jia et al. [2012] propose a memory scheduling algorithm called "Staged Memory Scheduling" (SMS) that aims to improve DRAM bank-level parallelism and reduce memory access latency. SMS divides the memory request queue into two stages: batch formation and batch scheduling. In the batch formation stage, requests are grouped into batches based on their bank and row addresses to exploit row locality. In the batch scheduling stage, batches are prioritized based on their age and criticality to ensure fairness and reduce stalls.

Another example is the work by Kim et al. [2012], which proposes a high-bandwidth memory (HBM) architecture for GPUs. HBM stacks multiple DRAM dies on top of each other and connects them using through-silicon vias (TSVs), enabling much higher bandwidth and lower latency compared to traditional GDDR memories. The authors also propose a novel memory controller design that can efficiently manage the increased parallelism and complexity of HBM.

Caching Effectiveness

GPUs employ a variety of caching mechanisms to reduce the number of off-chip memory accesses and improve performance. However, the effectiveness of these caches can vary significantly depending on the characteristics of the workload and the cache design.

Research in this area aims to improve the effectiveness of GPU caches through techniques such as cache bypassing, cache compression, and adaptive cache management.

For instance, Huangfu and Xie [2016] propose a dynamic cache bypassing scheme for GPUs that uses a simple yet effective heuristic to determine whether a memory request should be cached or bypassed based on its reuse distance. The scheme adapts to the runtime behavior of the application and can significantly reduce cache pollution and improve performance.

Another example is the work by Vijaykumar et al. [2015], which proposes a compressed cache architecture for GPUs. The authors observe that many GPU applications exhibit significant data redundancy, which can be exploited to increase the effective capacity of the caches. They propose a novel compression scheme that can achieve high compression ratios while incurring minimal latency overhead.

Memory Request Prioritization and Cache Bypassing

In GPUs, memory requests from different warps and threads can have varying levels of criticality and impact on overall performance. Prioritizing critical requests and bypassing non-critical ones can help reduce memory latency and improve resource utilization.

Research in this area explores techniques for identifying and prioritizing critical memory requests, as well as mechanisms for selectively bypassing the caches.

For example, Jog et al. [2013] propose a memory request prioritization scheme called "Critical-Aware Warp Acceleration" (CAWA). CAWA identifies critical warps that are likely to stall the pipeline and prioritizes their memory requests over those of non-critical warps. The scheme uses a combination of static and dynamic information, such as the number of dependent instructions and the age of the warp, to determine criticality.

Lee et al. [2015] propose a cache bypassing scheme for GPUs that aims to reduce cache pollution and improve the timeliness of memory accesses. The scheme uses a PC-based prediction mechanism to identify memory requests that are unlikely to benefit from caching and bypasses them directly to the lower-level memory hierarchy. The authors show that their scheme can significantly improve performance and energy efficiency compared to a baseline GPU without bypassing.
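As a generic illustration of this class of predictors (not the exact mechanism of the cited work), the sketch below keeps a table of 2-bit saturating counters indexed by load PC and learns whether lines inserted by each PC tend to be reused before eviction:

    #include <array>
    #include <cstdint>

    // Generic PC-indexed bypass predictor (illustrative only): a table of 2-bit
    // saturating counters, trained on whether lines inserted by each load PC
    // are reused before eviction.
    class BypassPredictor {
        std::array<uint8_t, 1024> ctr{};                    // start at 0 (cache)
        static uint32_t index(uint64_t pc) { return (pc >> 2) & 1023; }
    public:
        bool shouldBypass(uint64_t pc) const { return ctr[index(pc)] >= 2; }
        // Called when a line inserted by 'pc' is evicted; 'reused' says whether
        // it was hit at least once while resident.
        void train(uint64_t pc, bool reused) {
            uint8_t& c = ctr[index(pc)];
            if (reused) { if (c > 0) --c; }                 // reward caching
            else        { if (c < 3) ++c; }                 // push toward bypass
        }
    };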

Exploiting Inter-Warp Heterogeneity

GPUs execute a large number of warps concurrently to hide memory latency and achieve high throughput. However, different warps can exhibit significant heterogeneity in terms of their resource requirements, memory access patterns, and performance characteristics.

Research in this area aims to exploit this inter-warp heterogeneity to improve resource allocation, scheduling, and memory management in GPUs.

For instance, Kayıran et al. [2014] propose a warp-level divergence-aware cache management scheme that dynamically adapts the cache allocation and replacement policies based on the divergence characteristics of each warp. Warps with high divergence are allocated more cache resources to reduce memory divergence, while warps with low divergence are allocated fewer resources to improve cache utilization.

Another example is the work by Sethia et al. [2015], which proposes a memory controller design that exploits inter-warp heterogeneity to improve DRAM bank-level parallelism. The authors observe that different warps can have different degrees of bank-level parallelism, and propose a warp-aware memory scheduling algorithm that prioritizes warps with high bank-level parallelism to reduce memory contention and improve system throughput.

Coordinated Cache Bypassing

Cache bypassing is a technique that allows memory requests to skip the cache and directly access the lower-level memory hierarchy. While bypassing can help reduce cache pollution and improve the timeliness of memory accesses, uncoordinated bypassing decisions across different cores and memory partitions can lead to suboptimal performance.

Research in this area explores techniques for coordinating cache bypassing decisions across the GPU to improve overall system performance and resource utilization.

For example, Li et al. [2015] propose a coordinated cache bypassing scheme for GPUs that uses a centralized bypass controller to make global bypassing decisions. The controller collects runtime information from each core, such as cache miss rates and memory access patterns, and uses this information to determine the optimal bypassing strategy for each core. The authors show that their scheme can significantly improve performance and energy efficiency compared to uncoordinated bypassing.

Adaptive Cache Management

The optimal cache configuration for a GPU application can vary significantly depending on its memory access patterns, working set size, and resource requirements. Static cache management policies that are fixed at design time may not be able to adapt to the diverse and dynamic behavior of different applications.

Research in this area explores techniques for dynamically adapting the cache configuration and management policies based on the runtime behavior of the application.

For instance, Wang et al. [2016] propose an adaptive cache management scheme for GPUs that dynamically adjusts the cache partition sizes and replacement policies based on the memory access patterns of the application. The scheme uses a combination of hardware and software techniques to monitor the cache behavior and make dynamic adjustments to improve cache utilization and performance.

Another example is the work by Dai et al. [2018], which proposes a machine learning-based approach for adaptive cache management in GPUs. The authors use reinforcement learning to automatically learn the optimal cache configuration for each application based on its runtime behavior. The learned policies are then implemented using a reconfigurable cache architecture that can adapt to the specific needs of each application.

Cache Prioritization

In GPUs, different types of memory requests, such as load, store, and texture requests, can have different latency and bandwidth requirements. Prioritizing certain types of requests over others can help improve overall system performance and resource utilization.

Research in this area explores techniques for prioritizing different types of memory requests in the GPU cache hierarchy.

For example, Zhao et al. [2018] propose a cache prioritization scheme for GPUs that assigns different priorities to different types of memory requests based on their criticality and latency sensitivity. The scheme uses a combination of static and dynamic information, such as the instruction type and the number of dependent instructions, to determine the priority of each request. The authors show that their scheme can significantly improve performance and energy efficiency compared to a baseline GPU without prioritization.

Virtual Memory Page Placement

GPUs have traditionally relied on explicit, programmer-managed memory: the programmer allocates device memory and copies data to and from it. Recent GPUs, however, support demand-paged virtual memory (e.g., unified memory), which allows the runtime and operating system to place and migrate pages between host and device memory automatically.

Research in this area explores techniques for optimizing virtual memory page placement in GPUs to improve memory access locality and reduce address translation overhead.

For instance, Zheng et al. [2016] propose a page placement scheme for GPUs that aims to improve memory access locality by placing pages that are frequently accessed together in the same memory channel or bank. The scheme uses a combination of hardware and software techniques to monitor the memory access patterns of the application and make dynamic page placement decisions.

Another example is the work by Ganguly et al. [2019], which proposes a virtual memory management scheme for GPUs that aims to reduce address translation overhead. The scheme uses a combination of hardware and software techniques, such as translation lookaside buffer (TLB) prefetching and page table compression, to reduce the latency and bandwidth overhead of address translation.

Data Placement

The placement of data in the GPU memory hierarchy can have a significant impact on memory access locality and performance. Optimizing data placement can help reduce memory latency, improve cache utilization, and increase memory bandwidth utilization.

Research in this area explores techniques for optimizing data placement in GPUs based on the memory access patterns and resource requirements of the application.

For example, Agarwal et al. [2015] propose a data placement scheme for GPUs that aims to improve memory access locality by placing data that is frequently accessed together in the same memory channel or bank. The scheme uses a combination of static and dynamic analysis to determine the optimal data placement for each application.

Another example is the work by Tang et al. [2017], which proposes a data placement scheme for GPUs that aims to improve memory bandwidth utilization by placing data in different memory channels based on their access patterns. The scheme uses a machine learning-based approach to predict the memory access patterns of the application and make dynamic data placement decisions.

Multi-Chip-Module GPUs

As the performance and power requirements of GPUs continue to increase, traditional single-chip designs may not be able to keep up with the demand. Multi-chip-module (MCM) designs, where multiple GPU chips are integrated into a single package, have emerged as a promising solution to this problem.

Research in this area explores the design and optimization of MCM GPUs, including the memory system architecture, interconnect design, and resource management.

For instance, Arunkumar et al. [2017] propose an MCM GPU design that uses a high-bandwidth, low-latency interconnect to connect multiple GPU chips. The authors also propose a memory system architecture that leverages the increased bandwidth and capacity of the MCM design to improve performance and energy efficiency.

Another example is the work by Milic et al. [2018], which proposes a resource management scheme for MCM GPUs that aims to improve resource utilization and reduce inter-chip communication overhead. The scheme uses a combination of hardware and software techniques to monitor the resource usage and communication patterns of the application and make dynamic resource allocation decisions.

Conclusion

The memory system is a critical component of modern GPU architectures, and its design and optimization can have a significant impact on overall system performance and efficiency. As the demands of parallel workloads continue to grow, researchers are exploring a wide range of techniques to improve the performance, scalability, and adaptability of GPU memory systems.

Some of the key research directions in this area include memory access scheduling and interconnect design, caching effectiveness, memory request prioritization and cache bypassing, exploiting inter-warp heterogeneity, coordinated cache bypassing, adaptive cache management, cache prioritization, virtual memory page placement, data placement, and multi-chip-module designs.

By exploring these and other techniques, researchers aim to develop GPU memory systems that can keep up with the increasing demands of parallel workloads while maintaining high performance and energy efficiency. As GPUs continue to evolve and find new applications in areas such as machine learning, scientific computing, and data analytics, the design and optimization of their memory systems will remain an important area of research and innovation.