How to Design GPU Chips

Chapter 5: GPU Memory System Design

Graphics Processing Units (GPUs) have evolved into highly parallel, programmable accelerators capable of achieving high performance and energy efficiency on a wide range of applications. The memory system is a critical component of modern GPU architectures, as it must supply the massive number of concurrent threads with fast access to data. In this chapter, we will explore the key elements of GPU memory system design, including DRAM technologies used in GPUs, memory controllers and arbitration, shared memory and caches, and techniques for efficient memory utilization.

DRAM Technologies for GPUs

Dynamic Random Access Memory (DRAM) is the primary technology used for implementing main memory in modern computing systems, including GPUs. DRAM offers high density and relatively low cost compared to other memory technologies. However, DRAM also has higher access latency and lower bandwidth compared to on-chip memories like caches and register files.

GPUs typically employ specialized DRAM technologies that are optimized for high bandwidth rather than low latency. Some common DRAM technologies used in GPUs include:

  1. GDDR (Graphics Double Data Rate): GDDR is a specialized DRAM technology designed for graphics cards and game consoles. It offers higher bandwidth than standard DDR DRAM by using a wider interface and higher per-pin data rates. GDDR5 and GDDR6 are recent generations, providing aggregate GPU memory bandwidths of up to roughly 512 GB/s and 768 GB/s, respectively, depending on bus width and data rate.

  2. HBM (High Bandwidth Memory): HBM is a high-performance 3D-stacked DRAM technology that provides very high bandwidth and low power consumption. HBM stacks multiple DRAM dies on top of each other and connects them using through-silicon vias (TSVs), enabling much higher data transfer rates than traditional DRAM. HBM2 can provide bandwidths of up to 1 TB/s.

Figure 5.1 illustrates the difference between traditional GDDR memory and 3D-stacked HBM.

   GDDR Memory                         HBM Memory
  ____________                   ______________________  
 |            |                 |  ___________________  |
 |   DRAM     |                 | |                   | |
 |   Chips    |                 | |      DRAM Dies    | |
 |            |                 | |___________________| |
 |            |                 |           .          |
 |            |                 |           .          | 
 |            |                 |           .          |
 |____________|                 |  ___________________  |
      |                         | |                   | |
     PCB                        | |  Logic Die (GPU)  | |
                                | |___________________| |
                                |______________________|

Figure 5.1: Comparison of GDDR and HBM memory architectures.

The choice of DRAM technology depends on the specific requirements of the GPU, such as power budget, form factor, and target applications. High-end GPUs for gaming and professional graphics often use GDDR6 for its high bandwidth, while HBM2 is more common in data center and HPC GPUs where power efficiency is a key concern.
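
As a back-of-the-envelope check, peak DRAM bandwidth follows from the per-pin data rate and the total interface width. The small C++ sketch below reproduces the figures quoted above; the data rates and bus widths used are illustrative example configurations, not fixed specifications.

    #include <cstdio>

    int main() {
        // Peak bandwidth (bytes/s) = per-pin data rate (bits/s) x interface width (bits) / 8.
        // GDDR6 example: 16 Gbit/s per pin on a 384-bit bus.
        double gddr6_gbs = 16e9 * 384 / 8 / 1e9;     // 768 GB/s
        // HBM2 example: 2 Gbit/s per pin, 1024-bit interface per stack, 4 stacks.
        double hbm2_gbs  = 2e9 * 1024 * 4 / 8 / 1e9; // 1024 GB/s (~1 TB/s)
        printf("GDDR6: %.0f GB/s, HBM2: %.0f GB/s\n", gddr6_gbs, hbm2_gbs);
        return 0;
    }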

Memory Controllers and Arbitration

Memory controllers are responsible for managing the flow of data between the GPU and the off-chip DRAM. They handle memory requests from the GPU cores, schedule DRAM commands, and optimize memory access patterns to maximize bandwidth utilization and minimize latency.

GPU memory controllers typically employ a multi-channel design to provide high bandwidth and parallel access to DRAM. Each memory channel is connected to one or more DRAM chips and has its own command and data buses. The memory controller distributes memory requests across the available channels to maximize parallelism and avoid channel conflicts.
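
As a minimal sketch of how a controller might spread requests, the function below maps a physical address to a channel by interleaving fixed-size blocks across the channels. The channel count and interleaving granularity are illustrative assumptions, not any particular GPU's address mapping.

    #include <cstdint>

    constexpr uint64_t kNumChannels     = 4;    // assumed number of memory channels
    constexpr uint64_t kInterleaveBytes = 256;  // assumed interleaving granularity

    // Interleave consecutive 256-byte blocks of the physical address space
    // across the four channels so that streaming accesses use all channels.
    uint64_t AddressToChannel(uint64_t physAddr) {
        return (physAddr / kInterleaveBytes) % kNumChannels;
    }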

Figure 5.2 shows a simplified diagram of a GPU memory controller with four channels.

          GPU Cores
              |
        ______|______
       |             |
       |  Memory     |
       |  Controller |
       |_____________|
         |    |    |    |
        Ch0  Ch1  Ch2  Ch3
         |    |    |    |
        DRAM DRAM DRAM DRAM

Figure 5.2: GPU memory controller with four channels.

Memory arbitration is the process of deciding which memory requests should be serviced first when there are multiple outstanding requests. GPUs employ various arbitration policies to optimize memory system performance and fairness:

  1. First-Come, First-Served (FCFS): The simplest arbitration policy, where requests are serviced in the order they arrive. FCFS is fair but can lead to suboptimal performance due to lack of request reordering.

  2. Round-Robin (RR): Requests are serviced in a cyclic order, ensuring equal priority for all requestors. RR provides fairness but may not optimize for locality or urgency of requests.

  3. Priority-Based: Requests are assigned priorities based on various criteria, such as the type of request (e.g., read vs. write), the source (e.g., texture vs. L2 cache), or the age of the request. Higher-priority requests are serviced first.

  4. Deadline-Aware: Requests are scheduled based on their deadlines to ensure timely completion. This is particularly important for real-time graphics applications.

  5. Locality-Aware: The memory controller attempts to schedule requests that access nearby memory locations together to maximize row buffer hits and minimize DRAM precharge and activation overhead.

Advanced GPU memory controllers often employ a combination of these arbitration policies to achieve the best balance between performance, fairness, and real-time requirements.
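
The sketch below shows one way such a combination might look: row-buffer hits are preferred first (locality-aware), then higher-priority requests, with arrival time as a tiebreaker so older requests are not starved. The request fields and the specific ordering are illustrative assumptions, not a particular vendor's scheduler.

    #include <cstdint>
    #include <vector>

    struct MemRequest {
        uint64_t rowId;     // DRAM row this request maps to
        int      priority;  // higher value = more urgent (e.g., reads over writes)
        uint64_t arrival;   // cycle the request arrived (for the FCFS tiebreak)
    };

    // Return the index of the next request to service, or -1 if the queue is empty.
    int SelectNext(const std::vector<MemRequest>& queue, uint64_t openRow) {
        int best = -1;
        for (int i = 0; i < (int)queue.size(); ++i) {
            if (best < 0) { best = i; continue; }
            const MemRequest& a = queue[i];
            const MemRequest& b = queue[best];
            bool aHit = (a.rowId == openRow);
            bool bHit = (b.rowId == openRow);
            if (aHit != bHit)             { if (aHit) best = i; continue; }               // locality first
            if (a.priority != b.priority) { if (a.priority > b.priority) best = i; continue; }
            if (a.arrival < b.arrival)    best = i;                                       // oldest first
        }
        return best;
    }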

Shared Memory and Caches

GPUs employ a hierarchical memory system that includes both software-managed and hardware-managed caches to reduce the latency and bandwidth demands on the main memory.

Shared Memory

Shared memory is a software-managed, on-chip memory space that is shared among the threads of a thread block (NVIDIA) or workgroup (OpenCL). It acts as a user-controlled cache, allowing programmers to explicitly manage data movement and reuse within a thread block.

Shared memory is typically implemented as multiple banks of fast, on-chip SRAM to provide low-latency, high-bandwidth access. Each bank can service one memory request per cycle, so concurrent accesses that map to the same bank (bank conflicts) are serialized by the hardware, reducing effective bandwidth.
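
For example, many recent NVIDIA GPUs organize shared memory as 32 banks with successive 4-byte words mapped to successive banks; under that assumption, the bank an address falls into is simply:

    // Assuming 32 banks and a 4-byte bank width (typical of recent NVIDIA GPUs):
    // successive 32-bit words map to successive banks.
    unsigned BankIndex(unsigned byteAddr) {
        return (byteAddr / 4) % 32;
    }

Threads in a warp that access different words in the same bank therefore conflict, while threads that access consecutive words do not.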

Figure 5.3 illustrates the organization of shared memory in a GPU core.

        Thread Block
   ______________________
  |  _________________   |
  | |    Thread 0     |  |
  | |_________________|  |
  |         .            |
  |         .            |
  |         .            |
  |  _________________   |
  | |    Thread N-1   |  |
  | |_________________|  |
  |______________________|
             |
     ________|________
    |                 |
    |  Shared Memory  |
    |  ____________   |
    | | Bank 0     |  |
    | |____________|  |
    | | Bank 1     |  |
    | |____________|  |
    |       .         |
    |       .         |
    |       .         |
     |  ____________   |
     | | Bank M-1   |  |
    | |____________|  |
    |_________________|

Figure 5.3: Shared memory organization in a GPU core.

Proper use of shared memory can significantly improve the performance of GPU kernels by reducing the number of accesses to the slower, off-chip DRAM. However, it requires careful programming to ensure efficient data sharing and avoid bank conflicts.
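
As a sketch of these ideas, the CUDA kernel below stages a tile of a matrix through shared memory to transpose it; the extra column of padding (TILE + 1) is a common trick that keeps column-wise accesses from mapping to the same bank. The tile size and kernel structure are illustrative, not the only way to write this.

    #define TILE 32

    // Transpose a TILE x TILE block of `in` into `out` using shared memory.
    // Padding the shared array to TILE+1 columns avoids bank conflicts when
    // threads read down a column for the transposed store.
    __global__ void transposeTile(const float* in, float* out, int n) {
        __shared__ float tile[TILE][TILE + 1];

        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        if (x < n && y < n)
            tile[threadIdx.y][threadIdx.x] = in[y * n + x];    // coalesced load

        __syncthreads();

        int tx = blockIdx.y * TILE + threadIdx.x;              // transposed coordinates
        int ty = blockIdx.x * TILE + threadIdx.y;
        if (tx < n && ty < n)
            out[ty * n + tx] = tile[threadIdx.x][threadIdx.y]; // coalesced store
    }

Both the load and the store address global memory with consecutive threads touching consecutive elements, so the accesses remain coalesced even though the data is transposed in between.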

Hardware-Managed Caches

In addition to software-managed shared memory, GPUs also employ hardware-managed caches to automatically exploit data locality and reduce DRAM accesses. The most common types of hardware-managed caches in GPUs are:

  1. L1 Data Cache: A small, per-core cache that stores recently accessed global memory data. The L1 cache is typically private to each GPU core and is used to reduce the latency of global memory accesses.

  2. Texture Cache: A specialized cache designed to optimize access to read-only texture data. The texture cache is optimized for 2D spatial locality and supports hardware-accelerated filtering and interpolation operations.

  3. Constant Cache: A small, read-only cache that stores frequently accessed constant data. A value read from the constant cache is broadcast to all threads in a warp that access the same address, making it efficient for data that is shared among many threads (see the sketch after this list).

  4. L2 Cache: A larger cache, shared by all GPU cores, that sits between the cores and the main memory. The L2 cache services misses from the per-core caches and is used to reduce the number of DRAM accesses.
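
On NVIDIA GPUs, for instance, these caches can be exercised directly from CUDA code: data declared __constant__ is served by the constant cache, and read-only global data marked const __restrict__ can be routed through the read-only/texture data path on hardware that supports it. The kernel below is a small illustrative sketch.

    __constant__ float scaleFactor;   // small, uniform value served by the constant cache

    // `const __restrict__` tells the compiler the input is read-only and not aliased,
    // allowing loads of `in` to use the read-only (texture) data path where available.
    __global__ void scaleArray(const float* __restrict__ in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = scaleFactor * in[i];
    }

The host would initialize scaleFactor with cudaMemcpyToSymbol before launching the kernel.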

Figure 5.4 shows a typical GPU memory hierarchy with hardware-managed caches.

      GPU Core 0         GPU Core 1         GPU Core N-1
   ________________     ________________     ________________
  |                |   |                |   |                |
  |    L1 Data     |   |    L1 Data     |   |    L1 Data     |
  |     Cache      |   |     Cache      |   |     Cache      |
  |________________|   |________________|   |________________|
  |                |   |                |   |                |
  |    Texture     |   |    Texture     |   |    Texture     |
  |     Cache      |   |     Cache      |   |     Cache      |
  |________________|   |________________|   |________________|
  |                |   |                |   |                |
  |    Constant    |   |    Constant    |   |    Constant    |
  |     Cache      |   |     Cache      |   |     Cache      |
  |________________|   |________________|   |________________|
         |                     |                     |
         |_____________________|_____________________|
                               |
                        _______|_______
                       |               |
                       |   L2 Cache    |
                       |_______________|
                               |
                               |
                           Main Memory

Figure 5.4: GPU memory hierarchy with hardware-managed caches.

Hardware-managed caches help improve the performance of GPU applications by automatically exploiting data locality and reducing the number of DRAM accesses. However, they can also introduce cache coherence and consistency challenges, particularly in the context of parallel programming models like CUDA and OpenCL.

Techniques for Efficient Memory Utilization

Efficient utilization of the GPU memory system is crucial for achieving high performance and energy efficiency. Some key techniques for optimizing memory usage in GPU applications include:

  1. Coalescing: Arranging memory accesses from threads in a warp to adjacent memory locations, allowing the hardware to combine them into a single, wider memory transaction. Coalescing maximizes the utilization of DRAM bandwidth and reduces the number of memory transactions.

  2. Data Layout Optimization: Organizing data structures in memory to maximize spatial locality and minimize cache misses. A common choice is between an array-of-structures (AoS) layout, which keeps the fields of each element together, and a structure-of-arrays (SoA) layout, which stores each field in its own contiguous array; SoA often enables coalesced accesses on GPUs.

  3. Caching and Prefetching: Utilizing hardware-managed caches effectively by exploiting temporal and spatial locality in memory access patterns. This can be achieved through techniques like data tiling, which breaks data into smaller chunks that fit in the cache, and software prefetching, which explicitly loads data into the cache before it is needed.

  4. Memory Access Scheduling: Reordering memory accesses to maximize row buffer hits and minimize DRAM precharge and activation overhead. This can be done through hardware mechanisms in the memory controller or through software techniques like access pattern optimization and data layout transformations.

  5. Compression: Applying data compression techniques to reduce the size of data transferred between memory and the GPU cores. This can help alleviate bandwidth bottlenecks and reduce energy consumption associated with data movement.

  6. Memory Virtualization: Employing virtual memory techniques to provide a unified, contiguous address space for GPU applications. This allows for more flexible memory management and enables features like demand paging, which can help reduce memory footprint and improve system utilization.

Figure 5.5 illustrates some of these techniques in the context of a GPU memory system.

       GPU Cores
          |
    ______|______
   |             |
   |  Coalescing |
   |_____________|
          |
    ______|______
   |             |
   | Data Layout |
   | Optimization|
   |_____________|
          |
    ______|______
   |             |
   | Caching and |
   | Prefetching |
   |_____________|
          |
    ______|______
   |             |
   |   Memory    |
   |   Access    |
   |  Scheduling |
   |_____________|
          |
    ______|______
   |             |
   | Compression |
   |_____________|
          |
    ______|______
   |             |
   |   Memory    |
   |Virtualization|
   |_____________|
          |
        DRAM

Figure 5.5: Techniques for efficient memory utilization in a GPU memory system.

The examples below illustrate several of these techniques.

Coalescing example:

    // Uncoalesced access pattern: consecutive threads in a warp touch
    // locations `stride` elements apart, producing many memory transactions.
    int idx = threadIdx.x;
    float val = input[idx * stride];

    // Coalesced access pattern: consecutive threads read consecutive
    // elements, which the hardware merges into a single wide transaction.
    int idx = threadIdx.x;
    float val = input[idx];

Data layout example:

    // Array-of-Structures (AoS) layout: the fields of each point are stored together.
    struct Point {
        float x;
        float y;
        float z;
    };
    Point points[N];

    // Structure-of-Arrays (SoA) layout: each field is stored in its own
    // contiguous array, which typically yields coalesced accesses.
    struct Points {
        float x[N];
        float y[N];
        float z[N];
    };
    Points points;

Caching and prefetching example (data tiling):

    // Process the matrix in TILE_SIZE x TILE_SIZE blocks so that each
    // block fits in the cache before moving on to the next one.
    for (int i = 0; i < N; i += TILE_SIZE) {
        for (int j = 0; j < N; j += TILE_SIZE) {
            for (int ii = i; ii < i + TILE_SIZE; ii++) {
                for (int jj = j; jj < j + TILE_SIZE; jj++) {
                    // Perform computation on A[ii][jj]
                }
            }
        }
    }

Compression examples:

    • Delta encoding: storing the differences between consecutive values instead of the values themselves.
    • Run-length encoding: replacing runs of repeated values with a single instance and a count.
    • Huffman coding: assigning shorter bit sequences to more frequently occurring values.
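
As a toy sketch of the first of these, delta encoding and decoding of an integer array can be written as follows (plain C++, illustrative only; hardware GPU compression schemes apply similar ideas to blocks of memory traffic).

    #include <cstdint>
    #include <vector>

    // Delta-encode: keep the first value, then store successive differences,
    // which are typically small and compress well for slowly varying data.
    std::vector<int32_t> DeltaEncode(const std::vector<int32_t>& values) {
        std::vector<int32_t> deltas(values.size());
        for (size_t i = 0; i < values.size(); ++i)
            deltas[i] = (i == 0) ? values[0] : values[i] - values[i - 1];
        return deltas;
    }

    // Decode reverses the process with a running sum.
    std::vector<int32_t> DeltaDecode(const std::vector<int32_t>& deltas) {
        std::vector<int32_t> values(deltas.size());
        int32_t running = 0;
        for (size_t i = 0; i < deltas.size(); ++i) {
            running += deltas[i];
            values[i] = running;
        }
        return values;
    }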

Memory virtualization example:

    • Unified Virtual Addressing (UVA) in CUDA: enables GPU threads to access host (CPU) memory through a single pointer, simplifying memory management in heterogeneous systems.

Multi-Chip-Module GPUs

As the performance and power requirements of GPUs continue to increase, traditional single-chip designs may not be able to keep up with the demand. Multi-chip-module (MCM) designs, where multiple GPU chips are integrated into a single package, have emerged as a promising solution to this problem.

MCM GPU designs offer several advantages:

  1. Higher memory bandwidth: By integrating multiple memory stacks or chips, MCM GPUs can provide significantly higher memory bandwidth compared to single-chip designs.

  2. Improved scalability: MCM designs allow for the integration of more compute units and memory controllers, enabling GPUs to scale to higher performance levels.

  3. Better yield and cost-efficiency: Smaller individual chips in an MCM design can have better manufacturing yields and be more cost-effective compared to large monolithic chips.

However, MCM GPU designs also introduce new challenges, such as:

  1. Inter-chip communication: Efficient communication between the different chips in an MCM package is crucial for performance. High-bandwidth, low-latency interconnects are required to minimize the overhead of data movement between chips.

  2. Power delivery and thermal management: MCM designs require careful power delivery and thermal management strategies to ensure optimal performance and reliability.

  3. Software support: MCM GPUs may require changes to the programming model and runtime systems to fully exploit the benefits of the multi-chip architecture.

Research in this area explores the design and optimization of MCM GPUs, including the memory system architecture, interconnect design, and resource management.

For instance, Arunkumar et al. [2017] propose an MCM GPU design that uses a high-bandwidth, low-latency interconnect to connect multiple GPU chips. The authors also propose a memory system architecture that leverages the increased bandwidth and capacity of the MCM design to improve performance and energy efficiency.

Another example is the work by Milic et al. [2018], which proposes a resource management scheme for MCM GPUs that aims to improve resource utilization and reduce inter-chip communication overhead. The scheme uses a combination of hardware and software techniques to monitor the resource usage and communication patterns of the application and make dynamic resource allocation decisions.

Conclusion

The memory system is a critical component of modern GPU architectures, and its design and optimization can have a significant impact on overall system performance and efficiency. As the demands of parallel workloads continue to grow, researchers are exploring a wide range of techniques to improve the performance, scalability, and adaptability of GPU memory systems.

Some of the key research directions in this area include memory access scheduling and interconnect design, cache management techniques such as request prioritization, cache bypassing, and adaptive replacement, exploiting inter-warp heterogeneity, virtual memory page placement and data placement, and multi-chip-module designs.

By exploring these and other techniques, researchers aim to develop GPU memory systems that can keep up with the increasing demands of parallel workloads while maintaining high performance and energy efficiency. As GPUs continue to evolve and find new applications in areas such as machine learning, scientific computing, and data analytics, the design and optimization of their memory systems will remain an important area of research and innovation.