How to Design GPU Chips

Chapter 2: GPU Programming Models

Graphics Processing Units (GPUs) have evolved from fixed-function graphics accelerators to highly parallel, programmable computing engines capable of accelerating a wide range of applications. To enable programmers to effectively harness the massive parallelism in GPUs, several parallel programming models and APIs have been developed, such as NVIDIA CUDA, OpenCL, and DirectCompute. These programming models provide abstractions that allow programmers to express parallelism in their applications while hiding the low-level details of the GPU hardware.

In this chapter, we will explore the key concepts and principles behind parallel programming models for GPUs, focusing on the execution model, the memory hierarchy, and GPU instruction set architectures (ISAs), including NVIDIA's PTX and SASS ISAs and AMD's Graphics Core Next (GCN) ISA. We will also provide examples to illustrate how these concepts are applied in practice.

Execution Model

The execution model of modern GPU programming models is based on the concept of kernels, which are functions that are executed in parallel by a large number of threads on the GPU. When launching a kernel, the programmer specifies the number of threads to be created and how they are organized into a hierarchy of grids, blocks (or cooperative thread arrays - CTAs), and individual threads.

  • A grid represents the entire problem space and consists of one or more blocks.
  • A block is a group of threads that can cooperate and synchronize with each other via shared memory and barriers. Threads within a block are executed on the same GPU core (called a streaming multiprocessor or compute unit).
  • Each thread has a unique ID within its block and grid, which can be used to compute memory addresses and make control flow decisions.

This hierarchical organization allows programmers to express both data parallelism (where the same operation is applied to multiple data elements) and task parallelism (where different tasks are executed in parallel).
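
To make this concrete, here is a minimal CUDA sketch (a hypothetical vecAdd kernel and launchVecAdd helper, essentially the same kernel whose PTX appears later in this chapter): each thread handles one output element, and the host picks the grid and block dimensions at launch time.

#include <cuda_runtime.h>

// Each thread handles one element of the vectors.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique global thread ID
    if (i < n)                                      // guard the final, partial block
        c[i] = a[i] + b[i];
}

// Host-side launch: enough blocks of 256 threads to cover all n elements.
void launchVecAdd(const float* a, const float* b, float* c, int n) {
    int threadsPerBlock = 256;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAdd<<<blocksPerGrid, threadsPerBlock>>>(a, b, c, n);
}

Here blocksPerGrid and threadsPerBlock define the grid and block dimensions of the hierarchy described above, while the built-in variables blockIdx, blockDim, and threadIdx let each thread locate its own data element.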

Figure 2.1 illustrates the thread hierarchy in the GPU execution model.

            Grid
  +-----------+-----------+-----------+-----+
  |  Block 0  |  Block 1  |  Block 2  | ... |
  +-----------+-----------+-----------+-----+
        |
        v
      Block
  +----------+----------+----------+-----+
  | Thread 0 | Thread 1 | Thread 2 | ... |
  +----------+----------+----------+-----+
Figure 2.1: Thread hierarchy in the GPU execution model.

SIMT Execution

GPU programming models like CUDA and OpenCL follow a Single-Instruction, Multiple-Thread (SIMT) execution model. In the SIMT model, threads are executed in groups called warps (NVIDIA terminology) or wavefronts (AMD terminology). All threads within a warp execute the same instruction at the same time, but each thread operates on different data.

However, unlike the traditional Single-Instruction, Multiple-Data (SIMD) model, where all processing elements execute in lockstep, SIMT allows threads to have independent execution paths and diverge at branch instructions. When a warp encounters a branch instruction, the GPU hardware evaluates the branch condition for each thread in the warp. If all threads take the same path (converged), the warp continues execution normally. If some threads take different paths (diverged), the warp is split into two or more subwarps, each following a different path. The GPU hardware serializes the execution of the divergent paths, masking off the inactive threads in each subwarp. When all paths complete, the subwarps reconverge and continue execution in lockstep.

Figure 2.2 illustrates SIMT execution with divergent control flow.

          Warp (threads executing in lockstep)
                        |
                        v
                 branch condition
                 /              \
        taken path              not-taken path
  (not-taken lanes masked)   (taken lanes masked)
                 \              /
                  v            v
                 reconvergence
        (lockstep execution resumes)
Figure 2.2: SIMT execution with divergent control flow.

This divergence handling mechanism allows SIMT to support more flexible control flow than SIMD, but it comes at the cost of reduced SIMD efficiency when divergence occurs. Programmers should strive to minimize divergence within a warp to achieve optimal performance.
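
As a small, hypothetical illustration (assuming a warp size of 32), the two CUDA kernels below do the same work, but the first diverges inside every warp while the second keeps each warp on a single path:

// Divergent: even and odd lanes of the same warp take different branches,
// so the two paths are serialized with half the lanes masked off each time.
__global__ void divergentKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i % 2 == 0)
        data[i] *= 2.0f;
    else
        data[i] += 1.0f;
}

// Warp-aligned: all 32 lanes of a warp evaluate the condition identically,
// so no warp splits and both paths run at full SIMD width.
__global__ void warpAlignedKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if ((i / 32) % 2 == 0)
        data[i] *= 2.0f;
    else
        data[i] += 1.0f;
}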

Memory Hierarchy

GPUs have a complex memory hierarchy to support the high bandwidth and low latency requirements of parallel workloads. The memory hierarchy typically consists of:

  • Global memory: The largest but slowest memory space, accessible by all threads in a kernel. Global memory is typically implemented using high-bandwidth GDDR or HBM memory.
  • Shared memory: A fast, on-chip memory space shared by all threads in a block. Shared memory is used for inter-thread communication and data sharing within a block.
  • Constant memory: A read-only memory space used for broadcasting read-only data to all threads.
  • Texture memory: A read-only memory space optimized for spatial locality and accessed via texture caches. Texture memory is more commonly used in graphics workloads.
  • Local memory: A private memory space for each thread, used for register spilling and large data structures. Local memory is typically mapped to global memory.

Effective utilization of the memory hierarchy is crucial for achieving high performance on GPUs. Programmers should aim to maximize the use of shared memory and minimize accesses to global memory to reduce memory latency and bandwidth bottlenecks.
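
The sketch below (a hypothetical 1D stencil kernel, with RADIUS and BLOCK_SIZE chosen arbitrarily and the kernel assumed to be launched with BLOCK_SIZE threads per block) shows the typical pattern: threads cooperatively stage a tile of global memory into shared memory, synchronize at a barrier, and then reuse the tile many times.

#define RADIUS 3
#define BLOCK_SIZE 256

// 1D stencil: out[i] = sum of in[i-RADIUS .. i+RADIUS]. The shared-memory tile
// lets each input element be read from global memory once per block instead of
// once per neighboring thread.
__global__ void stencil1d(const float* in, float* out, int n) {
    __shared__ float tile[BLOCK_SIZE + 2 * RADIUS];

    int gidx = blockIdx.x * blockDim.x + threadIdx.x;
    int lidx = threadIdx.x + RADIUS;

    // Stage this thread's element (zero-padded past the end of the array).
    tile[lidx] = (gidx < n) ? in[gidx] : 0.0f;

    // The first RADIUS threads also stage the left and right halo cells.
    if (threadIdx.x < RADIUS) {
        int left  = gidx - RADIUS;
        int right = gidx + BLOCK_SIZE;
        tile[lidx - RADIUS]     = (left  >= 0) ? in[left]  : 0.0f;
        tile[lidx + BLOCK_SIZE] = (right <  n) ? in[right] : 0.0f;
    }
    __syncthreads();  // the whole tile must be populated before anyone reads it

    if (gidx < n) {
        float sum = 0.0f;
        for (int off = -RADIUS; off <= RADIUS; ++off)
            sum += tile[lidx + off];
        out[gidx] = sum;
    }
}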

Figure 2.3 illustrates the GPU memory hierarchy.

      +--------------+
      |    Global    |
      |    Memory    |
      +--------------+
             |
      +--------------+
      |   Constant   |
      |    Memory    |
      +--------------+
             |
      +--------------+
      |   Texture    |
      |    Memory    |
      +--------------+
             |
      +--------------+
      |    Shared    |
      |    Memory    |
      +--------------+
             |
      +--------------+
      |    Local     |
      |    Memory    |
      +--------------+

Figure 2.3: GPU memory hierarchy.

GPU Instruction Set Architectures

GPU instruction set architectures (ISAs) define the low-level interface between software and hardware. They specify the instructions, registers, and memory addressing modes supported by the GPU. Understanding GPU ISAs is essential for developing efficient GPU code and optimizing performance.

In this section, we will explore the ISAs of two major GPU vendors: NVIDIA and AMD. We will focus on NVIDIA's Parallel Thread Execution (PTX) and SASS ISAs, and AMD's Graphics Core Next (GCN) ISA.

NVIDIA GPU ISAs

NVIDIA GPUs support two levels of ISAs: PTX (Parallel Thread Execution) and SASS (Streaming ASSembler). PTX is a virtual ISA that provides a stable target for CUDA compilers, while SASS is the native ISA of NVIDIA GPUs.

PTX (Parallel Thread Execution)

PTX is a low-level, virtual ISA designed for NVIDIA GPUs. It is similar to LLVM IR or Java bytecode in that it provides a stable, architecture-independent target for compilers. CUDA programs are typically compiled to PTX code, which is then translated to the native SASS instructions by the NVIDIA GPU driver.

PTX supports a wide range of arithmetic, memory, and control flow instructions. It has an unlimited number of virtual registers and supports predication, which allows for efficient implementation of control flow. PTX also provides special instructions for thread synchronization, atomic operations, and texture sampling.

Here's an example of PTX code for a simple vector addition kernel:

.version 7.0
.target sm_70
.address_size 64

.visible .entry vecAdd(
    .param .u64 vecAdd_param_0,
    .param .u64 vecAdd_param_1,
    .param .u64 vecAdd_param_2,
    .param .u32 vecAdd_param_3
)
{
    .reg .pred %p<2>;
    .reg .f32 %f<4>;
    .reg .b32 %r<6>;
    .reg .b64 %rd<11>;

    ld.param.u64 %rd1, [vecAdd_param_0];
    ld.param.u64 %rd2, [vecAdd_param_1];
    ld.param.u64 %rd3, [vecAdd_param_2];
    ld.param.u32 %r1, [vecAdd_param_3];

    // global thread ID = ctaid.x * ntid.x + tid.x
    mov.u32 %r2, %ctaid.x;
    mov.u32 %r3, %ntid.x;
    mov.u32 %r4, %tid.x;
    mad.lo.s32 %r5, %r2, %r3, %r4;

    // skip threads that fall past the end of the vectors
    setp.ge.s32 %p1, %r5, %r1;
    @%p1 bra $L_done;

    cvta.to.global.u64 %rd4, %rd1;
    cvta.to.global.u64 %rd5, %rd2;
    mul.wide.s32 %rd6, %r5, 4;
    add.s64 %rd7, %rd4, %rd6;
    add.s64 %rd8, %rd5, %rd6;

    ld.global.f32 %f1, [%rd7];
    ld.global.f32 %f2, [%rd8];
    add.f32 %f3, %f1, %f2;

    cvta.to.global.u64 %rd9, %rd3;
    add.s64 %rd10, %rd9, %rd6;
    st.global.f32 [%rd10], %f3;

$L_done:
    ret;
}

This PTX code defines a kernel function vecAdd that takes four parameters: pointers to the two input vectors and the output vector, and the number of elements. The kernel computes the global thread ID from the block and thread indices, skips threads that fall past the end of the vectors, loads the corresponding elements from the input vectors, performs the addition, and stores the result to the output vector.

SASS (Streaming ASSembler)

SASS is the native ISA of NVIDIA GPUs. It is a low-level, machine-specific ISA that maps directly to the GPU hardware. SASS instructions are generated from PTX code, either ahead of time by the ptxas compiler or at load time by the GPU driver, and programmers do not normally write them by hand, although they can be inspected with tools such as cuobjdump.

SASS instructions are encoded in a compact format to reduce memory bandwidth and instruction cache footprint. They support a wide range of operand types, including registers, immediate values, and various addressing modes for memory access.

Here's a simplified excerpt of the SASS generated for the vector addition kernel (the address computation and predicate setup that precede these instructions are omitted, and the exact instructions vary between GPU generations):

                     @P0 LDG.E R2, [R8];
                     @P0 LDG.E R0, [R10];
                     @P0 FADD R0, R2, R0;
                     @P0 STG.E [R12], R0;
                         EXIT;

This SASS excerpt corresponds to the core of the PTX code shown earlier: it loads the input vector elements from global memory (LDG.E), performs the addition (FADD), stores the result back to global memory (STG.E), and exits the kernel (EXIT).

AMD Graphics Core Next ISA

AMD GPUs based on the Graphics Core Next (GCN) architecture use the GCN ISA, a RISC-based ISA that supports both graphics and compute workloads. It is designed for high performance, scalability, and power efficiency.

GCN introduces several key features, such as:

  • A scalar ALU for efficient execution of scalar operations and flow control.
  • A vector ALU for parallel execution of data-parallel operations.
  • A high-bandwidth memory system with support for atomic operations and low-latency access to shared memory.
  • A flexible addressing mode for memory operations, supporting base+offset and scalar+vector addressing.

Here's an example of GCN ISA code for a vector addition kernel:

.text
.globl vecAdd
.p2align 2
.type vecAdd,@function
vecAdd:
    ; load kernel arguments: s[0:1] = a, s[2:3] = b, s[6:7] = c
    s_load_dwordx4 s[0:3], s[4:5], 0x0
    s_load_dwordx2 s[6:7], s[4:5], 0x10
    s_waitcnt lgkmcnt(0)

    ; byte offset for this work-item: v4 = id * 4
    v_lshlrev_b32 v4, 2, v0

    ; load a[id]
    v_mov_b32 v1, s1
    v_add_u32 v0, vcc, s0, v4
    v_addc_u32 v1, vcc, 0, v1, vcc
    flat_load_dword v5, v[0:1]

    ; load b[id]
    v_mov_b32 v3, s3
    v_add_u32 v2, vcc, s2, v4
    v_addc_u32 v3, vcc, 0, v3, vcc
    flat_load_dword v6, v[2:3]

    ; wait for both loads, then add
    s_waitcnt vmcnt(0)
    v_add_f32 v5, v5, v6

    ; store c[id]
    v_mov_b32 v1, s7
    v_add_u32 v0, vcc, s6, v4
    v_addc_u32 v1, vcc, 0, v1, vcc
    flat_store_dword v[0:1], v5
    s_endpgm

This GCN code uses s_load_dwordx4 and s_load_dwordx2 to fetch the kernel arguments (the two input pointers and the output pointer) from the kernel argument segment, computes each work-item's byte offset from the work-item ID in v0, loads the input elements with flat_load_dword, waits for the loads to complete with s_waitcnt, adds them with v_add_f32, and stores the result with flat_store_dword before ending the program with s_endpgm.

Mapping Algorithms to GPU Architectures

Efficiently mapping algorithms to the GPU architecture is crucial for achieving high performance. Key considerations include:

Exposing Sufficient Parallelism

The algorithm should be decomposed into many fine-grained threads that can execute concurrently to fully utilize the GPU's parallel processing capabilities. This often involves identifying data-parallel portions of the algorithm that can be executed independently on different data elements.

Minimizing Branch Divergence

Divergent control flow within a warp/wavefront can lead to serialization and reduced SIMD efficiency. Algorithms should be structured to minimize branch divergence where possible. This can be achieved by reducing the use of data-dependent branching or by rearranging the data layout to ensure threads within a warp follow similar execution paths.

Exploiting Memory Hierarchy

Accessing global memory is expensive. Algorithms should maximize the use of shared memory and registers to reduce global memory accesses. Data should also be laid out in memory to enable coalesced memory accesses, where threads in a warp access contiguous memory locations. Effective use of the memory hierarchy can significantly reduce memory latency and bandwidth bottlenecks.
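
For example, the two hypothetical copy kernels below both move floats through global memory, but the first lets each warp's 32 accesses coalesce into a few wide transactions while the second scatters them:

// Coalesced: consecutive threads of a warp read consecutive floats, so the
// hardware can combine the warp's accesses into a few wide transactions.
__global__ void copyCoalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: consecutive threads touch addresses 'stride' elements apart, so
// each access may need its own transaction, wasting most of the bandwidth.
__global__ void copyStrided(const float* in, float* out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}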

Balancing Computation and Memory Accesses

Algorithms should have a high ratio of arithmetic operations to memory operations to effectively hide memory latency and achieve high computational throughput. This can be achieved by maximizing data reuse, prefetching data, and overlapping computation with memory accesses.
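
One common way to raise this ratio is kernel fusion. In the hypothetical sketch below, computing y = a*x + b as two separate kernels moves every element through global memory twice, while the fused version performs both operations in a single pass:

// Unfused: two kernels, two full passes over the data in global memory.
__global__ void scaleKernel(const float* x, float* y, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i];
}
__global__ void offsetKernel(float* y, float b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] += b;
}

// Fused: one pass over the data performs both operations, roughly doubling
// the arithmetic done per byte of memory traffic.
__global__ void scaleOffsetKernel(const float* x, float* y, float a, float b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + b;
}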

Minimizing Host-Device Data Transfers

Transferring data between host (CPU) and device (GPU) memory is slow. Algorithms should minimize such transfers by performing as much computation on the GPU as possible. Data should be transferred to the GPU in large batches and kept on the device for as long as needed to amortize the transfer overhead.
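
A typical structure is sketched below (with a placeholder step kernel standing in for real processing stages): the input is uploaded once, several kernels operate on the device-resident copy, and only the final result is copied back.

#include <cuda_runtime.h>

// Placeholder kernel standing in for a real processing step.
__global__ void step(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 0.5f + 1.0f;
}

// Transfer the input once, run several kernels on the device-resident data,
// and copy back only the final result, amortizing the transfer cost.
void runPipeline(const float* hostIn, float* hostOut, int n) {
    float* d;
    size_t bytes = n * sizeof(float);
    cudaMalloc(&d, bytes);
    cudaMemcpy(d, hostIn, bytes, cudaMemcpyHostToDevice);    // one bulk upload

    int block = 256, grid = (n + block - 1) / block;
    for (int pass = 0; pass < 3; ++pass)
        step<<<grid, block>>>(d, n);   // intermediate data never leaves the GPU

    cudaMemcpy(hostOut, d, bytes, cudaMemcpyDeviceToHost);   // one download
    cudaFree(d);
}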

Several parallel algorithm design patterns are commonly used when developing GPU kernels:

  • Map: Each thread performs the same operation on a different data element, enabling simple parallel processing of large datasets.
  • Reduce: Parallel reduction is used to efficiently compute a single value (e.g., sum, maximum) from a large input dataset. Threads perform local reductions, which are then combined to produce the final result (a code sketch of this pattern appears after Figure 2.4 below).
  • Scan: Also known as prefix sum, scan is used to compute the running sum of elements in an array. Efficient parallel scan algorithms are key building blocks for many GPU-accelerated applications.
  • Stencil: Each thread computes a value based on neighboring data elements. Stencil computations are common in scientific simulations and image processing applications.
  • Gather/Scatter: Threads read from (gather) or write to (scatter) arbitrary locations in global memory. Careful data layout and access patterns are required for efficiency.

Figure 2.4 illustrates an example of the map pattern, where each thread applies a function (e.g., square root) to a different element of the input array.

Input Array:   [  1 |  4 |  9 | 16 | 25 | 36 | 49 | 64 ]
                  |    |    |    |    |    |    |    |
                  v    v    v    v    v    v    v    v
Threads:        [ T0 | T1 | T2 | T3 | T4 | T5 | T6 | T7 ]   (each computes sqrt)
                  |    |    |    |    |    |    |    |
                  v    v    v    v    v    v    v    v
Output Array:  [  1 |  2 |  3 |  4 |  5 |  6 |  7 |  8 ]
Figure 2.4: Example of the map pattern in GPU programming.
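
As an example of the reduce pattern referenced above, the sketch below (a hypothetical reduceSum kernel, assuming it is launched with 256 threads per block) has each block produce one partial sum in shared memory; the partial sums can then be reduced by a second kernel launch or on the host.

// Block-level sum reduction: each block reduces blockDim.x elements in shared
// memory and writes one partial sum. Assumes blockDim.x == 256 (a power of two).
__global__ void reduceSum(const float* in, float* partial, int n) {
    __shared__ float sdata[256];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    sdata[tid] = (i < n) ? in[i] : 0.0f;   // load one element per thread
    __syncthreads();

    // Tree reduction in shared memory: the active thread count halves each step.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        partial[blockIdx.x] = sdata[0];    // one partial sum per block
}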

Conclusion

GPU programming models like CUDA and OpenCL expose the parallel processing capabilities of modern GPUs to developers, enabling them to accelerate a wide range of applications. These programming models provide abstractions that allow fine-grained parallel workloads to be efficiently mapped to the GPU hardware.

Understanding the execution model, memory hierarchy, and synchronization primitives provided by these programming models is essential for writing high-performance GPU code. Developers must carefully consider factors such as thread organization, branch divergence, memory access patterns, and algorithm design to fully harness the computational power of GPUs.

As GPU architectures continue to evolve, programming models and tools must also advance to enable developers to effectively utilize new hardware features and capabilities. Ongoing research in areas such as programming language design, compiler optimization, and autotuning will be crucial for improving programmer productivity and performance portability in the era of heterogeneous computing.