
Chapter 3: Parallel Programming Models in GPU Design

Graphics Processing Units (GPUs) have evolved from fixed-function graphics accelerators to highly parallel, programmable computing engines capable of accelerating a wide range of applications. To enable programmers to effectively harness the massive parallelism in GPUs, several parallel programming models and APIs have been developed, such as NVIDIA CUDA, OpenCL, and DirectCompute. These programming models provide abstractions that allow programmers to express parallelism in their applications while hiding the low-level details of the GPU hardware.

In this chapter, we will explore the key concepts and principles behind parallel programming models for GPUs, focusing on the SIMT (Single Instruction, Multiple Thread) execution model, the CUDA programming model and APIs, and the OpenCL framework. We will also discuss techniques for mapping algorithms to GPU architectures to achieve high performance and efficiency.

SIMT (Single Instruction, Multiple Thread) Execution Model

The SIMT execution model is the fundamental paradigm used by modern GPUs to achieve massive parallelism. In the SIMT model, a large number of threads execute the same program (called a kernel) in parallel, but each thread has its own program counter and can take different execution paths based on its thread ID and the data it operates on.

Kernels and Thread Hierarchy

A GPU kernel is a function that is executed in parallel by a large number of threads. When launching a kernel, the programmer specifies the number of threads to be created and how they are organized into a hierarchy of grids, blocks (or cooperative thread arrays - CTAs), and individual threads.

  • A grid represents the entire problem space and consists of one or more blocks.
  • A block is a group of threads that can cooperate and synchronize with each other via shared memory and barriers. Threads within a block are executed on the same GPU core (called a streaming multiprocessor on NVIDIA GPUs or a compute unit on AMD GPUs).
  • Each thread has a unique ID within its block and grid, which can be used to compute memory addresses and make control flow decisions.

This hierarchical organization allows programmers to express both data parallelism (where the same operation is applied to multiple data elements) and task parallelism (where different tasks are executed in parallel).
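
As a concrete illustration, the short sketch below (using CUDA syntax, which is introduced later in this chapter) launches a hypothetical kernel myKernel over N data elements. The grid size is rounded up so that every element is covered by exactly one thread; N, threadsPerBlock, and d_data are assumptions made for this example.

int N = 1 << 20;                                      // assumed problem size: about one million elements
int threadsPerBlock = 256;                            // threads per block
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;  // round up so all elements are covered
myKernel<<<blocksPerGrid, threadsPerBlock>>>(d_data, N);          // grid of blocks of threads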

Figure 3.1 illustrates the thread hierarchy in the SIMT execution model.

[Figure 3.1: Thread hierarchy in the SIMT execution model. A grid is composed of thread blocks, and each block is composed of individual threads.]

SIMT Execution

In the SIMT execution model, each thread executes the same instruction but operates on different data. However, unlike SIMD (Single Instruction, Multiple Data) where all processing elements execute in lockstep, SIMT allows threads to have independent execution paths and diverge at branch instructions.

When a warp (a group of 32 threads on NVIDIA GPUs; AMD's equivalent, the wavefront, typically contains 64 threads) encounters a branch instruction, the GPU hardware evaluates the branch condition for each thread in the warp. If all threads take the same path (convergence), the warp continues execution normally. However, if some threads take different paths (divergence), the hardware serializes execution of the divergent paths, masking off the threads that are not on the currently executing path. When all paths complete, the threads reconverge and continue execution in lockstep.

Figure 3.2 illustrates SIMT execution with divergent control flow.

[Figure 3.2: SIMT execution with divergent control flow. A warp reaches a branch, the divergent paths are executed serially with inactive threads masked off, and the threads reconverge once all paths complete.]

This divergence handling mechanism allows SIMT to support more flexible control flow than SIMD, but it comes at the cost of reduced SIMD efficiency when divergence occurs. Programmers should strive to minimize divergence within a warp to achieve optimal performance.
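
To make the cost of divergence concrete, the sketch below contrasts two hypothetical CUDA kernels. In divergent, even- and odd-numbered threads within the same warp take different branches, so the hardware executes the two paths one after the other. In convergent, the branch condition is uniform across each 32-thread warp, so no serialization occurs. Both kernels are illustrative sketches rather than tuned code.

__global__ void divergent(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0) {            // neighboring threads in a warp disagree: paths are serialized
        data[i] = data[i] * 2.0f;
    } else {
        data[i] = data[i] + 1.0f;
    }
}

__global__ void convergent(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((i / 32) % 2 == 0) {     // condition is uniform within each 32-thread warp: no divergence
        data[i] = data[i] * 2.0f;
    } else {
        data[i] = data[i] + 1.0f;
    }
}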

Memory Hierarchy

GPUs have a complex memory hierarchy to support the high bandwidth and low latency requirements of parallel workloads. The memory hierarchy typically consists of:

  • Global memory: The largest but slowest memory space, accessible by all threads in a kernel. Global memory is typically implemented using high-bandwidth GDDR or HBM memory.
  • Shared memory: A fast, on-chip memory space shared by all threads in a block. Shared memory is used for inter-thread communication and data sharing within a block.
  • Constant memory: A small, cached, read-only memory space used for broadcasting the same data to all threads.
  • Texture memory: A read-only memory space optimized for spatial locality and accessed via texture caches. Texture memory is more commonly used in graphics workloads.
  • Local memory: A private memory space for each thread, used for register spilling and large data structures. Local memory is typically mapped to global memory.

Effective utilization of the memory hierarchy is crucial for achieving high performance on GPUs. Programmers should aim to maximize the use of shared memory and minimize accesses to global memory to reduce memory latency and bandwidth bottlenecks.
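
As a small example of staging data through shared memory, the hypothetical kernel below reverses the elements handled by each block: every thread loads one element into on-chip shared memory, the block synchronizes, and each thread then writes out the element at the mirrored position within its block. The kernel name and block size are assumptions for this sketch, and n is assumed to be a multiple of the block size for simplicity.

#define BLOCK_SIZE 256

__global__ void blockReverse(const int *in, int *out, int n) {
    __shared__ int tile[BLOCK_SIZE];                    // fast on-chip staging buffer
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        tile[threadIdx.x] = in[i];                      // coalesced load into shared memory
    }
    __syncthreads();                                    // make all loads visible to the block

    int j = blockDim.x - 1 - threadIdx.x;               // mirrored position within the block
    if (i < n) {
        out[i] = tile[j];                               // read a neighbor's element from shared memory
    }
}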

Figure 3.3 illustrates the GPU memory hierarchy.

[Figure 3.3: GPU memory hierarchy, comprising global, constant, texture, shared, and local memory spaces.]

CUDA Programming Model and APIs

CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model developed by NVIDIA for general-purpose computing on GPUs. CUDA provides a set of extensions to standard programming languages, such as C, C++, and Fortran, that enable programmers to express parallelism and leverage the computational power of NVIDIA GPUs.

CUDA Programming Model

The CUDA programming model is based on the concept of kernels, which are functions executed in parallel by a large number of threads on the GPU. The programmer specifies the number of threads to be launched and their organization into a grid of thread blocks.

CUDA introduces several key abstractions to facilitate parallel programming:

  • Thread: The basic unit of execution in CUDA. Each thread has its own program counter, registers, and local memory.
  • Block: A group of threads that can cooperate and synchronize with each other. Threads within a block are executed on the same streaming multiprocessor and can communicate via shared memory.
  • Grid: A collection of thread blocks that execute the same kernel. The grid represents the entire problem space and can be one-, two-, or three-dimensional.

CUDA also provides built-in variables (e.g., threadIdx, blockIdx, blockDim, gridDim) that allow threads to identify themselves and compute memory addresses based on their position in the thread hierarchy.
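
The short sketch below shows how these built-in variables are typically combined to compute a two-dimensional index; the kernel name scaleMatrix and the row-major layout are assumptions made for this illustration.

__global__ void scaleMatrix(float *m, int width, int height, float s) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // x coordinate within the grid
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // y coordinate within the grid
    if (col < width && row < height) {
        m[row * width + col] *= s;                     // row-major element address
    }
}

// Host-side launch with a 2D grid of 2D blocks:
// dim3 block(16, 16);
// dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
// scaleMatrix<<<grid, block>>>(d_m, width, height, 2.0f);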

Figure 3.4 illustrates the CUDA programming model.

[Figure 3.4: CUDA programming model. A grid of thread blocks executes a kernel, with each block composed of individual threads.]

CUDA Memory Hierarchy

CUDA exposes the GPU memory hierarchy to the programmer, allowing explicit control over data placement and movement. The main memory spaces in CUDA are:

  • Global memory: Accessible by all threads in a kernel and persists across kernel launches. Global memory has the highest latency and is typically used for large data structures.
  • Shared memory: A fast, on-chip memory shared by all threads in a block. Shared memory is used for inter-thread communication and data sharing within a block.
  • Constant memory: A read-only memory space used for broadcasting the same data to all threads. Constant memory is cached and provides low-latency access.
  • Texture memory: A read-only memory space optimized for spatial locality and accessed via texture caches. Texture memory is more commonly used in graphics workloads.
  • Local memory: A private memory space for each thread, used for register spilling and large data structures. Local memory is typically mapped to global memory.

Programmers can allocate and transfer data between the host (CPU) and device (GPU) memory using CUDA runtime APIs, such as cudaMalloc, cudaMemcpy, and cudaFree.

Figure 3.5 illustrates the CUDA memory hierarchy.

[Figure 3.5: CUDA memory hierarchy, comprising global, constant, texture, shared, and local memory spaces.]

CUDA Synchronization and Coordination

CUDA provides synchronization and coordination primitives to enable cooperation and communication between threads:

  • Barrier synchronization: The __syncthreads() function acts as a barrier that ensures all threads in a block have reached the same point before proceeding.
  • Atomic operations: CUDA supports atomic operations (e.g., atomicAdd, atomicExch) that allow threads to perform read-modify-write operations on shared or global memory without interference from other threads.
  • Warp-level primitives: CUDA provides warp-level intrinsics (e.g., __shfl, __ballot) that enable efficient communication and synchronization within a warp.

Proper use of synchronization and coordination primitives is essential for writing correct and efficient parallel programs in CUDA.
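
The sketch below combines these primitives in a simple block-level sum reduction; the kernel name blockSum and the fixed block size are assumptions for this example. Each block reduces its portion of the input in shared memory, using __syncthreads() between steps, and one thread per block then adds its partial sum to the global result with atomicAdd().

#define BLOCK_SIZE 256

__global__ void blockSum(const int *in, int *result, int n) {
    __shared__ int partial[BLOCK_SIZE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    partial[threadIdx.x] = (i < n) ? in[i] : 0;         // out-of-range threads contribute 0
    __syncthreads();                                    // all loads complete before reducing

    // Tree reduction in shared memory
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride) {
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        }
        __syncthreads();                                // finish each step before starting the next
    }

    if (threadIdx.x == 0) {
        atomicAdd(result, partial[0]);                  // combine the per-block partial sums
    }
}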

Example 3.1 shows a simple CUDA kernel that performs vector addition.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Kernel: each thread computes one element of the output vector
__global__ void vectorAdd(int *a, int *b, int *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}
 
int main() {
    int *a, *b, *c;
    int n = 1024;
    
    // Allocate memory on the host
    a = (int*)malloc(n * sizeof(int));
    b = (int*)malloc(n * sizeof(int));
    c = (int*)malloc(n * sizeof(int));
    
    // Initialize input vectors
    for (int i = 0; i < n; i++) {
        a[i] = i;
        b[i] = i * 2;
    }
    
    // Allocate memory on the device
    int *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, n * sizeof(int));
    cudaMalloc(&d_b, n * sizeof(int));
    cudaMalloc(&d_c, n * sizeof(int));
    
    // Copy input vectors from host to device
    cudaMemcpy(d_a, a, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, n * sizeof(int), cudaMemcpyHostToDevice);
    
    // Launch the kernel
    int blockSize = 256;
    int numBlocks = (n + blockSize - 1) / blockSize;
    vectorAdd<<<numBlocks,blockSize>>>(d_a, d_b, d_c, n);
    
    // Copy result vector from device to host
    cudaMemcpy(c, d_c, n * sizeof(int), cudaMemcpyDeviceToHost);
    
    // Free device memory
    cudaFree(d_a);
    cudaFree(d_b); 
    cudaFree(d_c);
    
    // Free host memory
    free(a); 
    free(b);
    free(c);
    
    return 0;
}

This CUDA code launches the vectorAdd kernel with numBlocks blocks and blockSize threads per block. The kernel performs element-wise addition of the input vectors a and b and stores the result in vector c. The <<<...>>> syntax is used to specify the grid and block dimensions when launching a kernel.
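
One practical note on the launch above: kernel launches are asynchronous and do not return an error status directly, so it is common practice to check for errors explicitly. The calls below are standard CUDA runtime APIs; the sketch assumes it is placed immediately after the launch in Example 3.1.

vectorAdd<<<numBlocks, blockSize>>>(d_a, d_b, d_c, n);
cudaError_t err = cudaGetLastError();        // reports invalid launch configurations
if (err != cudaSuccess) {
    printf("Kernel launch failed: %s\n", cudaGetErrorString(err));
}
cudaDeviceSynchronize();                     // waits for the kernel and surfaces runtime errors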

CUDA Streams and Events

CUDA streams and events provide a mechanism for concurrent execution and synchronization of kernels and memory operations:

  • Streams: A sequence of operations (kernel launches, memory copies) that execute in order. Different streams can execute concurrently, allowing overlap of computation and memory transfers.
  • Events: Markers that can be inserted into a stream to record the completion of specific operations. Events can be used for synchronization and timing purposes.

Streams and events enable programmers to optimize the performance of their CUDA applications by overlapping computation and memory transfers and exploiting the full capabilities of the GPU hardware.

Example 3.2 demonstrates the use of CUDA streams to overlap kernel execution and memory transfers.

// Create two streams
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
 
// Asynchronously copy input data to the device
cudaMemcpyAsync(d_a, a, size, cudaMemcpyHostToDevice, stream1);
cudaMemcpyAsync(d_b, b, size, cudaMemcpyHostToDevice, stream2);
 
// Launch kernels in different streams
kernelA<<<blocks, threads, 0, stream1>>>(d_a);
kernelB<<<blocks, threads, 0, stream2>>>(d_b);
 
// Asynchronously copy results back to the host
cudaMemcpyAsync(a, d_a, size, cudaMemcpyDeviceToHost, stream1);
cudaMemcpyAsync(b, d_b, size, cudaMemcpyDeviceToHost, stream2);
 
// Synchronize streams
cudaStreamSynchronize(stream1);
cudaStreamSynchronize(stream2);

In this example, two CUDA streams are created. Input data is copied to the device asynchronously in each stream, kernels are launched in the different streams so that they can execute concurrently, and the results are copied back to the host asynchronously; the streams are then synchronized to ensure all operations have completed. Note that the asynchronous copies can only overlap with kernel execution if the host buffers are allocated as pinned (page-locked) memory, for example with cudaMallocHost or cudaHostAlloc.
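
CUDA events, mentioned above but not shown in Example 3.2, are commonly used for timing. The sketch below measures the execution time of kernelA in stream1, reusing the variables from Example 3.2; cudaEventRecord inserts a marker into a stream, and cudaEventElapsedTime reports the time in milliseconds between two recorded markers.

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, stream1);                     // mark the start in stream1
kernelA<<<blocks, threads, 0, stream1>>>(d_a);
cudaEventRecord(stop, stream1);                      // mark completion of kernelA

cudaEventSynchronize(stop);                          // wait until the stop event is reached
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);              // elapsed time in milliseconds
printf("kernelA took %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);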

OpenCL Framework

OpenCL (Open Computing Language) is an open, royalty-free standard for parallel programming across heterogeneous platforms, including CPUs, GPUs, FPGAs, and other accelerators. OpenCL provides a unified programming model and a set of APIs that allow developers to write portable and efficient parallel code.

OpenCL Programming Model

The OpenCL programming model is similar to CUDA, with a few key differences in terminology and abstractions:

  • Kernel: A function executed in parallel by a large number of work-items (threads) on an OpenCL device.
  • Work-item: The basic unit of execution in OpenCL, analogous to a thread in CUDA.
  • Work-group: A collection of work-items that can synchronize and share data through local memory. Work-groups are analogous to thread blocks in CUDA.
  • NDRange: Defines the index space and work-item organization for a kernel execution. It can be one-, two-, or three-dimensional.

OpenCL also defines a hierarchical memory model similar to CUDA:

  • Global memory: Accessible by all work-items in all work-groups, analogous to global memory in CUDA.
  • Local memory: Shared by all work-items in a work-group, analogous to shared memory in CUDA.
  • Private memory: Private to a single work-item, analogous to registers and per-thread local memory in CUDA.
  • Constant memory: Read-only memory accessible by all work-items.

OpenCL kernels are typically compiled at runtime for the target device. The host program queries the available OpenCL devices, selects an appropriate one, creates a context, and builds the kernel for that specific device. This runtime compilation is what makes OpenCL applications highly portable across different hardware platforms.

Example 3.3 shows an OpenCL kernel that performs vector addition, similar to the CUDA example in Example 3.1.

__kernel void vectorAdd(__global const int *a, __global const int *b, __global int *c, int n) {
    int i = get_global_id(0);
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}

The __kernel keyword defines an OpenCL kernel function. The __global keyword specifies that a pointer points to global memory. The get_global_id function returns the global index of the current work-item, which is used to compute the memory addresses for the input and output vectors.
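
To complement the kernel above, the condensed host-side sketch below (with error checking omitted) shows how the runtime compilation and launch described earlier might look. It assumes source points to the kernel text from Example 3.3 and that a, b, c, and n are the host arrays and their length, as in Example 3.1; the calls used are standard OpenCL 1.x APIs.

#include <CL/cl.h>

cl_platform_id platform;
clGetPlatformIDs(1, &platform, NULL);
cl_device_id device;
clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
cl_context context = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
cl_command_queue queue = clCreateCommandQueue(context, device, 0, NULL);

// Build the kernel source for the selected device at runtime
cl_program program = clCreateProgramWithSource(context, 1, &source, NULL, NULL);
clBuildProgram(program, 1, &device, NULL, NULL, NULL);
cl_kernel kernel = clCreateKernel(program, "vectorAdd", NULL);

// Create device buffers and copy the input vectors
cl_mem d_a = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, n * sizeof(int), a, NULL);
cl_mem d_b = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, n * sizeof(int), b, NULL);
cl_mem d_c = clCreateBuffer(context, CL_MEM_WRITE_ONLY, n * sizeof(int), NULL, NULL);

// Bind the kernel arguments and launch one work-item per element
clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_a);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_b);
clSetKernelArg(kernel, 2, sizeof(cl_mem), &d_c);
clSetKernelArg(kernel, 3, sizeof(int), &n);
size_t globalSize = n;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, NULL, 0, NULL, NULL);

// Copy the result back to the host (blocking read)
clEnqueueReadBuffer(queue, d_c, CL_TRUE, 0, n * sizeof(int), c, 0, NULL, NULL);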

Mapping Algorithms to GPU Architectures

Efficiently mapping algorithms to the GPU architecture is crucial for achieving high performance. Key considerations include:

  • Exposing sufficient parallelism: The algorithm should be decomposed into many fine-grained threads that can execute concurrently to fully utilize the GPU's parallel processing capabilities.

  • Minimizing branch divergence: Divergent control flow within a warp/wavefront can lead to serialization and reduced SIMD efficiency. Algorithms should be structured to minimize branch divergence where possible.

  • Exploiting the memory hierarchy: Accessing global memory is expensive. Algorithms should maximize the use of shared memory and registers to reduce global memory accesses. Data should also be laid out so that consecutive threads in a warp access consecutive addresses, enabling coalesced memory accesses (see the sketch following this list).

  • Balancing computation and memory accesses: Algorithms should have a high ratio of arithmetic operations to memory operations to effectively hide memory latency and achieve high computational throughput.

  • Minimizing host-device data transfers: Transferring data between host and device memory is slow. Algorithms should minimize such transfers by performing as much computation on the GPU as possible.
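
The following sketch illustrates the coalescing point from the list above with two hypothetical kernels. In coalesced, consecutive threads of a warp read consecutive addresses, so the warp's loads combine into a few wide memory transactions; in strided, consecutive threads touch addresses that are stride elements apart, scattering the same loads across many transactions and wasting bandwidth.

__global__ void coalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = in[i];                 // adjacent threads access adjacent addresses
    }
}

__global__ void strided(const float *in, float *out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) {
        out[i] = in[i];                 // adjacent threads access addresses far apart
    }
}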

Several parallel algorithm design patterns are commonly used when developing GPU kernels:

  • Map: Each thread performs the same operation on a different data element, enabling simple parallel processing of large datasets.

  • Reduce: Parallel reduction is used to efficiently compute a single value (e.g., sum, maximum) from a large input dataset. Threads perform local reductions, which are then combined to produce the final result.

  • Scan: Also known as prefix sum, scan is used to compute the running sum of elements in an array. Efficient parallel scan algorithms are key building blocks for many GPU-accelerated applications.

  • Stencil: Each thread computes a value based on neighboring data elements. Stencil computations are common in scientific simulations and image processing applications (a minimal example follows this list).

  • Gather/Scatter: Threads read from (gather) or write to (scatter) arbitrary locations in global memory. Careful data layout and access patterns are required for efficiency.
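
As a minimal example of the stencil pattern, the hypothetical kernel below averages each element with its two neighbors. In practice, the input tile would usually be staged in shared memory first (as in the earlier shared-memory sketch) so that neighboring elements are not re-read from global memory by every thread.

__global__ void stencil3pt(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1) {
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0f;   // 3-point moving average
    }
}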

Conclusion

GPU programming models like CUDA and OpenCL expose the parallel processing capabilities of modern GPUs to developers, enabling them to accelerate a wide range of applications. These programming models provide abstractions that allow fine-grained parallel workloads to be efficiently mapped to the GPU hardware.

Understanding the execution model, memory hierarchy, and synchronization primitives provided by these programming models is essential for writing high-performance GPU code. Developers must carefully consider factors such as thread organization, branch divergence, memory access patterns, and algorithm design to fully harness the computational power of GPUs.

As GPU architectures continue to evolve, programming models and tools must also advance to enable developers to effectively utilize new hardware features and capabilities. Ongoing research in areas such as programming language design, compiler optimization, and autotuning will be crucial for improving programmer productivity and performance portability in the era of heterogeneous computing.