How to Design GPU Chips
Chapter 7: Streaming Multiprocessor Design

The streaming multiprocessor (SM) is the fundamental building block of NVIDIA GPU architectures. Each SM contains a set of CUDA cores that execute instructions in a SIMT (Single Instruction, Multiple Thread) fashion. The SM is responsible for managing and scheduling warps, handling branch divergence, and providing fast access to shared memory and caches. In this chapter, we will explore the microarchitecture of the SM, including its pipelines, warp scheduling mechanisms, register file design, and shared memory and L1 cache organization.

SM Microarchitecture and Pipelines

The SM is a highly parallel and pipelined processor designed to efficiently execute hundreds of threads concurrently. Figure 7.1 shows a simplified block diagram of an SM in the NVIDIA Volta architecture.

                                 Instruction Cache
                                         |
                                         v
                                    Warp Scheduler
                                         |
                                         v
                               Dispatch Unit (4 warps)
                                 |   |   |   |
                                 v   v   v   v
                               CUDA Core (FP64/FP32/INT)
                               CUDA Core (FP64/FP32/INT)
                               CUDA Core (FP64/FP32/INT)
                               ...
                               Tensor Core
                               Tensor Core
                               ...
                               Load/Store Unit
                               Load/Store Unit
                               ...
                               Special Function Unit
                                         ^
                                         |
                                 Register File (256 KB)
                                          ^
                                          |
                                   Shared Memory / L1 Cache (128 KB)

Figure 7.1: Simplified block diagram of an SM in the NVIDIA Volta architecture.

The main components of the SM include:

  1. Instruction Cache: Stores frequently accessed instructions to reduce latency and improve throughput.

  2. Warp Scheduler: Selects warps that are ready to execute and dispatches them to the available execution units.

  3. Dispatch Units: Issue decoded instructions from the scheduled warps to the appropriate execution units. The Volta SM contains four scheduler-and-dispatch partitions (one per processing block), so up to four warps can be issued per cycle.

  4. CUDA Cores: Programmable execution units that support a wide range of integer and floating-point operations. Each SM in Volta contains 64 CUDA cores.

  5. Tensor Cores: Specialized execution units designed for accelerating deep learning and AI workloads. Each SM in Volta contains 8 Tensor Cores.

  6. Load/Store Units: Handle memory operations, including loads and stores to global memory, shared memory, and caches.

  7. Special Function Units: Execute transcendental and other complex math operations.

  8. Register File: Provides fast access to thread-private registers. Each Volta SM has a 256 KB register file (65,536 32-bit registers), split into 64 KB per processing block.

  9. Shared Memory / L1 Cache: A configurable memory space that can be used as a software-managed cache (shared memory) or as a hardware-managed L1 data cache. In Volta, the two share a 128 KB pool per SM, of which up to 96 KB can be configured as shared memory; the sketch after this list shows how a kernel can request a particular split.
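
Because the shared memory / L1 split is configurable, a preferred carveout can be requested from host code. The fragment below is a minimal sketch using the CUDA runtime call cudaFuncSetAttribute with the cudaFuncAttributePreferredSharedMemoryCarveout attribute (Volta and later); myKernel and the 75 percent figure are illustrative assumptions, and the runtime treats the value only as a hint.

#include <cuda_runtime.h>

__global__ void myKernel(float *data) {
    // Kernel body that benefits from a large shared memory allocation.
}

int main() {
    // Request that roughly 75% of the unified SRAM be used as shared memory;
    // the hardware rounds the hint to one of its supported split points.
    cudaFuncSetAttribute(myKernel,
                         cudaFuncAttributePreferredSharedMemoryCarveout,
                         75);

    // ... allocate device buffers and launch myKernel as usual ...
    return 0;
}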

The SM pipeline is designed to maximize throughput by allowing multiple warps to execute concurrently and hide memory latency. Figure 7.2 illustrates a simplified view of the SM pipeline.

    Instruction Fetch
            |
            v
    Instruction Decode
            |
            v
    Operand Collection
            |
            v
    Execution (CUDA Cores, Tensor Cores, Load/Store Units, Special Function Units)
            |
            v
    Writeback

Figure 7.2: Simplified SM pipeline.

The pipeline stages are as follows:

  1. Instruction Fetch: The warp scheduler selects a warp that is ready to execute and fetches the next instruction for that warp from the instruction cache.

  2. Instruction Decode: The fetched instruction is decoded to determine the operation type, operands, and destination registers.

  3. Operand Collection: The required operands for the instruction are collected from the register file or shared memory.

  4. Execution: The instruction is executed on the appropriate execution unit (CUDA Core, Tensor Core, Load/Store Unit, or Special Function Unit).

  5. Writeback: The result of the execution is written back to the register file or shared memory.

To achieve high performance, the SM employs several techniques to maximize resource utilization and hide latency:

  • Dual-Issue: In several GPU generations, the scheduler can issue two independent instructions from the same warp in a single cycle, increasing instruction-level parallelism.
  • Pipelined Execution Units: The execution units are pipelined, enabling the SM to start a new operation on a unit before the previous operation has completed.
  • Latency Hiding: The SM can switch between warps on a cycle-by-cycle basis, hiding the latency of memory accesses and other long-latency operations by executing instructions from other warps. A rough estimate of how many warps this requires follows this list.
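
As a back-of-envelope illustration of latency hiding (every number here is an assumption chosen only to make the arithmetic concrete): suppose a global memory load takes roughly 400 cycles to return, and each warp can issue about 10 independent instructions before it stalls waiting on such a load. Then the scheduler needs approximately

    warps needed ≈ stall latency / independent instructions per warp
                 ≈ 400 / 10
                 = 40 resident warps

to keep issuing on every cycle while loads are in flight, which is why the SM is built to hold tens of resident warps rather than a handful.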

Example 7.1 shows a simple CUDA kernel that performs element-wise addition of two vectors.

__global__ void vectorAdd(int *a, int *b, int *c, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        c[tid] = a[tid] + b[tid];
    }
}

Example 7.1: CUDA kernel for vector addition.

In this example, each thread computes the sum of one pair of elements from the input vectors a and b and stores the result in the output vector c. The SM executes the kernel by grouping the threads of each block into warps of 32 and scheduling those warps onto its execution units; within a warp, each thread occupies one CUDA-core lane. The load/store units fetch the input data from global memory and write the results back. A hypothetical host-side launch is sketched below.
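
To make the mapping onto warps concrete, the fragment below is a hypothetical host-side launch for vectorAdd. The block size of 256 threads is an assumption chosen so that each block holds exactly 8 warps of 32 threads; d_a, d_b, and d_c are device buffers assumed to have been allocated and initialized elsewhere.

// Hypothetical launch configuration for Example 7.1.
int n = 1 << 20;                            // about one million elements (assumed)
int threadsPerBlock = 256;                  // 8 warps of 32 threads per block
int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;

// Each block runs entirely on one SM; its 8 warps become candidates for the
// warp scheduler, which interleaves them to hide memory latency.
vectorAdd<<<blocks, threadsPerBlock>>>(d_a, d_b, d_c, n);
cudaDeviceSynchronize();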

Warp Scheduling and Divergence Handling

Efficient warp scheduling is crucial for maximizing the performance of the SM. The warp scheduler is responsible for selecting warps that are ready to execute and dispatching them to the available execution units. The primary goal of the warp scheduler is to keep the execution units busy by ensuring that there are always warps available to execute.

The SM employs a two-level warp scheduling mechanism:

  1. Warp Scheduling: The warp scheduler selects warps that are ready to execute based on a scheduling policy, such as round-robin or oldest-first. The selected warps are then dispatched to the available execution units.

  2. Instruction Scheduling: Within each warp, instructions are issued in order as their operands become ready and the required execution units are free. In some designs, the SM can issue multiple independent instructions from the same warp in a single cycle to increase instruction-level parallelism.

Figure 7.3 illustrates the two-level warp scheduling mechanism.

    Warp Pool
    Warp 1 (Ready)
    Warp 2 (Waiting)
    Warp 3 (Ready)
    ...
    Warp N (Ready)
        |
        v
    Warp Scheduler
        |
        v
    Dispatch Unit
        |
        v
    Execution Units

Figure 7.3: Two-level warp scheduling mechanism.
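
To make the first scheduling level concrete, the host-side C++ sketch below models a "loose round-robin" pick of a ready warp from the warp pool. It is a conceptual model only, not a description of the actual hardware: the Warp structure, the is_ready test, and the policy itself are illustrative assumptions.

#include <vector>

// Conceptual model of one warp's scheduling state.
struct Warp {
    int  id;
    bool waiting_on_memory;   // stalled on an outstanding load
    bool has_decoded_inst;    // a decoded instruction is ready to issue
};

// A warp is ready if it has a decoded instruction and no unresolved stall.
static bool is_ready(const Warp &w) {
    return w.has_decoded_inst && !w.waiting_on_memory;
}

// Loose round-robin: starting after the warp issued last cycle, return the
// first ready warp, or -1 if every warp is stalled this cycle.
int select_warp(const std::vector<Warp> &pool, int last_issued) {
    int n = static_cast<int>(pool.size());
    for (int i = 1; i <= n; ++i) {
        int candidate = (last_issued + i) % n;
        if (is_ready(pool[candidate]))
            return candidate;
    }
    return -1;   // no ready warp: the issue slot goes unused this cycle
}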

One of the key challenges in warp scheduling is handling branch divergence. In the SIMT execution model, all threads in a warp execute the same instruction in lockstep. However, when a warp encounters a branch instruction (e.g., an if-else statement), some threads may take the if-path while others take the else-path. This situation is called branch divergence.

To handle branch divergence, the SM maintains an active mask for each warp, typically managed with a SIMT reconvergence stack. When a warp encounters a divergent branch, the hardware executes the two paths sequentially, with only the threads that take the current path enabled by the mask, and the warp reconverges at a common point after the branch. For short branches, the compiler may instead use predication, attaching a per-thread predicate to the instructions of both paths so that no actual control-flow divergence occurs.

Example 7.2 shows a CUDA kernel with a divergent branch.

__global__ void divergentKernel(int *data, int *result) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (data[tid] > 0) {
        result[tid] = data[tid] * 2;
    } else {
        result[tid] = data[tid] * 3;
    }
}

Example 7.2: CUDA kernel with a divergent branch.

In this example, the branch condition data[tid] > 0 may cause some threads in a warp to take the if-path while others take the else-path. The SM handles this divergence by executing both paths sequentially and masking off the inactive threads on each path. The sketch below shows how this per-warp divergence pattern can be observed from device code.
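
The active lanes on each side of the branch can be observed directly from device code. The sketch below is a hypothetical instrumented version of Example 7.2 that uses the __ballot_sync intrinsic, which returns a 32-bit mask with one bit per lane whose predicate is true; the ifMask output buffer (one word per warp) is an addition that does not appear in the original example, and the code assumes the block size is a multiple of 32 and that the grid exactly covers the data.

__global__ void divergentKernelInstrumented(int *data, int *result,
                                            unsigned *ifMask) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    bool takeIf = data[tid] > 0;

    // One bit per lane, set if that lane will take the if-path.
    unsigned mask = __ballot_sync(0xffffffff, takeIf);
    if (threadIdx.x % 32 == 0) {
        ifMask[tid / 32] = mask;   // record this warp's divergence pattern
    }

    if (takeIf) {
        result[tid] = data[tid] * 2;
    } else {
        result[tid] = data[tid] * 3;
    }
}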

Figure 7.4 illustrates this masked, serialized execution for a warp with divergent threads.

    Warp (32 threads)
    Thread 1: data[1] = 5, result[1] = 10
    Thread 2: data[2] = -3, result[2] = -9
    ...
    Thread 32: data[32] = 7, result[32] = 14

    Divergent Branch:
    if (data[tid] > 0) {
        result[tid] = data[tid] * 2;
    } else {
        result[tid] = data[tid] * 3;
    }

    Masked execution:
    Step 1: Execute if-path with mask
        Thread 1: result[1] = 10
        Thread 2: (masked off)
        ...
        Thread 32: result[32] = 14

    Step 2: Execute else-path with mask
        Thread 1: (masked off)
        Thread 2: result[2] = -9
        ...
        Thread 32: (masked off)

    Final Result:
    Thread 1: result[1] = 10
    Thread 2: result[2] = -9
    ...
    Thread 32: result[32] = 14

Figure 7.4: Masked execution of a warp with divergent threads.

By serializing the divergent paths under an active mask, the SM handles branch divergence without any special effort from the programmer. Divergence is not free, however: the warp spends cycles executing both paths, so its effective parallelism drops until the threads reconverge. When the divergent paths are short, a common mitigation is to restructure the code so that the compiler can emit a predicated select instead of a branch, as sketched below.
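
The variant below is a sketch of that idea applied to Example 7.2. Whether the compiler actually predicates the select is not guaranteed, but writing the two paths as a choice of multiplier gives it the opportunity to do so.

__global__ void branchlessKernel(int *data, int *result) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int x = data[tid];

    // Both "paths" reduce to a choice of multiplier, so the compiler can
    // lower this to a compare-and-select (predicated) sequence with no
    // divergent branch and no serialization of the warp.
    int factor = (x > 0) ? 2 : 3;
    result[tid] = x * factor;
}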

Register File and Operand Collectors

The register file is a critical component of the SM, providing fast access to thread-private registers. Each SM has a large register file to support the many active threads and enable efficient context switching between warps.

In the NVIDIA Volta architecture, each SM has a 256 KB register file (65,536 32-bit registers), partitioned into 64 KB per processing block and organized into multiple banks so that several operands can be read in parallel. The register file is designed to provide high-bandwidth, low-latency access to support the large number of concurrent threads.
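
Register file capacity also bounds how many warps can be resident, and therefore how much latency the scheduler can hide. As a back-of-envelope example, a kernel that uses 64 registers per thread consumes 64 x 4 bytes x 32 threads = 8 KB of register file per warp, so a 256 KB register file can hold at most 32 such warps. The hypothetical kernel below uses the __launch_bounds__ qualifier to ask the compiler to cap register usage so that a desired number of blocks fits on an SM; the specific numbers are illustrative only.

// Ask the compiler to keep register usage low enough that two blocks of
// 256 threads (16 warps) can be resident on one SM at the same time.
// Illustrative numbers; real tuning should be guided by the profiler.
__global__ void __launch_bounds__(256, 2)
scaleAndBias(float *out, const float *in, float a, float b, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        out[tid] = a * in[tid] + b;
    }
}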

To minimize bank conflicts and improve performance, the SM employs a technique called operand collection. Operand collectors are specialized units that gather operands from the register file banks and deliver them to the execution units. By using operand collectors, the SM can reduce the impact of bank conflicts and improve the utilization of the execution units.

Figure 7.5 shows a simplified diagram of the register file and operand collectors in an SM.

    Register File (256 KB)
    Bank 0
    Bank 1
    ...
    Bank N-1
        |
        v
    Operand Collectors
        |
        v
    Execution Units

Figure 7.5: Register file and operand collectors in an SM.

The operand collectors work by gathering operands from multiple instructions and multiple warps, allowing the SM to issue instructions from different warps to the execution units in a single cycle. This helps to hide the latency of register file accesses and improves the overall throughput of the SM.

Example 7.3 shows a CUDA kernel that performs a dot product of two vectors.

__global__ void dotProduct(float *a, float *b, float *result, int n) {
    // One partial sum per thread; assumes blockDim.x is a power of two
    // and no larger than 256.
    __shared__ float partialSum[256];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    partialSum[tid] = 0;

    // Grid-stride loop: each thread accumulates a strided subset of the
    // products into its own shared-memory slot.
    while (i < n) {
        partialSum[tid] += a[i] * b[i];
        i += blockDim.x * gridDim.x;
    }

    __syncthreads();

    // Tree reduction in shared memory: halve the number of active threads
    // each iteration until partialSum[0] holds the block's total.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            partialSum[tid] += partialSum[tid + s];
        }
        __syncthreads();
    }

    // Thread 0 writes one partial result per block; the host (or another
    // kernel) sums these per-block results to obtain the final dot product.
    if (tid == 0) {
        result[blockIdx.x] = partialSum[0];
    }
}

Example 7.3: CUDA kernel for a dot product using shared memory and a parallel reduction.

In this example, each thread computes a partial sum of the dot product using its assigned elements from the input vectors. The partial sums are stored in the shared memory array partialSum. After all threads have computed their partial sums, a parallel reduction is performed to sum up the partial sums and obtain the final dot product result.
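
A common variation on Example 7.3 finishes the reduction with warp shuffle instructions, which exchange register values directly between the lanes of a warp and therefore avoid both shared memory traffic and the associated __syncthreads() calls. The sketch below shows only that warp-level step, assuming each thread has already accumulated its value in val and that blockDim.x is a multiple of 32; it uses the __shfl_down_sync intrinsic introduced in CUDA 9.

// Reduce 'val' across the 32 lanes of a warp; after the loop, lane 0 of the
// warp holds the warp's total. This can replace the last five iterations of
// the shared-memory loop in Example 7.3.
__device__ float warpReduceSum(float val) {
    for (int offset = 16; offset > 0; offset >>= 1) {
        val += __shfl_down_sync(0xffffffff, val, offset);
    }
    return val;
}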

The operand collectors also play a role in this example: they gather the register operands for the arithmetic instructions and for the shared memory loads and stores across several warps, helping to avoid register-bank conflicts and keep the execution units busy.

Conclusion

The streaming multiprocessor is the core computational unit in modern GPU architectures. Its design focuses on maximizing throughput and hiding memory latency through a combination of fine-grained multithreading, SIMT execution, and efficient operand collection.

Key components of the SM include the warp scheduler, which selects warps for execution; the SIMT stack, which handles branch divergence and convergence; the register file and operand collectors, which provide fast access to thread-private registers; and the shared memory and L1 cache, which enable low-latency data sharing and reuse.

As GPU architectures continue to evolve, research in areas such as branch divergence handling, warp scheduling, and register file design will be crucial for improving the performance and efficiency of future GPUs. Novel techniques like dynamic warp formation, thread block compaction, and operand reuse caches have the potential to significantly enhance the capabilities of the SM and enable new levels of performance in parallel computing workloads.