
Chapter 11: GPU Research Directions on Scalarization and Affine Execution

As described in Chapter 2, GPU computing APIs, such as CUDA and OpenCL, feature a MIMD-like programming model that allows the programmer to launch a large array of scalar threads onto the GPU. While each of these scalar threads can follow its unique execution path and may access arbitrary memory locations, in the common case, they all follow a small set of execution paths and perform similar operations.

The convergent control-flow among GPU threads is exploited on most, if not all, modern GPUs via the SIMT execution model, where scalar threads are grouped into warps that run on SIMD hardware (see Section 3.1.1). This chapter summarizes a series of research proposals that further exploit the similarity of these scalar threads via scalarization and affine execution.

The key insight of this research lies in the observation of value structure [Kim et al., 2013] across threads executing the same compute kernel. The two types of value structure, uniform and affine, are illustrated in the compute kernel in Algorithm 11.1.

Uniform Variable A variable that has the same constant value for every thread in the compute kernel. In Algorithm 11.1, the variable a and the literals THRESHOLD and Y_MAX_VALUE all have uniform values across all threads in the compute kernel. A uniform variable can be stored in a single scalar register and reused by all threads in the compute kernel.

Affine Variable A variable with values that are a linear function of the thread ID for every thread in the compute kernel. In Algorithm 11.1, the memory address of the variable y[idx] can be represented as an affine transform of the thread ID threadIdx.x:

&(y[idx]) = &(y[0]) + sizeof(int) * threadIdx.x;

This affine representation can be stored as a pair of scalar values, a base and a stride, which is far more compact than the fully expanded vector.

#define THRESHOLD   100  // assumed value; defined outside the original excerpt
#define Y_MAX_VALUE 100  // assumed value; defined outside the original excerpt

__global__ void vsadd( int y[], int a ) {
    int idx = threadIdx.x;
    y[idx] = y[idx] + a;
    if ( y[idx] > THRESHOLD )
        y[idx] = Y_MAX_VALUE;
}

Algorithm 11.1: Example of scalar and affine operations in a compute kernel (from [Kim et al., 2013]).
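To make this value structure concrete, the following sketch expands the base-and-stride pair described above back into per-thread values. The AffineValue type and its field names are hypothetical, and the base address 0x1000 is made up; a uniform variable is simply the degenerate case with a stride of zero.

#include <cstdint>
#include <cstdio>

// Hypothetical compact representation of an affine value: two scalars
// stand in for what would otherwise be one value per thread in the warp.
struct AffineValue {
    int64_t base;    // value seen by thread 0
    int64_t stride;  // per-thread increment (0 for a uniform variable)

    // Recover the concrete value held by a given thread.
    int64_t expand(int threadId) const { return base + stride * threadId; }
};

int main() {
    // &(y[idx]) = &(y[0]) + sizeof(int) * threadIdx.x, assuming y[]
    // starts at the made-up address 0x1000.
    AffineValue addr{0x1000, sizeof(int)};
    for (int t = 0; t < 4; ++t)
        std::printf("thread %d -> 0x%llx\n", t,
                    (unsigned long long)addr.expand(t));
    return 0;
}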

There are multiple research proposals on how to detect and exploit uniform or affine variables in GPUs. The rest of this chapter summarizes these proposals along these two aspects: detection and exploitation.

Detection of Uniform or Affine Variables

There are two main approaches to detect the existence of uniform or affine variables in a GPU compute kernel: Compiler-Driven Detection and Detection via Hardware.

Compiler-Driven Detection

One way to detect uniform or affine variables in a GPU compute kernel is via a special compiler analysis. This is possible because the existing GPU programming models, CUDA and OpenCL, already provide means for the programmer to declare a variable as constant throughout the compute kernel, as well as a special variable for the thread ID. The compiler can perform a control-dependency analysis to detect variables that depend purely on constants and thread IDs, and mark them as uniform/affine. Operations that work solely on uniform/affine variables are then candidates for scalarization.
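The sketch below illustrates the flavor of such an analysis, assuming a three-level classification lattice (uniform, affine, divergent) and simple propagation rules for addition and multiplication; the names and rules are illustrative, not taken from any particular compiler.

#include <algorithm>

// Three-level lattice, ordered from least to most divergent.
enum class Kind { Uniform, Affine, Divergent };

// Join: combining two values yields the more divergent classification.
Kind join(Kind a, Kind b) { return std::max(a, b); }

// Addition preserves affine-ness: uniform + affine and affine + affine
// (the strides add) both remain linear in the thread ID.
Kind addKind(Kind a, Kind b) { return join(a, b); }

// Multiplication stays affine only if one operand is uniform;
// affine * affine is quadratic in the thread ID, hence divergent.
Kind mulKind(Kind a, Kind b) {
    return (a == Kind::Affine && b == Kind::Affine) ? Kind::Divergent
                                                    : join(a, b);
}

int main() {
    Kind tid = Kind::Affine;   // threadIdx.x: base 0, stride 1
    Kind a   = Kind::Uniform;  // kernel parameter, identical for all threads
    // &(y[idx]) = &(y[0]) + sizeof(int) * threadIdx.x
    Kind addrY = addKind(Kind::Uniform, mulKind(Kind::Uniform, tid));
    return (addrY == Kind::Affine && join(a, tid) == Kind::Affine) ? 0 : 1;
}

Under these rules, every address in Algorithm 11.1 classifies as affine and the operand a as uniform, which is exactly what makes the kernel a candidate for scalarization.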

AMD GCN [AMD, 2012] relies on the compiler to detect uniform variables and scalar operations that can be stored and processed by a dedicated scalar processor.
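As a software analogy for this split, the sketch below executes the vsadd kernel of Algorithm 11.1 one warp at a time: the uniform and affine quantities are computed once per warp, and only the per-lane work occupies the loop standing in for the SIMD unit. This is not actual GCN code; the four-lane warp width and constant values are assumptions for readability.

#include <cstdio>

constexpr int WARP_WIDTH  = 4;    // kept small for readability; GCN uses 64
constexpr int THRESHOLD   = 100;  // assumed values for the kernel's literals
constexpr int Y_MAX_VALUE = 100;

// Emulate one warp executing vsadd with a scalar/vector split.
void vsadd_warp(int y[], int a) {
    // Scalar work, done once per warp (scalar registers / scalar ALU):
    int* base   = y;  // uniform base address of the affine access &y[idx]
    int  stride = 1;  // uniform stride, in elements

    // Vector work, the only part that needs one execution per lane:
    for (int lane = 0; lane < WARP_WIDTH; ++lane) {
        int* addr = base + stride * lane;  // expand the affine address
        *addr += a;                        // a lives in a scalar register
        if (*addr > THRESHOLD) *addr = Y_MAX_VALUE;
    }
}

int main() {
    int y[WARP_WIDTH] = {50, 90, 120, 30};
    vsadd_warp(y, 20);
    for (int lane = 0; lane < WARP_WIDTH; ++lane)
        std::printf("y[%d] = %d\n", lane, y[lane]);
    return 0;
}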

Asanovic et al. [2013] introduce a combined convergent and variant analysis that allows the compiler to determine operations in an arbitrary compute kernel that are eligible for scalarization and/or affine transformation. Instructions within the convergent regions of a compute kernel can be converted into scalar/affine instructions. At any transition from divergent to convergent regions of a compute kernel, the compiler inserts a syncwarp instruction to handle control-flow induced register dependencies between the two regions. Asanovic et al. [2013] adopt this analysis to generate scalar operations for the Temporal-SIMT architecture [Keckler et al., 2011, Krashinsky, 2011].

Decoupled Affine Computation (DAC) [Wang and Lin, 2017] relies on a similar compiler analysis to extract scalar and affine candidates to be decoupled into a separate warp. Wang and Lin [2017] augment the process with a divergent affine analysis, with the goal of extracting strands of instructions