
Chapter 10: Reliability and Fault Tolerance in GPU Design

As GPUs become increasingly prevalent in safety-critical applications such as autonomous vehicles, robotics, and medical devices, ensuring their reliability and fault tolerance becomes paramount. GPUs are susceptible to various types of faults and errors that can lead to system failures, data corruption, or incorrect results. In this chapter, we will explore the types of faults and errors in GPUs, error detection and correction schemes, checkpoint and recovery mechanisms, and design principles for reliability and resilience.

Types of Faults and Errors in GPUs

Faults and errors in GPUs can be classified into several categories based on their origin, duration, and impact on the system. Understanding these different types of faults and errors is crucial for developing effective mitigation strategies.

Soft Errors

Soft errors, also known as transient faults, are temporary errors caused by external factors such as cosmic rays, alpha particles, or electromagnetic interference. These errors do not cause permanent damage to the hardware and can be corrected by rewriting the affected data or restarting the affected operation.

Soft errors can manifest in various parts of the GPU, such as:

  1. Flip-flops and latches: A single event upset (SEU) can cause the state of a flip-flop or latch to change, leading to incorrect data or control flow.

  2. SRAM cells: Soft errors in SRAM cells, such as those used in caches and register files, can corrupt the stored data.

  3. DRAM cells: Although less common than SRAM soft errors, DRAM cells can also experience bit flips due to external factors.

Figure 10.1 illustrates the impact of a soft error on a flip-flop.

           Cosmic Ray
               |
               |
               v
        +------------+
        |            |
D ----->|  Flip-Flop |----> Q
        |            |
        +------------+
               |
               |
               v
           Soft Error

Figure 10.1: Soft error caused by a cosmic ray striking a flip-flop.

Hard Errors

Hard errors, also known as permanent faults, are irreversible physical defects in the hardware that persist over time. These errors can be caused by manufacturing defects, wear-out, or physical damage to the device.

Examples of hard errors in GPUs include:

  1. Stuck-at faults: A signal or storage element is permanently stuck at a logical '0' or '1' value, regardless of the input.

  2. Bridging faults: Two or more signal lines are accidentally connected, causing a short circuit.

  3. Open faults: A signal line is accidentally disconnected, causing a floating or indeterminate value.

  4. Delay faults: A signal takes longer than expected to propagate through a path, leading to timing violations.

Figure 10.2 shows an example of a stuck-at fault in a logic gate.

        Stuck-at-0 Fault
               |
               |
               v
           +---+
        -->| & |-->
           |   |
        -->|   |
           +---+

Figure 10.2: Stuck-at-0 fault in an AND gate.

Intermittent Errors

Intermittent errors are faults that occur sporadically and are difficult to reproduce consistently. These errors can be caused by various factors, such as:

  1. Marginal hardware: Components that are operating close to their specified limits, making them more susceptible to environmental factors or aging.

  2. Environmental factors: Temperature fluctuations, voltage variations, or electromagnetic interference can trigger intermittent errors.

  3. Aging effects: As the device ages, certain components may become more prone to intermittent failures due to wear-out or degradation.

Intermittent errors pose a significant challenge for error detection and correction, as they may not be captured by traditional testing or monitoring techniques.

Silent Data Corruption

Silent data corruption (SDC) refers to errors that corrupt data without being detected by the hardware or software. SDC can lead to incorrect results or system failures that may go unnoticed for an extended period.

Examples of SDC in GPUs include:

  1. Arithmetic errors: Faults in arithmetic units, such as adders or multipliers, can produce incorrect results without raising any error flags.

  2. Memory errors: Soft errors or hard faults in memory cells can corrupt data without being detected by error checking mechanisms.

  3. Control flow errors: Faults in control logic or instruction decoders can cause the program to deviate from its intended execution path without triggering any exceptions.

SDC is particularly dangerous because it can propagate through the system and affect the final output without any visible symptoms. Detecting and mitigating SDC requires a combination of hardware and software techniques.

Error Detection and Correction Schemes

To mitigate the impact of faults and errors in GPUs, various error detection and correction schemes have been developed. These schemes aim to identify the presence of errors and, in some cases, correct them to ensure the correct operation of the system.

Parity Checking

Parity checking is a simple error detection technique that adds an extra bit (the parity bit) to each data word to make the total number of '1' bits either even (even parity) or odd (odd parity). By checking the parity of the data word, single-bit errors can be detected.

Figure 10.3 illustrates an example of even parity checking.

    Data Word:   1011010
    Parity Bit:        0    (the data word has four '1' bits, so the parity bit is 0 to keep the total even)
    Transmitted: 10110100

    Received:    10110110   (one bit flipped in transit)
    Parity Check: five '1' bits, which is odd
    Error Detected!

Figure 10.3: Even parity checking for error detection.

Parity checking can be applied to various components in the GPU, such as registers, caches, and memory interfaces. However, parity checking can only detect an odd number of bit errors and cannot correct them.
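
The parity computation itself is just an XOR reduction over the data bits. The following Python sketch (purely illustrative, not tied to any particular GPU hardware) reproduces the even-parity example of Figure 10.3:

    def parity_bit(bits):
        # Even parity: choose the parity bit so the total number of '1' bits is even.
        p = 0
        for b in bits:
            p ^= b
        return p

    def passes_even_parity(word_with_parity):
        # A word (data bits plus parity bit) passes if it contains an even number of '1' bits.
        total = 0
        for b in word_with_parity:
            total ^= b
        return total == 0

    data = [1, 0, 1, 1, 0, 1, 0]
    tx = data + [parity_bit(data)]   # transmitted word: 10110100
    rx = tx.copy()
    rx[6] ^= 1                       # single-bit upset in transit: 10110110
    print(passes_even_parity(tx))    # True
    print(passes_even_parity(rx))    # False -> error detected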

Error Correcting Codes (ECC)

Error Correcting Codes (ECC) are more advanced error detection and correction schemes that can not only detect errors but also correct them. ECC works by adding redundant bits to the data word, which allows the receiver to identify and correct a limited number of bit errors.

One common ECC scheme is the Single Error Correction, Double Error Detection (SECDED) code, which can correct single-bit errors and detect double-bit errors. SECDED codes are often used in memory systems, such as DRAM and caches, to protect against soft errors.

Figure 10.4 shows an example of a SECDED code.

    Data Word: 1011010
    ECC Bits:    01101
    Transmitted: 101101001101

    Received:   101101011101
                       ^
                       |
                   Bit Error

    Corrected:  101101001101

Figure 10.4: SECDED code for error correction and detection.

Other ECC schemes, such as Bose-Chaudhuri-Hocquenghem (BCH) codes and Reed-Solomon codes, can correct multiple bit errors at the cost of higher redundancy and complexity.
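
To make the SECDED mechanism concrete, the following Python sketch implements an extended Hamming (8,4) code: four data bits, three Hamming check bits, and one overall parity bit. It is a textbook illustration of the single-error-correct, double-error-detect behavior (deliberately smaller than the 7-bit example in Figure 10.4), not the specific code used by any GPU memory controller:

    def secded_encode(d1, d2, d3, d4):
        # Extended Hamming (8,4): three check bits plus one overall parity bit.
        p1 = d1 ^ d2 ^ d4
        p2 = d1 ^ d3 ^ d4
        p3 = d2 ^ d3 ^ d4
        code = [p1, p2, d1, p3, d2, d3, d4]         # Hamming positions 1..7
        p0 = 0
        for b in code:
            p0 ^= b                                 # overall parity over positions 1..7
        return [p0] + code

    def secded_decode(word):
        p0, code = word[0], list(word[1:])
        # Each syndrome bit checks the positions whose index has that bit set.
        s1 = code[0] ^ code[2] ^ code[4] ^ code[6]  # positions 1, 3, 5, 7
        s2 = code[1] ^ code[2] ^ code[5] ^ code[6]  # positions 2, 3, 6, 7
        s3 = code[3] ^ code[4] ^ code[5] ^ code[6]  # positions 4, 5, 6, 7
        syndrome = s1 + 2 * s2 + 4 * s3
        overall = p0
        for b in code:
            overall ^= b                            # 0 if overall parity still holds
        if syndrome == 0 and overall == 0:
            status = "no error"
        elif overall == 1:
            status = "single error corrected"
            if syndrome:
                code[syndrome - 1] ^= 1             # flip the faulty bit
            # syndrome == 0 here means the overall parity bit itself was hit
        else:
            status = "double error detected (uncorrectable)"
        return [code[2], code[4], code[5], code[6]], status

    word = secded_encode(1, 0, 1, 1)
    word[5] ^= 1                                    # inject a single-bit error
    print(secded_decode(word))                      # ([1, 0, 1, 1], 'single error corrected')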

Redundant Execution

Redundant execution is a technique that performs the same computation multiple times, either on the same hardware or on different hardware units, and compares the results to detect errors. If the results do not match, an error is detected, and the system can take appropriate actions, such as retrying the computation or initiating a recovery process.

Redundant execution can be implemented at various levels in the GPU:

  1. Instruction-level redundancy: Each instruction is executed multiple times, and the results are compared before committing to the register file or memory.

  2. Thread-level redundancy: Multiple threads perform the same computation, and their results are compared to detect errors.

  3. Kernel-level redundancy: The entire kernel is executed multiple times, and the final outputs are compared to detect errors.

Figure 10.5 illustrates thread-level redundancy in a GPU.

    Thread 0   Thread 1   Thread 2   Thread 3
       |          |          |          |
       v          v          v          v
    +-------+  +-------+  +-------+  +-------+
    | Comp. |  | Comp. |  | Comp. |  | Comp. |
    +-------+  +-------+  +-------+  +-------+
       |          |          |          |
       v          v          v          v
    +------------+------------+------------+
    |              Comparator              |
    +------------+------------+------------+
                 |
                 v
            Error Detection

Figure 10.5: Thread-level redundancy for error detection.

Redundant execution can detect a wide range of errors, including soft errors, hard faults, and SDC. However, it comes at the cost of increased execution time and energy consumption.
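
As a software analogue of this idea, the sketch below runs the same computation twice and compares the results, retrying on a mismatch. It is plain Python with a hypothetical flaky_square function standing in for a GPU kernel launch, not a real GPU runtime API:

    import random

    def redundant_execute(kernel, *args, max_retries=3):
        # Run the computation twice and compare; on a mismatch, retry the pair.
        for _ in range(max_retries):
            r1 = kernel(*args)
            r2 = kernel(*args)
            if r1 == r2:
                return r1                      # results agree, accept them
        raise RuntimeError("redundant executions kept disagreeing")

    # A deliberately flaky stand-in for a GPU kernel (hypothetical).
    def flaky_square(x):
        return x * x if random.random() > 0.05 else x * x + 1

    print(redundant_execute(flaky_square, 7))  # 49, with overwhelming probability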

Watchdog Timers

Watchdog timers are hardware or software mechanisms that monitor the execution of the GPU and detect if the system becomes unresponsive or fails to complete a task within a specified time limit. If the watchdog timer expires, it indicates an error, and the system can initiate a recovery process, such as resetting the GPU or restarting the affected operation.

Watchdog timers can be implemented at various levels in the GPU:

  1. Kernel-level watchdog: Monitors the execution time of each kernel and detects if a kernel fails to complete within a specified time limit.

  2. Thread-level watchdog: Monitors the execution time of each thread and detects if a thread fails to complete within a specified time limit.
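
At the software level, a kernel-level watchdog can be approximated with an ordinary timer thread. The sketch below is a generic Python illustration; the task, timeout, and recovery callback are placeholders rather than a real GPU driver interface:

    import threading

    def run_with_watchdog(task, timeout_s, on_timeout):
        # If `task` does not finish within `timeout_s` seconds, fire the
        # recovery callback (for example, reset the device or relaunch the kernel).
        done = threading.Event()

        def watchdog():
            if not done.wait(timeout_s):
                on_timeout()

        threading.Thread(target=watchdog, daemon=True).start()
        try:
            return task()
        finally:
            done.set()        # stop the watchdog once the task completes

    result = run_with_watchdog(
        task=lambda: sum(range(1_000_000)),
        timeout_s=5.0,
        on_timeout=lambda: print("watchdog expired: initiating recovery"))
    print(result)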

Checkpoint and Recovery Mechanisms

Checkpoint and recovery mechanisms are used to save the state of a GPU application at regular intervals and restore the state in case of a failure. By periodically saving the state of the application, the system can recover from failures without having to restart the entire computation from the beginning.

Checkpoint and recovery mechanisms can be implemented at different levels in the GPU:

  1. Application-level checkpointing: The application itself is responsible for saving its state at regular intervals. This can be done by explicitly saving the contents of memory and registers to a checkpoint file.

  2. System-level checkpointing: The GPU runtime system or driver is responsible for saving the state of the application. This can be done transparently to the application, without requiring any modifications to the application code.

  3. Hardware-level checkpointing: The GPU hardware itself provides support for saving and restoring the state of the application. This can be done using dedicated hardware mechanisms, such as non-volatile memory or special-purpose registers.

Figure 10.8 illustrates a typical checkpoint and recovery process.

    Normal Execution
          |
          |
          v
      Checkpoint
          |
          |
          v
    Normal Execution
          |
          |
          v
       Failure
          |
          |
          v
        Restore
          |
          |
          v
    Normal Execution

Figure 10.8: Checkpoint and recovery process.

During normal execution, the system periodically saves the state of the application to a checkpoint. If a failure occurs, the system restores the state from the most recent checkpoint and resumes execution from that point.

Checkpoint and recovery mechanisms can help improve the reliability and resilience of GPU applications, especially for long-running computations. However, they also introduce overhead in terms of storage space and execution time, as saving and restoring state requires additional resources.
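
The following Python sketch illustrates application-level checkpointing for an iterative computation. The step function, checkpoint path, and interval are illustrative placeholders; a real GPU application would also need to copy device memory back to the host (or to non-volatile storage) as part of each checkpoint:

    import os
    import pickle

    def run_with_checkpoints(step_fn, state, total_steps,
                             ckpt_path="state.ckpt", interval=100):
        # Resume from the most recent checkpoint if one exists.
        start = 0
        if os.path.exists(ckpt_path):
            with open(ckpt_path, "rb") as f:
                start, state = pickle.load(f)
        for step in range(start, total_steps):
            state = step_fn(state, step)
            if (step + 1) % interval == 0:
                # Save (next step to run, state) so a crash loses at most
                # `interval` steps of work.
                with open(ckpt_path, "wb") as f:
                    pickle.dump((step + 1, state), f)
        return state

    # Example: accumulate a running sum, checkpointing every 100 steps.
    print(run_with_checkpoints(lambda s, i: s + i, 0, 1000))   # 499500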

Designing for Reliability and Resilience

Designing GPUs for reliability and resilience involves a combination of hardware and software techniques. Some key design principles and techniques include:

  1. Error detection and correction: Incorporating error detection and correction mechanisms, such as ECC and parity checking, at various levels in the GPU, including memories, caches, and interconnects.

  2. Redundancy: Using redundant hardware components, such as spare cores or memory modules, to provide fault tolerance and enable graceful degradation in the presence of failures.

  3. Checkpoint and recovery: Implementing checkpoint and recovery mechanisms to save the state of the application and enable recovery from failures.

  4. Fault containment: Designing the GPU architecture to limit the propagation of errors and prevent faults from spreading across the system. This can be achieved through techniques such as partitioning, isolation, and error containment barriers.

  5. Software resilience: Developing software techniques, such as algorithm-based fault tolerance (ABFT), that enable applications to detect and recover from errors through software-level redundancy and checking; a minimal ABFT sketch appears after this list.

  6. Reliability-aware scheduling: Adapting the scheduling of tasks and resources in the GPU to account for the reliability characteristics of different components and optimize for both performance and reliability.
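
As a concrete illustration of point 5, the classic checksum-based ABFT scheme for matrix multiplication (in the spirit of Huang and Abraham's work) verifies that the column sums of the product equal the column sums of A multiplied by B. The Python sketch below is a deliberately simplified version of that idea:

    def matmul(A, B):
        # Plain O(n^3) matrix multiply on nested lists.
        return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
                 for j in range(len(B[0]))] for i in range(len(A))]

    def abft_check(A, B, C):
        # Checksum test for C = A x B: the column sums of C must equal
        # (column sums of A) multiplied by B. A mismatch flags corruption in C.
        col_sum_A = [sum(row[k] for row in A) for k in range(len(B))]
        expected = [sum(col_sum_A[k] * B[k][j] for k in range(len(B)))
                    for j in range(len(B[0]))]
        actual = [sum(row[j] for row in C) for j in range(len(C[0]))]
        return expected == actual

    A = [[1, 2], [3, 4]]
    B = [[5, 6], [7, 8]]
    C = matmul(A, B)
    print(abft_check(A, B, C))    # True
    C[0][0] += 1                  # inject a silent data corruption
    print(abft_check(A, B, C))    # False -> corruption detected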

Example: Reliability-aware scheduling in a GPU

Consider a GPU with multiple cores, where some cores are known to be more prone to errors than others. A reliability-aware scheduler can assign critical tasks or tasks with high reliability requirements to the more reliable cores, while assigning less critical tasks to the less reliable cores.

Figure 10.9 illustrates a reliability-aware scheduling approach.

    Task Queue
    +-------+
    | Task1 |
    | Task2 |
    | Task3 |
    | Task4 |
    +-------+
        |
        |
        v
    Reliability-Aware Scheduler
        |
        |
        v
    +--------+--------+
    | Core 1 | Core 2 |
    |  (HR)  |  (LR)  |
    +--------+--------+
    | Task1  | Task3  |
    | Task2  | Task4  |
    +--------+--------+

Figure 10.9: Reliability-aware scheduling in a GPU (HR: High Reliability, LR: Low Reliability).

In this example, the scheduler assigns Task1 and Task2, which have high reliability requirements, to Core 1, which is known to be more reliable. Task3 and Task4, which have lower reliability requirements, are assigned to Core 2, which is less reliable.
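
A minimal scheduling policy along these lines can be sketched in a few lines of Python. The core and task descriptions below are hypothetical, and the policy is deliberately simple: critical tasks round-robin over high-reliability cores, the rest over low-reliability cores:

    from itertools import cycle

    def reliability_aware_schedule(tasks, cores):
        # tasks: list of (task_name, needs_high_reliability)
        # cores: list of (core_name, is_high_reliability)
        hi = cycle([c for c, hr in cores if hr] or [c for c, _ in cores])
        lo = cycle([c for c, hr in cores if not hr] or [c for c, _ in cores])
        assignment = {c: [] for c, _ in cores}
        for task, critical in tasks:
            assignment[next(hi) if critical else next(lo)].append(task)
        return assignment

    # Reproduces the assignment in Figure 10.9.
    print(reliability_aware_schedule(
        [("Task1", True), ("Task2", True), ("Task3", False), ("Task4", False)],
        [("Core1", True), ("Core2", False)]))
    # {'Core1': ['Task1', 'Task2'], 'Core2': ['Task3', 'Task4']}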

Conclusion

Reliability and fault tolerance are critical aspects of GPU design and operation, especially as GPUs are increasingly used in safety-critical applications. Understanding the types of faults and errors that can occur in GPUs, as well as the techniques for detecting, correcting, and recovering from these faults, is essential for designing reliable and resilient GPU systems.

Error detection and correction schemes, such as ECC and parity checking, play a crucial role in identifying and mitigating soft errors and hard faults in various components of the GPU. Checkpoint and recovery mechanisms enable the system to save the state of the application and recover from failures, improving the overall resilience of the system.

Designing GPUs for reliability and resilience involves a holistic approach that combines hardware and software techniques. Redundancy, fault containment, software resilience, and reliability-aware scheduling are some of the key techniques that can be employed to improve the reliability and fault tolerance of GPUs.

As GPUs continue to evolve and find new applications in domains such as autonomous vehicles, robotics, and healthcare, ensuring their reliability and resilience will become increasingly important. Novel techniques for error detection and correction, checkpoint and recovery, and reliability-aware resource management will be essential for enabling the next generation of reliable and fault-tolerant GPUs.