How to Design GPU Chips

Chapter 1: Introduction to GPU Chip Design

What are GPUs and how do they differ from CPUs

Graphics Processing Units (GPUs) are specialized electronic circuits designed to rapidly manipulate and alter memory to accelerate the creation of images in a frame buffer intended for output to a display device. GPUs were originally developed to offload 2D and 3D graphics rendering from the CPU, enabling much higher performance for graphics-intensive applications like video games.

While CPUs are designed for general-purpose computing and feature complex control logic to support a wide variety of programs, GPUs have a highly parallel architecture consisting of thousands of smaller, more efficient cores designed for handling multiple tasks simultaneously. This makes them ideal for graphics rendering, where large blocks of data can be processed in parallel.

Key architectural differences between CPUs and GPUs include:

  • Core count: GPUs have a large number of small cores (hundreds to thousands), while CPUs have a few large, powerful cores (2-64).
  • Cache hierarchy: CPUs devote large caches to reducing latency, while GPUs have smaller caches and instead hide memory latency through massive multithreading and high bandwidth.
  • Control logic: CPUs have complex branch prediction and out-of-order execution capabilities. GPUs have much simpler control logic.
  • Instruction set: CPUs support a wide variety of instructions for general-purpose computing. GPU instruction sets are narrower and optimized for data-parallel arithmetic and graphics operations.
  • Memory bandwidth: GPUs have very high memory bandwidth (hundreds of GB/s, and beyond 1 TB/s with HBM) to feed their many cores. CPUs have much lower bandwidth (typically 50-100 GB/s).
  • Floating-point performance: GPUs are capable of much higher floating-point performance, making them suitable for HPC and AI workloads.

In summary, the highly parallel architecture of GPUs allows them to excel at tasks that involve processing large blocks of data in parallel, while the more sophisticated control logic of CPUs makes them better suited for tasks that heavily rely on complex branching and control flow. Modern GPUs have evolved to become very programmable and are used for much more than just graphics today.
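
To make the contrast concrete, here is a minimal sketch comparing a serial CPU loop with an equivalent CUDA kernel for element-wise vector addition. The function and variable names are illustrative rather than taken from any particular codebase, but the structure shows how GPU work is expressed: instead of one core iterating over the array, thousands of lightweight threads are launched and each handles a single element.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// CPU version: a single core walks the array serially.
void add_cpu(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}

// GPU version: thousands of lightweight threads each handle one element.
__global__ void add_gpu(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float *a, *b, *c;
    cudaMallocManaged(&a, bytes);   // unified memory keeps the example short
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;  // enough blocks to cover every element
    add_gpu<<<blocks, threads>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);    // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

The `<<<blocks, threads>>>` launch configuration is what exposes the parallelism: 4096 blocks of 256 threads map one thread to each of the roughly one million array elements, and the hardware schedules those threads across all available cores.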

Key applications and importance of GPUs

Over the past two decades, GPUs have become one of the most important types of computing technology, as their highly parallel structure makes them more efficient than general-purpose CPUs for algorithms that process large blocks of data in parallel. Some of the key application areas that have driven the rapid advancement of GPU technology include:

Computer graphics and gaming

The most common use of GPUs remains accelerating the creation of images in a frame buffer for output to a display device. Real-time 3D rendering applies the same shading and geometry computations to millions of pixels and vertices every frame, a workload that maps naturally onto the GPU's parallel cores. GPUs are a standard component in modern gaming consoles and gaming PCs.

High performance computing (HPC)

The parallel processing capabilities of GPUs make them well-suited for scientific computing applications that involve processing very large datasets with parallel algorithms. GPUs have been widely adopted in supercomputers and HPC clusters, where they work alongside CPUs to accelerate highly parallel workloads like weather forecasting, molecular dynamics simulations, and seismic analysis.

Artificial intelligence and machine learning

The parallel processing power of GPUs has been instrumental in the rapid advancement of deep learning and AI in recent years. Training complex deep neural networks requires an enormous amount of computing power, and GPUs have become the platform of choice for training large-scale AI models due to their ability to efficiently perform the matrix multiplication operations at the heart of deep learning algorithms. All major cloud AI platforms and supercomputers used for AI research today rely heavily on GPUs.

Cryptocurrency mining

GPUs have also been widely used for cryptocurrency mining, as their parallel processing capabilities suit the proof-of-work hashing algorithms used by cryptocurrencies such as Ethereum (Bitcoin mining, by contrast, moved early on to dedicated ASICs). High-end GPUs from AMD and NVIDIA were in very high demand during the cryptocurrency boom of 2017-2018.

Accelerated computing and edge AI

With the slowing of Moore's Law, there has been a major trend towards accelerated, heterogeneous computing, with specialized accelerator chips like GPUs working alongside CPUs to speed up demanding workloads. GPUs are also being used to bring AI capabilities to edge devices like smartphones, smart speakers, and automotive systems. Mobile SoCs now commonly feature integrated GPUs that are used for both graphics and accelerating AI workloads.

The massive parallelism and high memory bandwidth of GPUs have made them one of the most important computing platforms today, with applications stretching far beyond computer graphics. As we hit the limits of general-purpose processors, specialized chips like GPUs, FPGAs, and AI accelerators are becoming increasingly important computing engines of the future.

The landscape of computation accelerators

As the performance improvements from general-purpose CPUs have slowed down in recent years, there has been an increasing trend towards specialized accelerator chips that can speed up specific workloads. GPUs are one of the most prominent examples of accelerators, but there are several other important categories:

Field Programmable Gate Arrays (FPGAs)

FPGAs are semiconductor devices built around a matrix of configurable logic blocks (CLBs) connected via programmable interconnects. FPGAs can be reprogrammed after manufacturing to suit a desired application or functionality, offering a more flexible alternative to ASICs. They are commonly used in aerospace and defense, ASIC prototyping, medical imaging, computer vision, speech recognition, and cryptography.

Application-Specific Integrated Circuits (ASICs)

ASICs are integrated circuits customized for a particular use, rather than intended for general-purpose use like CPUs. Modern ASICs often include entire 32-bit or 64-bit processor cores, memory blocks (ROM, RAM, EEPROM, flash), and other large building blocks. ASICs are commonly used in Bitcoin mining, AI accelerators, 5G wireless communication, and IoT devices.

AI Accelerators

AI accelerators are specialized chips designed to speed up AI workloads, particularly neural network training and inference. Examples include Google's Tensor Processing Units (TPUs), Intel's Nervana Neural Network Processors (NNPs), and chips from a number of startups building AI silicon from scratch. These chips leverage reduced-precision math, efficient matrix multiplication circuits, and close integration of compute and memory to achieve much higher performance-per-watt than GPUs or CPUs on AI workloads.

Vision Processing Units (VPUs)

VPUs are specialized chips designed for accelerating computer vision and image processing workloads. They often include dedicated hardware for tasks like image signal processing, stereo vision, and CNN-based object detection. VPUs are commonly used in applications like automotive ADAS, drones, AR/VR headsets, smart cameras, and other edge devices that require low-latency visual processing.

Neuromorphic and Quantum Chips

Looking further out, neuromorphic chips attempt to mimic the brain's architecture to deliver fast and energy efficient neural network performance, while quantum chips leverage quantum mechanical effects to solve certain problems faster than classical computers. These are still emerging research areas but could become important accelerators in the future.

The overall trend in computing is towards domain-specific architectures and a diversity of accelerators being integrated alongside general-purpose CPUs to speed up important workloads. GPUs pioneered this accelerated computing model and remain one of the most important types of accelerators, but a wide variety of other accelerators are also becoming increasingly crucial across many application domains.

GPU hardware basics

A modern GPU is composed of several key hardware components:

Streaming Multiprocessors (SMs)

The SM is the basic building block of NVIDIA GPU architectures. Each SM contains a set of CUDA cores (typically 64 to 128) that share control logic and an instruction cache. Each CUDA core has a fully pipelined integer arithmetic logic unit (ALU) and floating-point unit (FPU). A GPU chip typically has anywhere from 16 to 128 SMs, resulting in thousands of CUDA cores.
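
A quick way to see these parameters on real hardware is to query the CUDA runtime, as in the host-side sketch below. The fields shown are standard members of `cudaDeviceProp`; note that the number of CUDA cores per SM is architecture-dependent and is not reported directly.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);               // query device 0

    printf("GPU:                   %s\n", prop.name);
    printf("SMs:                   %d\n", prop.multiProcessorCount);
    printf("Max threads per SM:    %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Shared memory per SM:  %zu KiB\n", prop.sharedMemPerMultiprocessor / 1024);
    printf("Warp size:             %d\n", prop.warpSize);
    return 0;
}
```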

Texture/L1 Cache

Each SM has a dedicated texture cache and L1 cache to improve performance and reduce memory traffic. The texture cache is designed to optimize spatial locality and is particularly effective for graphics workloads. The L1 cache handles memory operations (load, store) and provides fast data access with low latency.
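
As a small illustration, CUDA lets loads be routed through the read-only (texture/L1) path using the `__ldg()` intrinsic or `const __restrict__` pointer qualifiers. The gather kernel below is a hedged sketch of that idea; the kernel name and arguments are illustrative.

```cuda
// Gather from a lookup table through the read-only (texture/L1) cache path.
// __ldg() and const __restrict__ pointers tell the compiler the data is
// read-only for the lifetime of the kernel, so it can use the cached path.
__global__ void gather(const float *__restrict__ table,
                       const int   *__restrict__ indices,
                       float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __ldg(&table[indices[i]]);   // cached, read-only load
}
```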

Shared Memory

Shared memory is a fast, on-chip memory that is shared among the CUDA cores within an SM. It can be used as a programmable cache, enabling higher bandwidth and lower latency access to frequently reused data. Shared memory is divided into equally-sized memory modules (banks) that can be accessed simultaneously by the cores.
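
A common use of shared memory is staging data for block-wide cooperation. The sketch below (names are illustrative, and it assumes a launch with 256 threads per block) reduces the elements handled by one thread block to a single partial sum through a shared-memory tile, with `__syncthreads()` barriers making each stage of the tree reduction visible to every thread in the block.

```cuda
// Block-wide sum reduction staged through shared memory.
// Assumes each block is launched with 256 threads.
__global__ void block_sum(const float *in, float *out, int n) {
    __shared__ float tile[256];                // fast on-chip memory, one tile per block

    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (gid < n) ? in[gid] : 0.0f;
    __syncthreads();                           // make the whole tile visible to the block

    // Tree reduction: consecutive threads touch consecutive addresses,
    // so accesses fall in different banks and avoid conflicts.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        out[blockIdx.x] = tile[0];             // one partial sum per block
}
```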

Register File

Each SM has a large register file that provides low-latency storage for operands. The register file is divided among the resident threads on an SM, providing each thread with its own dedicated set of registers. Accessing a register typically takes zero extra clock cycles per instruction, but delays may occur due to register read-after-write dependencies and register memory bank conflicts.

Warp Scheduler

The warp scheduler is responsible for managing and scheduling warps on an SM. A warp is a group of 32 threads that execute the same instruction in lockstep (the SIMT execution model) on the CUDA cores. The warp scheduler selects warps that are ready to execute and dispatches them to the cores, enabling high utilization and latency hiding.
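
Warps are also visible to the programmer through warp-level primitives. The sketch below uses the `__shfl_down_sync()` intrinsic to sum a value across the 32 lanes of a warp entirely in registers, with no shared memory or block-wide barrier; the helper name is illustrative.

```cuda
// Warp-level sum: the 32 lanes of a warp exchange register values directly
// with shuffle instructions, so no shared memory or __syncthreads() is needed.
__device__ float warp_sum(float val) {
    // 0xffffffff = all 32 lanes participate.
    for (int offset = warpSize / 2; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;                                // lane 0 ends up holding the warp's total
}
```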

Interconnect Network

The interconnect network connects the SMs to the GPU's shared L2 cache and memory controllers. It is typically implemented as a crossbar switch that allows multiple SMs to access the L2 cache and DRAM simultaneously.

Memory Controllers

The memory controllers handle all read and write requests to the GPU's DRAM. They are responsible for optimizing DRAM access patterns to maximize bandwidth utilization. Modern GPUs have very wide DRAM interfaces (256-bit to 4096-bit) and support high-bandwidth memory technologies like GDDR6 and HBM2.
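
A rough estimate of a device's peak DRAM bandwidth can be derived from the memory clock and bus width reported by the runtime, as in the sketch below. The factor of two assumes a double-data-rate interface; the effective multiplier varies with the memory technology, so treat the result as an approximation.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Peak bandwidth ~ memory clock (Hz) x bus width (bytes) x 2 transfers per clock (DDR).
    // memoryClockRate is reported in kHz, memoryBusWidth in bits.
    double peak_gb_s = 2.0 * prop.memoryClockRate * 1e3
                     * (prop.memoryBusWidth / 8.0) / 1e9;

    printf("%s: %d-bit interface, ~%.0f GB/s peak\n",
           prop.name, prop.memoryBusWidth, peak_gb_s);
    return 0;
}
```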

RT Cores and Tensor Cores

Modern NVIDIA GPUs also include specialized hardware units for accelerating ray tracing (RT Cores) and AI/deep learning (Tensor Cores). RT Cores accelerate bounding volume hierarchy (BVH) traversal and ray-triangle intersection tests, while Tensor Cores provide high-throughput matrix multiplication and convolution operations.
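
Tensor Cores are exposed in CUDA through the warp-level WMMA API, in which a warp cooperatively computes a small matrix tile. The sketch below multiplies a single 16x16 FP16 tile with FP32 accumulation; it requires a GPU with compute capability 7.0 or higher and is intended only to show the shape of the API, not a production GEMM.

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes a single 16x16 tile: D = A x B (FP16 inputs, FP32 accumulation).
// Compile for compute capability 7.0+ (e.g. -arch=sm_70).
__global__ void wmma_tile(const half *a, const half *b, float *d) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(acc, 0.0f);
    wmma::load_matrix_sync(a_frag, a, 16);     // leading dimension = 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(acc, a_frag, b_frag, acc);  // the Tensor Core operation
    wmma::store_matrix_sync(d, acc, 16, wmma::mem_row_major);
}
```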

These hardware components work together to enable GPUs to achieve very high compute throughput and memory bandwidth, making them well-suited for parallel workloads in graphics, HPC, and AI. The highly parallel architecture and specialized hardware units of modern GPUs allow them to deliver performance that is orders of magnitude higher than general-purpose CPUs on certain workloads.

A brief history of GPUs

The history of GPUs can be traced back to the early days of 3D graphics acceleration in the 1990s:

  • 1990s: Early 3D accelerators like the 3dfx Voodoo and NVIDIA RIVA TNT appeared in the mid-to-late 1990s to offload 3D graphics rendering from the CPU. These were fixed-function devices tied to specific graphics APIs and offered little to no programmability.

  • 1999: NVIDIA introduced the GeForce 256, the first GPU to implement hardware transform and lighting (T&L) in addition to the standard 3D rendering pipeline. It could process 10 million polygons per second, a major milestone in consumer graphics performance.

  • 2001: NVIDIA launched the GeForce 3, which introduced programmable vertex and pixel shading, opening the door for more realistic and dynamic visual effects. This marked the beginning of the transition from fixed-function to programmable graphics pipelines.

  • 2006: The release of NVIDIA's GeForce 8800 GTX marked a major turning point, as it was the first GPU to support the CUDA programming model, enabling developers to use the GPU for general-purpose computing (GPGPU) beyond just graphics. It featured 128 CUDA cores and could achieve over 500 GFLOPS of performance.

  • 2008: Apple, AMD, Intel, and NVIDIA formed the OpenCL working group under the Khronos Group to develop an open standard for parallel programming on heterogeneous systems. OpenCL provided a vendor-agnostic alternative to CUDA, although CUDA remained the most widely used GPGPU platform.

  • 2010: NVIDIA launched the Fermi architecture, which featured up to 512 CUDA cores, a cache hierarchy with configurable per-SM L1 caches and a unified L2 cache, ECC memory support, and improved double-precision performance. This made GPUs viable for a wider range of HPC and scientific computing applications.

  • 2016: NVIDIA introduced the Pascal architecture with the Tesla P100, which featured high-bandwidth HBM2 memory, up to 3584 CUDA cores, and double-rate FP16 arithmetic for deep learning. The P100 could deliver over 10 TFLOPS of single-precision performance, cementing GPUs as the platform of choice for AI training.

  • 2018: NVIDIA launched the Turing architecture, which introduced RT Cores for real-time ray tracing and brought Tensor Cores (first introduced in the 2017 Volta architecture) to consumer GPUs for accelerated AI inference. Turing marked a significant milestone in GPU architecture, as it expanded the GPU's capabilities beyond rasterization and GPGPU to include advanced rendering techniques and AI acceleration.

Conclusion

Over the past two decades, GPUs have evolved from fixed-function graphics accelerators to highly programmable, energy-efficient computing engines that play a critical role in a wide range of applications from gaming and visualization to high performance computing and artificial intelligence. Key architectural innovations that have enabled this transformation include:

  • The introduction of programmable shading with support for branching and looping
  • Unified shader architectures that allow the same processing units to be used for different shading tasks
  • The addition of support for general-purpose programming models like CUDA and OpenCL
  • Increasing energy efficiency through extensive use of multithreading to hide memory latency and keep arithmetic units utilized
  • Continued improvements in memory bandwidth and the introduction of high-bandwidth memory technologies like GDDR6 and HBM2
  • The incorporation of fixed-function units for ray tracing and tensor processing to accelerate rendering and AI workloads

As we look to the future, it's clear that specialization and heterogeneous computing will continue to be key drivers for improving performance and efficiency. GPUs are well-positioned to remain at the forefront of these trends given their heritage of energy-efficient parallel processing and their ability to incorporate domain-specific functionality while maintaining general-purpose programmability. Techniques like chiplet-based designs and advanced packaging technologies will allow GPUs to scale to even higher levels of performance and integrate even more functionality over time.

At the same time, the applicability of GPU acceleration continues to grow as more and more workloads in scientific computing, data analytics, and machine learning exhibit the kind of fine-grained parallelism that GPUs excel at. With their ability to accelerate these and other emerging applications, GPUs are poised to play an increasingly important role in driving future advancements in computing. Understanding their architecture is key to unlocking their full potential.