
How a GPU Works: Graphics vs Compute Architecture Explained

A GPU (Graphics Processing Unit) is a specialized processor designed to handle thousands of simple calculations in parallel. Modern GPUs excel at both rendering graphics and accelerating compute-intensive workloads like machine learning and scientific simulations through massively parallel execution on thousands of cores.

Why GPU Architecture Differs from CPUs

CPUs prioritize low latency and versatility. They've got fewer cores (typically 4 to 64), but each one's powerful and can handle complex, sequential instructions. A CPU's focus on latency makes it ideal for branching logic, interrupts, and general-purpose computing.

GPUs take the opposite approach. They're built around thousands of simpler cores optimized for throughput, not latency. This design shines when you've got massive datasets and identical operations to perform across them. Your GPU might have 2,048 to 10,240+ cores, each capable of executing the same instruction on different data simultaneously.

This fundamental difference explains why CPU-only software rendering struggled with real-time 3D graphics before dedicated graphics hardware became common in the late 1990s, why CPUs still dominate general computing, and why GPUs now power AI training, video encoding, and financial modeling.
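As a rough illustration of the contrast described in this section, the sketch below computes y = a*x + y two ways: a CPU loop that visits elements one at a time, and a CUDA kernel in which each thread handles one element. The names `saxpy_cpu`, `saxpy_kernel`, and `saxpy_gpu` and the block size of 256 are illustrative choices, and error handling is kept minimal.

```cuda
#include <cuda_runtime.h>
#include <vector>

// CPU version: a single core walks the array element by element.
void saxpy_cpu(float a, const std::vector<float>& x, std::vector<float>& y) {
  for (size_t i = 0; i < x.size(); ++i) y[i] = a * x[i] + y[i];
}

// GPU version: every thread runs the same instruction on a different element.
__global__ void saxpy_kernel(int n, float a, const float* x, float* y) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) y[i] = a * x[i] + y[i];
}

// Hypothetical host wrapper (assumes x and y have the same length).
// Copies inputs to the GPU, launches the kernel, and copies the result back.
cudaError_t saxpy_gpu(float a, const std::vector<float>& x, std::vector<float>& y) {
  int n = static_cast<int>(x.size());
  if (n == 0) return cudaSuccess;
  size_t bytes = n * sizeof(float);
  float *dx = nullptr, *dy = nullptr;
  cudaMalloc(&dx, bytes);
  cudaMalloc(&dy, bytes);
  cudaMemcpy(dx, x.data(), bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(dy, y.data(), bytes, cudaMemcpyHostToDevice);

  int threads = 256;                         // threads per block (assumed)
  int blocks = (n + threads - 1) / threads;  // enough blocks to cover n
  saxpy_kernel<<<blocks, threads>>>(n, a, dx, dy);

  // This copy waits for the kernel to finish before reading the result.
  cudaError_t err = cudaMemcpy(y.data(), dy, bytes, cudaMemcpyDeviceToHost);
  cudaFree(dx);
  cudaFree(dy);
  return err;
}
```

For small arrays the CPU loop will usually win, because the copies and the launch cost more than the arithmetic they replace.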

The Graphics Pipeline: How GPUs Render Frames

When you play a game or render a 3D scene, your GPU executes what's called the graphics pipeline. This isn't a single operation—it's a series of specialized stages that transform 3D vertex data into 2D pixels on your screen.

Vertex Shading

Your GPU receives vertex data (3D coordinates, colors, texture information). Thousands of GPU cores execute the same vertex shader program on different vertices in parallel. Each core transforms a vertex's position based on camera angle, lighting, and animation data. Since vertices are independent, this parallelization is natural and efficient.
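Real vertex shaders are written in a shading language such as GLSL or HLSL, but the per-vertex parallelism can be sketched in CUDA. The kernel below multiplies each vertex position by a 4x4 row-major matrix, one vertex per thread. The names `transform_vertices` and `transform_on_gpu` are hypothetical, and this is a compute-mode stand-in for the position transform, not how a graphics driver actually runs shaders.

```cuda
#include <cuda_runtime.h>
#include <vector>

// One thread per vertex: out = M * in, with M stored as 16 row-major floats.
__global__ void transform_vertices(const float4* in, float4* out, int n,
                                   const float* m) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;
  float4 v = in[i];
  out[i] = make_float4(
      m[0]  * v.x + m[1]  * v.y + m[2]  * v.z + m[3]  * v.w,
      m[4]  * v.x + m[5]  * v.y + m[6]  * v.z + m[7]  * v.w,
      m[8]  * v.x + m[9]  * v.y + m[10] * v.z + m[11] * v.w,
      m[12] * v.x + m[13] * v.y + m[14] * v.z + m[15] * v.w);
}

// Hypothetical host wrapper; m_host must point to 16 floats.
std::vector<float4> transform_on_gpu(const std::vector<float4>& verts,
                                     const float* m_host) {
  int n = static_cast<int>(verts.size());
  std::vector<float4> result(n);
  if (n == 0) return result;
  float4 *d_in = nullptr, *d_out = nullptr;
  float* d_m = nullptr;
  cudaMalloc(&d_in, n * sizeof(float4));
  cudaMalloc(&d_out, n * sizeof(float4));
  cudaMalloc(&d_m, 16 * sizeof(float));
  cudaMemcpy(d_in, verts.data(), n * sizeof(float4), cudaMemcpyHostToDevice);
  cudaMemcpy(d_m, m_host, 16 * sizeof(float), cudaMemcpyHostToDevice);

  int threads = 256;
  transform_vertices<<<(n + threads - 1) / threads, threads>>>(d_in, d_out, n, d_m);

  cudaMemcpy(result.data(), d_out, n * sizeof(float4), cudaMemcpyDeviceToHost);
  cudaFree(d_in);
  cudaFree(d_out);
  cudaFree(d_m);
  return result;
}
```

No thread reads another thread's vertex, which is why this stage needs no synchronization.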

Rasterization

After vertices are transformed, the GPU assembles them into triangles and determines which pixels those triangles cover. This step is highly parallelizable: many pixels can be tested against triangle edges at once, and on modern GPUs it's typically handled by dedicated fixed-function rasterizer hardware rather than by the programmable cores.
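The coverage test itself can be sketched as a CUDA kernel that checks each pixel center against the three triangle edges using edge functions. This is a simplified software emulation with hypothetical names (`edge`, `coverage_kernel`, `rasterize_triangle`). It accepts either winding order and ignores the fill rules and hierarchical traversal that real rasterizers use.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Edge function: its sign tells which side of the edge a->b the point p is on.
__device__ float edge(float2 a, float2 b, float2 p) {
  return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
}

// One thread per pixel: writes 1 if the pixel center lies inside the triangle.
__global__ void coverage_kernel(unsigned char* mask, int width, int height,
                                float2 v0, float2 v1, float2 v2) {
  int x = blockIdx.x * blockDim.x + threadIdx.x;
  int y = blockIdx.y * blockDim.y + threadIdx.y;
  if (x >= width || y >= height) return;
  float2 p = make_float2(x + 0.5f, y + 0.5f);  // sample at the pixel center
  float w0 = edge(v1, v2, p);
  float w1 = edge(v2, v0, p);
  float w2 = edge(v0, v1, p);
  // Inside when all three edge functions agree in sign (either winding).
  bool inside = (w0 >= 0 && w1 >= 0 && w2 >= 0) ||
                (w0 <= 0 && w1 <= 0 && w2 <= 0);
  mask[y * width + x] = inside ? 1 : 0;
}

// Hypothetical host wrapper returning a width*height coverage mask.
std::vector<unsigned char> rasterize_triangle(int width, int height,
                                              float2 v0, float2 v1, float2 v2) {
  std::vector<unsigned char> mask(static_cast<size_t>(width) * height, 0);
  if (mask.empty()) return mask;
  unsigned char* d_mask = nullptr;
  cudaMalloc(&d_mask, mask.size());
  dim3 block(16, 16);
  dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
  coverage_kernel<<<grid, block>>>(d_mask, width, height, v0, v1, v2);
  cudaMemcpy(mask.data(), d_mask, mask.size(), cudaMemcpyDeviceToHost);
  cudaFree(d_mask);
  return mask;
}
```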

Fragment Shading

For each pixel covered by a triangle, the GPU runs a fragment shader (also called a pixel shader). This determines the pixel's final color based on textures, lighting calculations, and material properties. Again, since pixels are independent, thousands run their shaders concurrently.
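A minimal compute-mode sketch of per-fragment shading is shown below, using simple Lambertian diffuse lighting: color = albedo * max(0, N · L). It assumes unit-length normals and light direction, and the names `shade_fragments` and `shade_on_gpu` are hypothetical. Real fragment shaders also sample textures and are written in a shading language, which this sketch leaves out.

```cuda
#include <cuda_runtime.h>
#include <vector>

// One thread per fragment: diffuse term from the surface normal and light direction.
__global__ void shade_fragments(const float3* normals, float3* colors, int n,
                                float3 light_dir, float3 albedo) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;
  float3 nrm = normals[i];
  float ndotl = nrm.x * light_dir.x + nrm.y * light_dir.y + nrm.z * light_dir.z;
  ndotl = fmaxf(0.0f, ndotl);  // surfaces facing away from the light get no light
  colors[i] = make_float3(albedo.x * ndotl, albedo.y * ndotl, albedo.z * ndotl);
}

// Hypothetical host wrapper.
std::vector<float3> shade_on_gpu(const std::vector<float3>& normals,
                                 float3 light_dir, float3 albedo) {
  int n = static_cast<int>(normals.size());
  std::vector<float3> colors(n);
  if (n == 0) return colors;
  float3 *d_n = nullptr, *d_c = nullptr;
  cudaMalloc(&d_n, n * sizeof(float3));
  cudaMalloc(&d_c, n * sizeof(float3));
  cudaMemcpy(d_n, normals.data(), n * sizeof(float3), cudaMemcpyHostToDevice);
  int threads = 256;
  shade_fragments<<<(n + threads - 1) / threads, threads>>>(d_n, d_c, n, light_dir, albedo);
  cudaMemcpy(colors.data(), d_c, n * sizeof(float3), cudaMemcpyDeviceToHost);
  cudaFree(d_n);
  cudaFree(d_c);
  return colors;
}
```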

Memory Operations and Output

Fragment shaders write results to the framebuffer—GPU memory that directly feeds your display. Modern GPUs perform depth testing, blending, and anti-aliasing before writing final pixel values.

The entire pipeline leverages GPU parallelism beautifully. A CPU would process vertices sequentially; a GPU processes millions per frame in parallel. This is why your GPU handles graphics rendering and your CPU handles physics, AI, and game logic.

GPU Memory Hierarchy and Bandwidth

A GPU's performance hinges on memory bandwidth. Your GPU has an enormous number of execution units (a single NVIDIA RTX 4090 has 16,384 CUDA cores), but if data can't reach those cores quickly, they'll idle waiting.

Modern GPUs implement a memory hierarchy similar to CPUs but optimized for bandwidth:

Registers: Ultra-fast, private to each thread. Each multiprocessor holds tens of thousands of 32-bit registers.
Shared Memory / L1 Cache: Fast on-chip storage; the shared-memory portion is visible to all threads in a thread block. 96 KB–192 KB per multiprocessor.
L2 Cache: Larger, shared across the entire GPU. A few MB to tens of MB on recent GPUs (72 MB on the RTX 4090).
Global Memory (VRAM): High-capacity, high-latency. 4 GB–24 GB+ on consumer cards. Bandwidth is critical here.

An RTX 4090 provides about 1,008 GB/s of memory bandwidth compared to a desktop CPU's ~100 GB/s. That's roughly 10x more data flowing per second. This matters enormously when processing large datasets: a GPU can feed its cores with data far faster than a CPU can.
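To make the ratio concrete, here is a back-of-the-envelope estimate of how long one pass over a dataset takes when the work is purely bandwidth-bound. The 8 GB dataset size is an arbitrary illustrative choice and the GPU figure is rounded to about 1,000 GB/s. Real workloads rarely reach peak bandwidth, so treat these as lower bounds.

```latex
\[
t = \frac{\text{bytes moved}}{\text{bandwidth}}, \qquad
t_{\text{GPU}} \approx \frac{8\ \text{GB}}{1000\ \text{GB/s}} = 8\ \text{ms}, \qquad
t_{\text{CPU}} \approx \frac{8\ \text{GB}}{100\ \text{GB/s}} = 80\ \text{ms}
\]
```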

Compute Mode: General-Purpose GPU Computing

Compute workloads use GPUs differently than graphics. Instead of rendering pixels, GPUs accelerate arbitrary numerical computations: matrix multiplication, convolutions, sorting, physics simulations.

NVIDIA's CUDA (Compute Unified Device Architecture) lets developers write C-like kernels that run on GPU cores. Here's what that looks like:

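// Computes C = A * B for square n x n matrices stored in row-major order.
// Each thread produces one element of C.
__global__ void matmul_kernel(float *C, const float *A, const float *B, int n) {
  // Map this thread's 2D position in the grid to an output row and column.
  int row = blockIdx.y * blockDim.y + threadIdx.y;
  int col = blockIdx.x * blockDim.x + threadIdx.x;

  // Skip threads that fall outside the matrix when n is not a multiple
  // of the block size.
  if (row < n && col < n) {
    float sum = 0.0f;
    // Dot product of row `row` of A with column `col` of B.
    for (int k = 0; k < n; k++) {
      sum += A[row * n + k] * B[k * n + col];
    }
    C[row * n + col] = sum;
  }
}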

This kernel multiplies matrices. Each thread computes one element of the result matrix independently. Launch it with thousands of threads, and you're multiplying millions of elements in parallel. A CPU version would loop sequentially through indices. The GPU version does thousands at once.
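The kernel only runs once the host allocates GPU memory, copies the inputs over, and chooses a launch configuration. A minimal sketch of that host side follows. The wrapper name `matmul_gpu` and the 16x16 block shape are assumptions, and the kernel logic is repeated so the example compiles on its own.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Same logic as the kernel in the text: one thread per element of C.
__global__ void matmul_kernel(float *C, const float *A, const float *B, int n) {
  int row = blockIdx.y * blockDim.y + threadIdx.y;
  int col = blockIdx.x * blockDim.x + threadIdx.x;
  if (row < n && col < n) {
    float sum = 0.0f;
    for (int k = 0; k < n; k++) sum += A[row * n + k] * B[k * n + col];
    C[row * n + col] = sum;
  }
}

// Hypothetical host wrapper: A and B are n x n, row-major, with n > 0.
std::vector<float> matmul_gpu(const std::vector<float>& A,
                              const std::vector<float>& B, int n) {
  size_t count = static_cast<size_t>(n) * n;
  size_t bytes = count * sizeof(float);
  std::vector<float> C(count);
  float *dA = nullptr, *dB = nullptr, *dC = nullptr;
  cudaMalloc(&dA, bytes);
  cudaMalloc(&dB, bytes);
  cudaMalloc(&dC, bytes);
  cudaMemcpy(dA, A.data(), bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(dB, B.data(), bytes, cudaMemcpyHostToDevice);

  dim3 block(16, 16);  // 256 threads per block (assumed shape)
  dim3 grid((n + block.x - 1) / block.x,   // round up so the grid covers
            (n + block.y - 1) / block.y);  // every row and column
  matmul_kernel<<<grid, block>>>(dC, dA, dB, n);

  cudaMemcpy(C.data(), dC, bytes, cudaMemcpyDeviceToHost);
  cudaFree(dA);
  cudaFree(dB);
  cudaFree(dC);
  return C;
}
```

Rounding the grid size up is what makes the `row < n && col < n` guard in the kernel necessary.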

Thread Hierarchy

CUDA organizes parallel execution into a hierarchy. A kernel launch creates a grid of thread blocks, and each block contains up to 1,024 threads (a few hundred is typical). Threads within a block share memory and can synchronize. Blocks run independently, so the hardware can schedule many of them at once across its multiprocessors.

Graphics pipelines are optimized for data independence (vertex shading, fragment shading). Compute kernels can be more flexible: threads can communicate via shared memory and synchronize. This flexibility enables algorithms beyond graphics, such as reductions, scans, and sorting.
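As a small sketch of threads communicating through shared memory, the kernel below reverses the elements inside each block-sized tile of an array. Every thread writes one value into shared memory, the block synchronizes, and each thread then reads a value that a different thread wrote. The names `reverse_tiles` and `reverse_tiles_on_gpu` and the tile size `kBlock` are illustrative.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kBlock = 256;  // threads per block and tile size (assumed)

// Reverses the order of elements within each tile of kBlock elements.
__global__ void reverse_tiles(const float* in, float* out, int n) {
  __shared__ float tile[kBlock];
  int base = blockIdx.x * blockDim.x;
  int i = base + threadIdx.x;
  int remaining = n - base;
  int count = remaining < kBlock ? remaining : kBlock;  // last tile may be short

  if (i < n) tile[threadIdx.x] = in[i];
  __syncthreads();  // every load must land before any thread reads a neighbor's slot

  if ((int)threadIdx.x < count) {
    out[i] = tile[count - 1 - threadIdx.x];
  }
}

// Hypothetical host wrapper.
std::vector<float> reverse_tiles_on_gpu(const std::vector<float>& host) {
  int n = static_cast<int>(host.size());
  std::vector<float> result(n);
  if (n == 0) return result;
  size_t bytes = n * sizeof(float);
  float *d_in = nullptr, *d_out = nullptr;
  cudaMalloc(&d_in, bytes);
  cudaMalloc(&d_out, bytes);
  cudaMemcpy(d_in, host.data(), bytes, cudaMemcpyHostToDevice);
  reverse_tiles<<<(n + kBlock - 1) / kBlock, kBlock>>>(d_in, d_out, n);
  cudaMemcpy(result.data(), d_out, bytes, cudaMemcpyDeviceToHost);
  cudaFree(d_in);
  cudaFree(d_out);
  return result;
}
```

The `__syncthreads()` call sits outside any conditional, so every thread in the block reaches it. Calling it from only some threads of a block is undefined behavior.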

Key Architectural Differences Between Graphics and Compute

Instruction Coherence

In graphics mode, every pixel in a draw call executes the same shader code. GPUs assume high instruction coherence: thousands of cores executing identical instructions on different data. This is the SIMD (Single Instruction, Multiple Data) idea, which NVIDIA calls SIMT (Single Instruction, Multiple Threads).

Compute kernels allow branches, loops, and divergent execution. If threads in a warp (32 threads on NVIDIA hardware) take different code paths, the warp executes each path in turn while the threads not on that path sit idle, reducing throughput. Well-written compute code minimizes divergence.
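The two kernels below sketch the difference. In the first, neighboring threads disagree on the branch, so every warp has to run both paths. In the second, the condition depends on the warp index, so all 32 threads of a warp take the same path. The kernel names and the wrapper `run_branch_kernel` are hypothetical, and the two kernels deliberately assign the operations to different elements. For branches this short the compiler may use predication instead of real branching, so the cost only becomes noticeable with heavier branch bodies.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Divergent: even and odd threads in the same warp take different paths.
__global__ void branch_per_thread(float* data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;
  if (i % 2 == 0) data[i] = data[i] * 2.0f;
  else            data[i] = data[i] + 1.0f;
}

// Warp-uniform: the branch depends on the warp index, so a warp never splits.
__global__ void branch_per_warp(float* data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;
  int warp = i / warpSize;  // warpSize is a built-in device variable (32 on NVIDIA GPUs)
  if (warp % 2 == 0) data[i] = data[i] * 2.0f;
  else               data[i] = data[i] + 1.0f;
}

// Hypothetical host wrapper; the block size is a multiple of the warp size
// so that i / warpSize lines up with the hardware warps.
std::vector<float> run_branch_kernel(std::vector<float> host, bool warp_uniform) {
  int n = static_cast<int>(host.size());
  if (n == 0) return host;
  size_t bytes = n * sizeof(float);
  float* d = nullptr;
  cudaMalloc(&d, bytes);
  cudaMemcpy(d, host.data(), bytes, cudaMemcpyHostToDevice);
  int threads = 128;
  int blocks = (n + threads - 1) / threads;
  if (warp_uniform) branch_per_warp<<<blocks, threads>>>(d, n);
  else              branch_per_thread<<<blocks, threads>>>(d, n);
  cudaMemcpy(host.data(), d, bytes, cudaMemcpyDeviceToHost);
  cudaFree(d);
  return host;
}
```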

Memory Access Patterns

Graphics workloads read from textures in spatially coherent patterns. GPU texture units are optimized for these access patterns with specialized caching.

Compute workloads access global memory directly. Coalesced memory accesses (where consecutive threads read consecutive memory locations) are critical for performance. Poor memory patterns can reduce bandwidth by 10x.
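One way to observe the effect is a copy kernel whose read index is scaled by a stride. With a stride of 1, consecutive threads read consecutive addresses (coalesced). With a larger stride, the threads of a warp touch addresses that are far apart. The sketch below times a single launch with CUDA events. The names `copy_with_stride` and `run_copy_ms` are hypothetical, and a single unwarmed launch gives only a rough number, so results will vary by GPU.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Thread i reads element (i * stride) % n and writes element i.
// stride == 1 gives fully coalesced reads; larger strides scatter them.
__global__ void copy_with_stride(const float* in, float* out, int n, int stride) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    int j = static_cast<int>((static_cast<long long>(i) * stride) % n);
    out[i] = in[j];
  }
}

// Hypothetical host wrapper: fills `out` and returns the kernel time in milliseconds.
float run_copy_ms(const std::vector<float>& in, std::vector<float>& out, int stride) {
  int n = static_cast<int>(in.size());
  out.resize(n);
  if (n == 0) return 0.0f;
  size_t bytes = n * sizeof(float);
  float *d_in = nullptr, *d_out = nullptr;
  cudaMalloc(&d_in, bytes);
  cudaMalloc(&d_out, bytes);
  cudaMemcpy(d_in, in.data(), bytes, cudaMemcpyHostToDevice);

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  int threads = 256;
  cudaEventRecord(start);
  copy_with_stride<<<(n + threads - 1) / threads, threads>>>(d_in, d_out, n, stride);
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);  // wait until the kernel and the stop event complete
  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);

  cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  cudaFree(d_in);
  cudaFree(d_out);
  return ms;
}
```

Comparing `run_copy_ms(in, out, 1)` with `run_copy_ms(in, out, 32)` on an array of tens of millions of floats should show the gap, though caches can hide it on small inputs.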

Synchronization Needs

Graphics pipeline stages are implicitly ordered. Vertex shading completes before rasterization; rasterization before fragment shading. Synchronization happens automatically.

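Compute kernels may require explicit synchronization. Threads within a block can call __syncthreads() to wait until all of them reach that point. Synchronizing across blocks generally means ending the kernel and launching another one, although CUDA's cooperative groups API offers a grid-wide barrier on supported hardware.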
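Both kinds of synchronization appear in the sketch below, which normalizes an array by its sum. The first kernel uses `__syncthreads()` inside a shared-memory reduction and adds each block's partial sum to a global total. The second kernel is a separate launch, so by the time it runs every block of the first has finished. The names and the fixed block size `kThreads` (which must be a power of two for this reduction) are assumptions of the example.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kThreads = 256;  // must be a power of two for the loop below

// Pass 1: each block reduces its slice, then adds one partial sum to *total.
__global__ void block_sum(const float* in, float* total, int n) {
  __shared__ float partial[kThreads];
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  partial[threadIdx.x] = (i < n) ? in[i] : 0.0f;
  __syncthreads();
  for (int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
    __syncthreads();  // finish this halving step before starting the next
  }
  if (threadIdx.x == 0) atomicAdd(total, partial[0]);
}

// Pass 2: a separate launch, so *total already includes every block's sum.
__global__ void divide_by_total(float* data, const float* total, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] /= *total;
}

// Hypothetical host wrapper (assumes the elements do not sum to zero).
std::vector<float> normalize_on_gpu(std::vector<float> host) {
  int n = static_cast<int>(host.size());
  if (n == 0) return host;
  size_t bytes = n * sizeof(float);
  float *d_data = nullptr, *d_total = nullptr;
  cudaMalloc(&d_data, bytes);
  cudaMalloc(&d_total, sizeof(float));
  cudaMemcpy(d_data, host.data(), bytes, cudaMemcpyHostToDevice);
  cudaMemset(d_total, 0, sizeof(float));

  int blocks = (n + kThreads - 1) / kThreads;
  // Kernels on the same stream run in order: the kernel boundary is the barrier.
  block_sum<<<blocks, kThreads>>>(d_data, d_total, n);
  divide_by_total<<<blocks, kThreads>>>(d_data, d_total, n);

  cudaMemcpy(host.data(), d_data, bytes, cudaMemcpyDeviceToHost);
  cudaFree(d_data);
  cudaFree(d_total);
  return host;
}
```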

Modern GPU Use Cases

Graphics (Traditional)

Gaming, film rendering, CAD applications, and real-time visualization. GPUs here execute graphics pipelines, rendering 60–240 frames per second on modern hardware.

Machine Learning

Deep learning frameworks (PyTorch, TensorFlow) use GPUs for training and inference. Matrix multiplications dominate neural network computations. A single RTX 4090 provides ~82 TFLOPS of FP32 performance, roughly 40–50x a high-end CPU, and its Tensor Cores go well beyond that at lower precisions. This is why serious ML work happens on GPUs.

Data Science & Analytics

Libraries like RAPIDS accelerate data processing. Sorting, filtering, and aggregating massive datasets run faster when the data sits in GPU memory and is processed with parallel algorithms.

Scientific Computing

Physics simulations, climate modeling, and molecular dynamics benefit enormously from GPU acceleration. Iterative solvers, FFTs, and particle simulations map naturally to parallel GPU execution.

Video Encoding

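H.264, H.265, and VP9 encoding can be hardware-accelerated on GPUs, though codec support varies by vendor and generation. On current cards this is typically done by a dedicated fixed-function encoder block (such as NVIDIA's NVENC) rather than the general-purpose cores. Offloading motion estimation and transform operations this way enables real-time encoding of 4K video.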

Choosing Between GPU and CPU

Use a GPU when:

You're processing massive datasets with identical operations
Memory bandwidth, not latency, limits your application
Your algorithm has high parallelism (>10,000 independent operations)
Data transfer overhead won't dominate execution time

Stick with CPU when:

Your algorithm's sequential or has complex branching
Data is small (GPU transfer overhead kills gains)
You need low latency for single operations
Your code rarely changes and CPU performance suffices

Modern systems often use both. Your CPU orchestrates logic; your GPU accelerates compute-heavy stages. This heterogeneous approach combines CPU flexibility with GPU throughput.

Understanding GPU Limitations

Memory Transfer Overhead

PCIe bandwidth (roughly 32 GB/s in each direction for a PCIe 4.0 x16 link) is far lower than GPU memory bandwidth (900+ GB/s). If your kernel runs only briefly, transfer time dominates. Keep data on the GPU and perform multiple operations on it to amortize the transfer cost.
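The amortization pattern can be sketched as follows: copy the data to the GPU once, run several kernels against the device-resident buffer, and copy back once at the end. The trivial `add_one` kernel and the wrapper `run_steps_on_gpu` are placeholders for whatever real processing steps an application would chain.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Stand-in for one processing step that operates on device-resident data.
__global__ void add_one(float* data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] += 1.0f;
}

// Hypothetical host wrapper: two PCIe transfers in total, however many steps run.
std::vector<float> run_steps_on_gpu(std::vector<float> host, int steps) {
  int n = static_cast<int>(host.size());
  if (n == 0) return host;
  size_t bytes = n * sizeof(float);
  float* d = nullptr;
  cudaMalloc(&d, bytes);
  cudaMemcpy(d, host.data(), bytes, cudaMemcpyHostToDevice);  // transfer in, once

  int threads = 256;
  int blocks = (n + threads - 1) / threads;
  for (int s = 0; s < steps; ++s) {
    add_one<<<blocks, threads>>>(d, n);  // data never leaves GPU memory here
  }

  cudaMemcpy(host.data(), d, bytes, cudaMemcpyDeviceToHost);  // transfer out, once
  cudaFree(d);
  return host;
}
```

Copying the array back and forth around every step would multiply the PCIe traffic by the number of steps without changing the result.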

Synchronization Overhead

Launching kernels, copying memory, and reading results back involves CPU-GPU synchronization. Many small kernels run slower than one large kernel doing the same work.
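Kernel fusion is one common way to cut this overhead. The sketch below applies a scale and a bias either as two separate launches or as one fused kernel. The fused form launches once and reads and writes each element once instead of twice. All names here are illustrative.

```cuda
#include <cuda_runtime.h>
#include <vector>

__global__ void scale(float* d, float a, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) d[i] *= a;
}

__global__ void add_bias(float* d, float b, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) d[i] += b;
}

// Fused version: one launch, one global-memory read and one write per element.
__global__ void scale_add_fused(float* d, float a, float b, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) d[i] = d[i] * a + b;
}

// Hypothetical host wrapper that runs either variant.
std::vector<float> scale_add_on_gpu(std::vector<float> host, float a, float b,
                                    bool fused) {
  int n = static_cast<int>(host.size());
  if (n == 0) return host;
  size_t bytes = n * sizeof(float);
  float* d = nullptr;
  cudaMalloc(&d, bytes);
  cudaMemcpy(d, host.data(), bytes, cudaMemcpyHostToDevice);
  int threads = 256;
  int blocks = (n + threads - 1) / threads;
  if (fused) {
    scale_add_fused<<<blocks, threads>>>(d, a, b, n);
  } else {
    scale<<<blocks, threads>>>(d, a, n);     // first launch
    add_bias<<<blocks, threads>>>(d, b, n);  // second launch, second pass over memory
  }
  cudaMemcpy(host.data(), d, bytes, cudaMemcpyDeviceToHost);
  cudaFree(d);
  return host;
}
```

The two variants can differ in the last bit for some inputs, because the fused expression may compile to a single fused multiply-add.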

Precision Trade-offs

GPUs excel at low-precision (FP16, INT8) operations. Higher precision (FP64) runs slower on most GPUs, and on consumer cards it's often a small fraction of the FP32 rate (1/64 on the RTX 4090). Graphics rarely needs double precision; compute applications sometimes do.
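The cost of that speed is resolution. The sketch below rounds FP32 values to FP16 and back using the conversion intrinsics from `cuda_fp16.h`, so the lost precision can be inspected directly. The kernel and wrapper names are hypothetical.

```cuda
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <vector>

// Converts each value to half precision and back to float.
__global__ void roundtrip_half(const float* in, float* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = __half2float(__float2half(in[i]));
}

// Hypothetical host wrapper.
std::vector<float> roundtrip_half_on_gpu(const std::vector<float>& host) {
  int n = static_cast<int>(host.size());
  std::vector<float> result(n);
  if (n == 0) return result;
  size_t bytes = n * sizeof(float);
  float *d_in = nullptr, *d_out = nullptr;
  cudaMalloc(&d_in, bytes);
  cudaMalloc(&d_out, bytes);
  cudaMemcpy(d_in, host.data(), bytes, cudaMemcpyHostToDevice);
  int threads = 256;
  roundtrip_half<<<(n + threads - 1) / threads, threads>>>(d_in, d_out, n);
  cudaMemcpy(result.data(), d_out, bytes, cudaMemcpyDeviceToHost);
  cudaFree(d_in);
  cudaFree(d_out);
  return result;
}
```

FP16 has an 11-bit significand, so integers above 2,048 are no longer all representable: 2049.0f comes back as 2048.0f, while values such as 0.5f survive unchanged.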

Frequently Asked Questions

Can a GPU replace a CPU?

No. GPUs excel at parallel compute workloads; CPUs handle sequential logic, branching, and general-purpose tasks. Modern systems use both. Your CPU manages OS, I/O, and decision-making; your GPU accelerates matrix math and data processing.

Why are graphics and compute so different on GPUs?

The graphics pipeline has a fixed stage order: vertex shading, rasterization, then fragment shading. The shader stages are programmable, but stages such as rasterization and blending are fixed-function hardware, and the whole arrangement is optimized for rendering. Compute is flexible: arbitrary kernels with custom synchronization, which allows diverse algorithms. Modern GPUs support both modes.

What's a CUDA core, and how do they differ from CPU cores?

A CUDA core is a simple arithmetic unit capable of basic operations (add, multiply, etc.). An RTX 4090 has 16,384 CUDA cores versus a typical desktop CPU's 8–16 cores. CUDA cores are simpler and slower individually, but thousands of them execute the same instruction in parallel. A CPU core is more complex, can handle diverse instruction types, and excels at serial execution.

Is VRAM the same as system RAM?

No. On a discrete graphics card, VRAM (GPU memory) is physically separate from system RAM. VRAM sits on the card and connects to the GPU over a wide, high-bandwidth memory bus (GDDR or HBM). System RAM is attached to the CPU, and the GPU reaches it over PCIe (or NVLink on some data-center systems). Copying data between the two is slow relative to on-GPU operations. This is why you want to minimize PCIe transfers in compute workloads.
