A GPU (Graphics Processing Unit) is a specialized processor designed to handle thousands of simple calculations in parallel. Modern GPUs excel at both rendering graphics and accelerating compute-intensive workloads like machine learning and scientific simulations through massively parallel execution on thousands of cores.
CPUs prioritize speed and versatility. They've got fewer cores—typically 4 to 64—but each one's powerful and can handle complex, sequential instructions. A CPU's focus on latency makes it ideal for branching logic, interrupts, and general-purpose computing.
GPUs take the opposite approach. They're built around thousands of simpler cores optimized for throughput, not latency. This design shines when you've got massive datasets and identical operations to perform across them. Your GPU might have 2,048 to 10,240+ cores, each capable of executing the same instruction on different data simultaneously.
This fundamental difference explains why GPUs struggled with graphics in the 1980s and 1990s (they didn't exist), why CPUs still dominate general computing, and why GPUs now power AI training, video encoding, and financial modeling.
When you play a game or render a 3D scene, your GPU executes what's called the graphics pipeline. This isn't a single operation—it's a series of specialized stages that transform 3D vertex data into 2D pixels on your screen.
Your GPU receives vertex data (3D coordinates, colors, texture information). Thousands of GPU cores execute the same vertex shader program on different vertices in parallel. Each core transforms a vertex's position based on camera angle, lighting, and animation data. Since vertices are independent, this parallelization is natural and efficient.
After vertices are transformed, the GPU converts them into triangles and determines which pixels those triangles cover. This step is highly parallelizable—checking millions of pixels against triangle boundaries happens simultaneously across GPU cores.
For each pixel covered by a triangle, the GPU runs a fragment shader (also called a pixel shader). This determines the pixel's final color based on textures, lighting calculations, and material properties. Again, since pixels are independent, thousands run their shaders concurrently.
Fragment shaders write results to the framebuffer—GPU memory that directly feeds your display. Modern GPUs perform depth testing, blending, and anti-aliasing before writing final pixel values.
The entire pipeline leverages GPU parallelism beautifully. A CPU would process vertices sequentially; a GPU processes millions per frame in parallel. This is why your GPU handles graphics rendering and your CPU handles physics, AI, and game logic.
A GPU's performance hinges on memory bandwidth. Your GPU cores run incredibly fast—a single NVIDIA RTX 4090 has ~16,000 CUDA cores. But if data can't reach those cores quickly, they'll idle waiting.
Modern GPUs implement a memory hierarchy similar to CPUs but optimized for bandwidth:
An RTX 4090 provides 936 GB/s of memory bandwidth compared to a CPU's ~100 GB/s. That's roughly 10x more data flowing per second. This matters enormously when processing datasets—a GPU can feed its cores with data far faster than a CPU can.
Compute workloads use GPUs differently than graphics. Instead of rendering pixels, GPUs accelerate arbitrary numerical computations: matrix multiplication, convolutions, sorting, physics simulations.
NVIDIA's CUDA (Compute Unified Device Architecture) lets developers write C-like kernels that run on GPU cores. Here's what that looks like:
__global__ void matmul_kernel(float *C, float *A, float *B, int n) {
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
if (row < n && col < n) {
float sum = 0.0f;
for (int k = 0; k < n; k++) {
sum += A[row * n + k] * B[k * n + col];
}
C[row * n + col] = sum;
}
}
This kernel multiplies matrices. Each thread computes one element of the result matrix independently. Launch it with thousands of threads, and you're multiplying millions of elements in parallel. A CPU version would loop sequentially through indices. The GPU version does thousands at once.
CUDA organizes parallel execution into a hierarchy. A kernel launch creates a grid of thread blocks, and each block contains hundreds of threads. Threads within a block share memory and can synchronize. Blocks run independently, allowing thousands to execute simultaneously.
Graphics pipelines are optimized for data independence (vertex shading, fragment shading). Compute kernels can be more flexible—threads can communicate via shared memory, perform reductions, and synchronize. This flexibility enables algorithms beyond graphics, like reductions, scans, and sorting.
In graphics mode, all pixels execute the same shader code. GPUs assume high instruction cohesion—thousands of cores executing identical instructions on different data (SIMD: Single Instruction, Multiple Data).
Compute kernels allow branches, loops, and divergent execution. If threads in a warp (32 threads on NVIDIA hardware) take different code paths, they execute serially, reducing throughput. Well-written compute code minimizes divergence.
Graphics workloads read from textures in spatially coherent patterns. GPU texture units are optimized for these access patterns with specialized caching.
Compute workloads access global memory directly. Coalesced memory accesses (where consecutive threads read consecutive memory locations) are critical for performance. Poor memory patterns can reduce bandwidth by 10x.
Graphics pipeline stages are implicitly ordered. Vertex shading completes before rasterization; rasterization before fragment shading. Synchronization happens automatically.
Compute kernels may require explicit synchronization. Threads within a block can call __syncthreads() to wait until all reach that point. Cross-block synchronization requires kernel boundaries.
Gaming, film rendering, CAD applications, and real-time visualization. GPUs here execute graphics pipelines, rendering 60–240 frames per second on modern hardware.
Deep learning frameworks (PyTorch, TensorFlow) use GPUs for training and inference. Matrix multiplications dominate neural network computations. A single RTX 4090 provides ~1,450 TFLOPS of FP32 performance—roughly 40–50x a high-end CPU. This is why serious ML work happens on GPUs.
Libraries like RAPIDS accelerate data processing. Sorting, filtering, and aggregating massive datasets runs faster on GPU memory with parallel algorithms.
Physics simulations, climate modeling, and molecular dynamics benefit enormously from GPU acceleration. Iterative solvers, FFTs, and particle simulations map naturally to parallel GPU execution.
H.264, H.265, and VP9 encoding use GPU acceleration. Motion estimation and transform operations parallelize across GPU cores, enabling real-time encoding of 4K video.
Use a GPU when:
Stick with CPU when:
Modern systems often use both. Your CPU orchestrates logic; your GPU accelerates compute-heavy stages. This heterogeneous approach combines CPU flexibility with GPU throughput.
PCIe bandwidth (32 GB/s on PCIe 4.0) is slower than GPU memory bandwidth (900+ GB/s). If your kernel runs briefly, transfer time dominates. You must keep data on GPU and perform multiple operations to amortize transfer cost.
Launching kernels, copying memory, and reading results back involves CPU-GPU synchronization. Many small kernels run slower than one large kernel doing the same work.
GPUs excel at low-precision (FP16, INT8) operations. Higher precision (FP64) runs slower on most GPUs. Graphics never needed double precision; compute applications sometimes do.
No. GPUs excel at parallel compute workloads; CPUs handle sequential logic, branching, and general-purpose tasks. Modern systems use both. Your CPU manages OS, I/O, and decision-making; your GPU accelerates matrix math and data processing.
Graphics pipelines are fixed-function: vertex shading, rasterization, fragment shading in strict order. Compute is flexible: arbitrary kernels with custom synchronization. A GPU's fixed graphics pipeline is optimized for rendering; compute flexibility allows diverse algorithms. Modern GPUs support both modes.
A CUDA core is a simple arithmetic logic unit capable of basic operations (add, multiply, etc.). An RTX 4090 has ~16,000 CUDA cores versus a CPU's 8–16 cores. CUDA cores are simpler and slower individually but execute identically across thousands in parallel. A CPU core is more complex, can handle diverse instruction types, and excels at serial execution.
No. VRAM (GPU memory) is physically separate from system RAM. VRAM is connected directly to the GPU via a high-bandwidth interconnect (NVLink on high-end cards). System RAM connects via PCIe. Copying data between them is slow relative to on-GPU operations. This is why you want to minimize PCIe transfers in compute workloads.
Related reading: Learn about CPU architecture basics to compare with GPU design, or explore memory systems and caching strategies for deeper performance understanding. For practical GPU programming, check out CUDA fundamentals and parallel computing introduction.