Fork-join parallelism, work and span, greedy scheduling, and the art of going fast together.
For forty years, software engineers had a free lunch. Every 18 months, clock speeds doubled. Your program ran twice as fast without you changing a single line of code. Moore's Law delivered performance on a silver platter. The Intel 4004 (1971) ran at 740 kHz. The Pentium 4 (2004) hit 3.8 GHz — a 5,000× increase in clock speed over 33 years.
Then, around 2005, the lunch ended. Clock speeds hit a wall near 4 GHz. Physics intervened: pushing electrons faster through silicon generates heat proportional to the cube of the frequency. A 4 GHz chip dissipates roughly 100 watts. Doubling to 8 GHz would require 800 watts — enough to melt the chip. This is called the power wall, and it ended the era of single-core speedups.
Intel, AMD, and ARM did not give up on performance — they changed strategy. Instead of making one core faster, they put multiple cores on the same chip. Your laptop today has 8 to 16 cores. A GPU has thousands. A modern data center node might have 128 CPU cores and 8 GPUs with 80,000 CUDA cores total.
But here is the uncomfortable truth: a 16-core machine running a single-threaded program uses exactly one core. The other 15 sit idle, burning power and generating heat while doing absolutely nothing useful. To use them, you must rewrite your algorithm to do multiple things at once. That is parallel computing, and it is no longer optional — it is the only path to faster programs.
The stakes are enormous. Machine learning models like GPT-4 would take over 100 years to train on a single GPU. Weather simulations that predict hurricanes 5 days in advance require trillions of floating-point operations per second. Protein folding, drug discovery, autonomous vehicles — every hard computational problem today is solved by parallel algorithms running on parallel hardware. If you write sequential code, you are voluntarily handicapping yourself to 1/16th (or 1/1000th on a GPU) of your machine's capacity.
The model we will study is fork-join parallelism, also called the dynamic multithreading model. The idea is simple:
This is the same pattern as divide-and-conquer, but with a twist: the recursive calls run simultaneously instead of one after the other. Merge sort becomes parallel merge sort by spawning the two recursive calls. The merge step still waits for both halves to finish (sync), then combines them.
The model makes a simplifying assumption: we have an ideal scheduler that can assign any ready task to any idle processor in O(1) time, and memory access is uniform (any processor can read any memory location at equal cost). Real hardware violates both assumptions (schedulers are not free; NUMA and caches create non-uniform access), but the model gives tight performance bounds that hold remarkably well in practice.
The simulation below shows the difference. On the left, a sequential program executes tasks one at a time. On the right, a fork-join program overlaps independent tasks. Watch how the total time drops even though the total work stays the same.
Left: tasks execute one by one. Right: independent tasks run in parallel. Same total work, less wall-clock time. Adjust processor count to see the effect.
CLRS uses the Cilk concurrency model from MIT (later commercialized by Intel). Cilk extends a serial language with exactly three keywords:
| Keyword | Meaning | Analogy |
|---|---|---|
| spawn | Launch a function call as a parallel task. The caller continues immediately without waiting. | "Hey friend, sort this half while I sort mine." |
| sync | Wait for all previously spawned tasks to complete before continuing. | "Wait for my friend to finish before I merge." |
| parallel for | Execute all iterations of a loop in parallel. Syntactic sugar for spawn + sync on each iteration. | "Everyone take one exam and grade it." |
The beauty of this model is that a Cilk program with all spawn/sync removed is a valid serial program that produces the correct answer. The parallel keywords are performance hints, not correctness requirements. This property — serial equivalence — makes reasoning about correctness much easier than traditional threading.
The key questions this chapter answers are: How fast can a parallel algorithm go? What is the theoretical limit? How do we schedule tasks across processors efficiently? And most importantly: What goes wrong when parallel tasks step on each other's toes?
To reason about parallel algorithms, we need two numbers that capture everything important about their performance. Forget about the number of processors for a moment — the algorithm itself has intrinsic properties that determine how parallelizable it is.
Work, written T1, is the total number of operations the algorithm performs. It is exactly the running time on a single processor — the serial time. Think of it as the total amount of "stuff to do." If your algorithm does n additions and n multiplications, the work is 2n regardless of how many processors you have. Work is a property of the algorithm, not the hardware. A 1-core machine takes T1 time. A 100-core machine still does T1 total operations — it just spreads them across cores.
Span, written T∞, is the longest chain of sequentially dependent operations — the critical path. It is the time the algorithm would take with an infinite number of processors. Even if you had a million cores, you cannot go faster than the span, because the operations along the critical path must happen one after another. The subscript ∞ reminds us: this is the time with infinite processors. In the DAG picture, the span is the length of the longest path from any source to any sink.
The ratio T1 / T∞ is called the parallelism. It tells you the maximum useful number of processors. If work = 1000 and span = 10, the parallelism is 100 — adding more than 100 processors gives no further speedup. Beyond 100, extra processors sit idle because there are not enough independent tasks to keep them busy. This is the algorithm's "parallelizability score."
Some algorithms have enormous parallelism. Parallel matrix multiply has parallelism Θ(n³/log n) — for 1000×1000 matrices, that is about 100 million. Other algorithms have pitiful parallelism. Sequential scan (prefix sum, without the Blelloch trick) has parallelism 1 — completely serial. The goal of parallel algorithm design is to maximize parallelism while keeping work efficient.
The running time on P processors, TP, is bounded from below by two constraints:
The first term says you cannot do better than dividing the work equally. The second says you cannot beat the critical path. A good parallel algorithm minimizes both: low span (short critical path) and efficient work (not much more than the serial algorithm).
Speedup on P processors is T1 / TP. If TP = T1 / P, we have linear speedup — the ideal case. If TP = T1 · 1 (no improvement), speedup is 1. Perfect parallelism gives speedup equal to P.
Every parallel computation can be represented as a directed acyclic graph (DAG). Each node represents a unit of work (an instruction, a function call, a loop body). Each edge represents a dependency: "this must finish before that can start." The DAG captures everything about the parallel structure of the algorithm.
There are three types of edges in a fork-join DAG:
| Edge Type | What It Represents | Direction |
|---|---|---|
| Continuation | Sequential control flow: one instruction after another | Within the same thread, top to bottom |
| Spawn | A parent spawns a child task. The child begins; the parent continues. | Parent → child (creating a new branch) |
| Join (sync) | A task waits for its spawned children to finish | Child → sync point in parent |
The work T1 is the total number of nodes (summing all node weights if nodes have different costs). The span T∞ is the weight of the longest path from any source node (no predecessors) to any sink node (no successors). Finding the critical path is equivalent to finding the longest path in a DAG, which can be done in O(V + E) time using topological sort.
The simulation below shows a computation DAG (directed acyclic graph). Each node is an operation, each edge is a dependency. The work is the total number of nodes. The span is the length of the longest path from source to sink, highlighted in orange. Adjust the DAG shape and see how work and span change.
Each circle is one operation. Arrows show dependencies. Orange path = critical path (span). Total nodes = work. Adjust the branching factor to see how parallelism changes.
A parallel algorithm is work-efficient if its work matches the best serial algorithm. For example, parallel merge sort does O(n log n) work — the same as serial merge sort. It is work-efficient. Some parallel algorithms trade extra work for shorter span. That tradeoff is acceptable only if the parallelism is high enough to compensate.
Let us trace through the parallel Fibonacci from Chapter 0 with n = 5.
The computation DAG for FIB(5) has FIB(5) at the root, which spawns FIB(4) and calls FIB(3). FIB(4) spawns FIB(3) and calls FIB(2). And so on, down to the base cases FIB(0) and FIB(1).
When is adding more processors actually useful? When the parallelism T1/T∞ is much larger than P. Specifically, if T1/T∞ ≫ P (there is sufficient slackness), then the greedy scheduling bound T1/P + T∞ is dominated by T1/P, and we get near-linear speedup:
Rule of thumb: you want parallelism to be at least 10× the processor count to achieve near-linear speedup. If parallelism is 100 and P is 10, you are in good shape. If parallelism is 10 and P is 10, much of the potential is wasted on critical path overhead.
| Measure | Symbol | Meaning | Analogy |
|---|---|---|---|
| Work | T1 | Total operations (serial time) | Total person-hours to build a house |
| Span | T∞ | Critical path (infinite processors) | Minimum time even with unlimited workers |
| Parallelism | T1/T∞ | Max useful processors | How many workers can be productive at once |
| Speedup | T1/TP | Factor of improvement on P processors | How much faster the house gets built |
Merge sort is the poster child of divide-and-conquer. It splits the array in half, recursively sorts each half, and merges the results. Making the two recursive calls parallel is trivial — just spawn them. But the merge step is the bottleneck.
The obvious approach: spawn the two recursive sorts and sync before merging.
The work is O(n log n) — same as serial merge sort. But what about the span?
The two recursive calls run in parallel, so the span of the recursion is T∞(n) = T∞(n/2) + Θ(n). The Θ(n) is the sequential merge. Solving: T∞ = Θ(n). That is terrible — the span equals the work order, giving parallelism of only O(log n). The merge is the bottleneck.
The key insight: we can merge two sorted arrays in parallel using a median-splitting strategy. Here is how it works:
Each recursive call handles at most 3n/4 elements (since we split the larger array in half and the smaller array at some point). The span recurrence is:
By the master theorem (or Akra-Bazzi), this solves to T∞ = Θ(log2 n). The work for parallel merge is Θ(n) — same as sequential merge (work-efficient).
With the parallel merge, the span of the full merge sort becomes:
The work remains O(n log n). So the parallelism is:
For n = 10 million, that is about 70,000 — meaning we can productively use tens of thousands of processors. That is real parallelism.
| Algorithm | Work T1 | Span T∞ | Parallelism |
|---|---|---|---|
| Serial merge sort | Θ(n log n) | Θ(n log n) | 1 |
| Naive parallel MS | Θ(n log n) | Θ(n) | Θ(log n) |
| Parallel MS (parallel merge) | Θ(n log n) | Θ(log3 n) | Θ(n / log2 n) |
Watch the fork-join structure. Teal = spawn (fork). Orange = sync (join). Red path = critical path. Compare naive merge (sequential) vs parallel merge.
This bound is crucial and often confuses students. When we pick the median of the larger array T (size m), we split T exactly in half: m/2 elements on each side. The binary search in the smaller array S (size n-m) splits S at some point, say j elements left and (n-m-j) right.
The larger subproblem gets at most m/2 + (n-m) elements. Since m ≥ n-m (T is larger), m/2 + (n-m) ≤ m/2 + m = 3m/2. Since m ≤ n, this is at most 3n/4. Each recursive call handles at most 3/4 of the total elements. This geometric shrinkage is what gives us the Θ(log2 n) span.
python import threading def parallel_merge(T, S, output, t_lo, t_hi, s_lo, s_hi, out_lo): """Merge sorted T[t_lo:t_hi] and S[s_lo:s_hi] into output[out_lo:]""" n_t = t_hi - t_lo n_s = s_hi - s_lo if n_t < n_s: # ensure T is the larger array return parallel_merge(S, T, output, s_lo, s_hi, t_lo, t_hi, out_lo) if n_t == 0: return if n_t + n_s <= 64: # base case: serial merge for small inputs serial_merge(T, S, output, t_lo, t_hi, s_lo, s_hi, out_lo) return mid_t = (t_lo + t_hi) // 2 x = T[mid_t] mid_s = bisect_left(S, x, s_lo, s_hi) # O(log n) binary search out_mid = out_lo + (mid_t - t_lo) + (mid_s - s_lo) output[out_mid] = x # Spawn two independent merges left = threading.Thread(target=parallel_merge, args=(T, S, output, t_lo, mid_t, s_lo, mid_s, out_lo)) left.start() parallel_merge(T, S, output, mid_t+1, t_hi, mid_s, s_hi, out_mid+1) left.join() # sync
Matrix multiplication is the backbone of scientific computing, machine learning, and graphics. Every time a neural network does a forward pass, it multiplies matrices. The standard algorithm is O(n³) for n × n matrices. How much can parallelism help?
Recall: C = A × B where C[i][j] = ∑k A[i][k] · B[k][j]. The triple nested loop:
Let us be concrete about what flows where. Matrix A is n × n (say 1024 × 1024 of float32, that is 4 MB). Matrix B is n × n (another 4 MB). Output C is n × n (4 MB, initialized to zeros). The triple loop visits n³ = 1,073,741,824 multiply-add operations. On a modern CPU doing 64 GFLOPS, this takes about 16 ms. On a GPU doing 100 TFLOPS, it takes about 10 μs. The difference is pure parallelism.
The simplest parallel approach: the n² output entries C[i][j] are independent of each other. Each one is a dot product of row i of A with column j of B. Spawn all n² dot products in parallel.
Each dot product takes Θ(n) work and Θ(log n) span (using parallel reduction to sum n products). The full analysis:
This is spectacularly parallel. For 1000 × 1000 matrices, the parallelism is about 108. The span is only O(log n) because all the dot products run simultaneously, and each dot product's sum is a parallel reduction tree of depth log n.
CLRS presents a recursive approach: partition each n × n matrix into four n/2 × n/2 blocks and perform 8 recursive multiplications plus 4 additions. The recurrence is T(n) = 8T(n/2) + Θ(n²), giving T(n) = Θ(n³) serial.
In parallel, spawn all 8 sub-multiplies. Since the additions must wait for the multiplies (sync), the span is:
Wait — actually, the 8 recursive calls are spawned, but pairs that write to the same output quadrant must be serialized (they add to the same block). So we group them: 4 pairs, each pair sequential. Span becomes T∞(n) = 2 · T∞(n/2) + Θ(log n), which solves to Θ(n).
To fix this, we can use a temporary matrix to accumulate partial products, then add in parallel. With care, the span drops to Θ(log2 n). Strassen's algorithm reduces the 8 multiplications to 7, giving work Θ(nlog27) ≈ Θ(n2.81) while maintaining excellent parallelism.
Each cell C[i][j] is an independent dot product. Watch how P processors divide the n² cells among themselves. Drag the slider to change matrix size.
| Method | Work | Span | Parallelism |
|---|---|---|---|
| Loop parallelism | Θ(n³) | Θ(log n) | Θ(n³/log n) |
| D&C parallel | Θ(n³) | Θ(log2 n) | Θ(n³/log2 n) |
| Strassen parallel | Θ(n2.81) | Θ(log2 n) | Θ(n2.81/log2 n) |
python import numpy as np from concurrent.futures import ThreadPoolExecutor def parallel_matmul(A, B, n_workers=4): """Parallel matrix multiply: each worker computes a block of rows.""" n = A.shape[0] C = np.zeros((n, n)) chunk = n // n_workers def compute_rows(start): end = min(start + chunk, n) C[start:end] = A[start:end] @ B # each worker writes to disjoint rows with ThreadPoolExecutor(max_workers=n_workers) as pool: futures = [pool.submit(compute_rows, i * chunk) for i in range(n_workers)] for f in futures: f.result() # sync: wait for all workers return C # No races: each worker writes to disjoint rows of C # Work: O(n³) same as serial. Span: O(n³/P) per worker = O(n²·n/P)
Parallelism gives speed, but it also creates a new class of bugs that do not exist in sequential programs. The most insidious is the race condition: when two parallel tasks access the same memory location, and at least one writes to it, the result depends on who gets there first.
Suppose two threads both execute x = x + 1 where x starts at 0. You expect x = 2 afterward. But the operation x = x + 1 is not atomic — it involves three steps:
If both threads read x = 0 before either writes, they both compute 0 + 1 = 1, and both write 1. The final value is 1 instead of 2. One increment was lost. This is a determinacy race: the program produces different results depending on the scheduling order.
Let us trace the two possible interleavings explicitly:
| Type | Description | Example |
|---|---|---|
| Read-Write race | One thread reads while another writes the same location | Thread A reads x, Thread B writes x |
| Write-Write race | Two threads write to the same location | Both threads do x = something |
| Read-Read | Two threads read the same location | Not a race — reads are safe |
CLRS uses the Cilk concurrency model (from MIT, later adopted by Intel). In Cilk, a program is deterministic if its parallel execution always produces the same result as its serial execution. A program with no determinacy races is guaranteed to be deterministic.
The rule is simple: if a spawned task and its continuation (the code that runs after the spawn) both access the same variable, and at least one writes, that is a race. The fix? Either make sure they access disjoint memory, or put the access after the sync.
The simulation below shows a race condition in action. Two threads try to increment a shared counter 5 times each. Watch the interleaving — some increments are lost because both threads read the old value before either writes.
Two threads each increment a shared counter. Watch for red conflicts where both read the same value — one increment is lost. Expected result: 10. Actual result: usually less.
How do you know if your parallel program has a race? CLRS gives a clean definition: a program has a determinacy race if two logically parallel instructions access the same memory location and at least one is a write. "Logically parallel" means there is no spawn-sync path ordering them — the scheduler could execute them in either order.
The Cilk runtime includes a race detector called Cilkscreen (analogous to ThreadSanitizer for C++ or -race flag for Go). It runs the program serially but tracks which memory locations are accessed by which logical tasks, flagging any pair that could conflict. The overhead is about 4-6× slowdown, which is acceptable for testing but not production.
| Pattern | Description | Example |
|---|---|---|
| Disjoint writes | Each task writes to a separate region | Task i fills output[i*chunk..(i+1)*chunk] |
| Reduce after sync | Each task accumulates locally, merge after sync | Local counters merged at sync point |
| Copy-in / copy-out | Each task gets its own copy of input data | Each merge sort call gets its own buffer |
| Functional style | No mutation; return new values | Recursive fib returns values, never writes shared state |
python import threading counter = 0 lock = threading.Lock() def unsafe_increment(): global counter for _ in range(100000): counter += 1 # RACE: read-modify-write is not atomic def safe_increment(): global counter for _ in range(100000): with lock: # mutex ensures mutual exclusion counter += 1 # Unsafe: result varies (often ~150,000 instead of 200,000) threads = [threading.Thread(target=unsafe_increment) for _ in range(2)] for t in threads: t.start() for t in threads: t.join() print(f"Unsafe: {counter}") # NOT 200000!
spawn x = x + 1 and Thread B (the continuation) executes x = x + 1 before the sync. Both start with x = 0. What are the possible final values of x?You have an algorithm decomposed into a DAG of tasks. You have P processors. The question: how do you assign tasks to processors? A bad schedule wastes processors. A good schedule keeps everyone busy. The best schedule is greedy.
A greedy scheduler follows one rule: at every time step, if there is a ready task (all its dependencies are satisfied) and an idle processor, assign the task to that processor immediately. Never leave a processor idle when work is available.
This sounds too simple to be optimal. But it is nearly optimal:
This is the greedy scheduling bound. The first term is the perfect parallelism case (divide work equally). The second is the unavoidable critical path. The bound says a greedy scheduler is within a factor of 2 of optimal:
So TP(greedy) ≤ 2 · TP*. A greedy schedule is always at most twice the optimal. In practice, it is much better.
Greedy scheduling sounds great in theory, but in practice, a centralized scheduler that tracks all ready tasks is a bottleneck itself. The practical solution is work-stealing.
Each processor maintains a double-ended queue (deque) of tasks. When a processor spawns tasks, it pushes them onto the bottom of its own deque. When it finishes a task, it pops the next one from its deque. When its deque is empty — it has no work — it steals from a random other processor's deque (from the top, to minimize contention).
The expected running time of a work-stealing scheduler is:
Same bound as centralized greedy, but fully distributed. This is what Cilk, Intel TBB, Java ForkJoinPool, and Rust's Rayon all use.
This is a subtle but important design choice. Each processor's deque is ordered so that the most recently spawned tasks (smallest, deepest in the recursion tree) are at the bottom, and the oldest tasks (largest, shallowest) are at the top.
When a processor works on its own deque, it pops from the bottom (LIFO). This matches the natural recursion order and keeps cache-hot data close. When a thief steals, it takes from the top (FIFO) — grabbing the largest task. Why?
The proof of TP ≤ T1/P + T∞ is elegant. Classify each time step as either:
Every step is either complete or incomplete. So: TP = (complete steps) + (incomplete steps) ≤ T1/P + T∞. QED.
Consider a parallel computation with 4 independent chains of length 3 (total work = 12, span = 3).
The simulation below is the showcase. You control the number of processors and watch them execute a parallel task DAG using greedy scheduling with work-stealing. Idle processors steal from busy ones. The speedup graph updates in real time.
Watch P processors execute a task DAG. Teal = working. Gray = idle. Orange = stealing. Bottom graph shows speedup vs. processor count. Adjust P and task count.
Certain patterns appear over and over in parallel algorithms. Once you recognize them, you can parallelize almost anything. These are the "design patterns" of parallel computing.
Map applies a function to every element independently. No element depends on any other. This is embarrassingly parallel.
Examples: applying a filter to every pixel in an image (image processing on GPU — millions of independent pixels). Computing the square of every number in an array. Tokenizing every document in a corpus. Running inference on a batch of inputs through a neural network.
Map is so common that most languages have built-in support: Python's map(), Java streams' .map(), Rust's .par_iter().map(), CUDA's thread-per-element model. When someone says a problem is "embarrassingly parallel," they mean it is a map.
Reduce (also called fold) combines all elements into a single value using an associative operator ⊕. The key insight: because ⊕ is associative, we can group operations into a binary tree.
Work is Θ(n) (same as sequential). Span is Θ(log n) (tree depth). Parallelism is Θ(n / log n).
Scan computes all prefix sums: given [a1, a2, ..., an], produce [a1, a1⊕a2, a1⊕a2⊕a3, ...]. This seems inherently sequential — each output depends on all previous outputs. But there is a brilliant parallel algorithm.
The Blelloch scan uses two phases:
Scan is the Swiss army knife of parallel computing. It is used in: parallel sorting (radix sort), stream compaction (filtering), polynomial evaluation, solving tridiagonal systems, and even building parallel data structures.
The expected output (inclusive scan) is [3, 4, 8, 9, 14, 23, 25, 31]. Let us trace the Blelloch algorithm step by step.
Watch the up-sweep build partial sums, then the down-sweep distribute prefix sums. Orange = active at current step. Teal = completed.
The fork-join pattern is what we have been studying all along: recursively split a problem, solve subproblems in parallel, combine results. It is the general case that subsumes map and reduce.
Combine map and reduce and you get MapReduce — Google's framework for processing petabytes of data across thousands of machines. The map phase processes data in parallel. A shuffle phase groups results by key. The reduce phase combines each group.
python from functools import reduce from concurrent.futures import ProcessPoolExecutor def word_count_map(document): """Map: emit (word, 1) for each word""" counts = {} for word in document.split(): counts[word] = counts.get(word, 0) + 1 return counts def word_count_reduce(a, b): """Reduce: merge two count dicts""" result = dict(a) for k, v in b.items(): result[k] = result.get(k, 0) + v return result documents = ["the cat sat", "the dog sat", "the cat ran"] with ProcessPoolExecutor() as pool: mapped = list(pool.map(word_count_map, documents)) # parallel map total = reduce(word_count_reduce, mapped) # reduce # {'the': 3, 'cat': 2, 'sat': 2, 'dog': 1, 'ran': 1}
| Pattern | Input → Output | Work | Span | Example |
|---|---|---|---|---|
| Map | [a1..an] → [f(a1)..f(an)] | Θ(n) | Θ(1) | Image filter |
| Reduce | [a1..an] → a1⊕...⊕an | Θ(n) | Θ(log n) | Sum, max |
| Scan | [a1..an] → [prefix sums] | Θ(n) | Θ(log n) | Prefix sum, rank |
| Map-Reduce | Collection → aggregation | varies | varies | Word count |
The fork-join model from CLRS is the theoretical foundation. But in practice, parallel computing spans a spectrum of hardware and programming models. Understanding where each one fits is essential for choosing the right tool.
The fork-join model has been implemented in many languages and frameworks. Each one makes slightly different trade-offs between ease of use, overhead, and expressiveness. Here is the landscape:
| Technology | Model | Typical P | Best For |
|---|---|---|---|
| OpenMP | Pragma annotations on loops | 4-128 cores | Scientific computing, legacy C/Fortran |
| Intel TBB | Task-based, work-stealing | 4-128 cores | C++ applications, dynamic tasks |
| Rust Rayon | Fork-join with data parallelism | 4-128 cores | Safe parallelism (no races by construction) |
| Go goroutines | M:N green threads | 4-128 cores | I/O-bound concurrency, servers |
| Java ForkJoinPool | Work-stealing | 4-128 cores | JVM applications |
Among these, Rust Rayon deserves special mention. Rust's ownership system guarantees at compile time that parallel tasks cannot have data races. If your code compiles, it is race-free. This is the Cilk dream realized through a type system: the compiler enforces disjoint writes without runtime overhead. Changing .iter() to .par_iter() parallelizes a loop, and if it compiles, it is safe.
A GPU is a massively parallel processor with thousands of simple cores. NVIDIA's CUDA programming model uses SIMT (Single Instruction, Multiple Threads): groups of 32 threads (a warp) execute the same instruction simultaneously on different data.
GPUs excel at the map pattern: apply the same function to millions of data points. Matrix multiplication, convolution, and transformer attention are all map-heavy, which is why GPUs dominate deep learning.
A typical NVIDIA H100 GPU has 16,896 CUDA cores and 528 Tensor Cores. Peak throughput: 3,958 TFLOPS for FP8. Compare to a 64-core AMD EPYC CPU at ~10 TFLOPS. That is a 400× difference for data-parallel work. The tradeoff: each GPU core is simple (no branch prediction, small cache), so complex branchy code runs slower on GPU than CPU.
python # CUDA kernel (conceptual — runs on GPU) # Each thread handles one output element # This is the MAP pattern: same function, different data # Launch config: 256 threads per block, n/256 blocks # Total threads = n (one per element) def vector_add_kernel(a, b, c, n): idx = blockIdx.x * blockDim.x + threadIdx.x if idx < n: c[idx] = a[idx] + b[idx] # each thread does ONE add # Work = n, Span = 1 (all threads execute in parallel) # Parallelism = n (embarrassingly parallel)
In real programs, some fraction s of the work is inherently sequential (initialization, I/O, synchronization). Amdahl's Law says the maximum speedup is:
As P → ∞, the speedup approaches 1/s. If 5% of your program is sequential (s = 0.05), the maximum speedup is 20× no matter how many processors you add. This is sobering: a tiny serial fraction caps your speedup. Even with a million processors, you can never go more than 20× faster. The implication: to scale to large P, you must relentlessly minimize the serial fraction.
Amdahl's Law assumes a fixed problem size. Gustafson's Law (1988) takes a different perspective: as you add processors, you also increase the problem size (bigger models, more data). The serial fraction stays constant in absolute time, but the parallel fraction grows. Scaled speedup is:
This is linear in P. In practice, Gustafson's Law explains why GPU clusters with 10,000 GPUs are useful: you train bigger models, not the same model faster.
Suppose you have a program that takes 100 seconds on 1 processor: 10 seconds serial, 90 seconds parallelizable. So s = 0.1.
Adjust the serial fraction to see how it caps maximum speedup. Left: speedup vs processors. The dashed line shows ideal (linear) speedup for comparison.
python def amdahl_speedup(s, P): """Amdahl's Law: s = serial fraction, P = processors""" return 1.0 / (s + (1 - s) / P) def gustafson_speedup(s, P): """Gustafson's Law: scaled speedup""" return s + P * (1 - s) # Example: 10% serial fraction for P in [1, 4, 16, 64, 256, 1024]: print(f"P={P:4d}: Amdahl={amdahl_speedup(0.1, P):.1f}x, Gustafson={gustafson_speedup(0.1, P):.1f}x") # P= 1: Amdahl=1.0x, Gustafson=1.0x # P= 4: Amdahl=3.1x, Gustafson=3.7x # P= 16: Amdahl=6.4x, Gustafson=14.5x # P= 64: Amdahl=8.8x, Gustafson=57.7x # P= 256: Amdahl=9.7x, Gustafson=230.5x # P=1024: Amdahl=9.9x, Gustafson=921.7x ← Amdahl caps at 10x!
| Algorithm | Work T1 | Span T∞ | Parallelism |
|---|---|---|---|
| Parallel sum (reduce) | Θ(n) | Θ(log n) | Θ(n/log n) |
| Parallel prefix sum (scan) | Θ(n) | Θ(log n) | Θ(n/log n) |
| Parallel merge sort | Θ(n log n) | Θ(log3 n) | Θ(n/log2 n) |
| Parallel matrix multiply | Θ(n³) | Θ(log n) | Θ(n³/log n) |
| Parallel BFS | Θ(V+E) | Θ(D · log V) | Θ((V+E)/(D log V)) |
| Parallel map | Θ(n) | Θ(1) | Θ(n) |
| Parallel for loop | Θ(n · body) | Θ(log n + body) | Θ(n/log n) |
Q: Given an n-node binary tree, compute the size of every subtree in parallel.
A: This is a post-order traversal. For each node, subtree_size = 1 + left_size + right_size. The two recursive calls (left and right children) are independent — spawn them. After sync, add 1 + left + right. Work: Θ(n). Span: Θ(height). For a balanced tree, span is Θ(log n), giving parallelism Θ(n / log n). For a degenerate (linked-list) tree, span is Θ(n) and parallelism is 1 — no speedup possible.
Q: Parallelize counting sort for n integers in range [0, k].
A: The four-phase approach uses the parallel patterns from Chapter 6:
Total work: Θ(n + k · P). Span: Θ(n/P + k · log P + log k). For k = O(n) and P = O(n/log n), this is efficient.
Q: You have a function that processes web pages. Each page takes 50-200ms to fetch and 1ms to parse. How do you parallelize?
Q: You need to compute the product of all elements in an array. Can you parallelize it?
A: Yes! Multiplication is associative and commutative, so it is a reduce operation. Build a binary reduction tree: pair up elements, multiply pairs in parallel, repeat. Work: Θ(n). Span: Θ(log n). Same as parallel sum but with × instead of +. This generalizes to any associative operation: min, max, AND, OR, XOR, GCD, matrix multiplication (associative but not commutative), string concatenation. The only requirement is associativity: (a ⊕ b) ⊕ c = a ⊕ (b ⊕ c). Commutativity is not needed.
Q: Can you parallelize Dijkstra's shortest-path algorithm?
A: This is a tricky case. Dijkstra's algorithm processes vertices in order of distance, and each iteration depends on the previous (the closest unvisited vertex changes after each relaxation). The algorithm is inherently sequential in structure. However, the relaxation step (updating neighbors) can be parallelized: if the current vertex has d outgoing edges, we can relax all d neighbors in parallel. Work: O((V + E) log V). Span: O(V · (log V + Δ)), where Δ is the max degree. Parallelism is only O(E / (V · Δ)) — modest. For better parallelism, use Δ-stepping or parallel BFS-based approaches.
Q: You have a function that processes web pages. Each page takes 50-200ms to fetch and 1ms to parse. How do you parallelize?
A: This is I/O-bound, not compute-bound. The serial fraction is tiny (parsing is negligible). Use asynchronous I/O (asyncio in Python, goroutines in Go) rather than threads. Launch all fetches concurrently. Amdahl's Law says speedup ≈ P for I/O-bound work because s ≈ 0. With 100 concurrent fetches, you approach 100× speedup. Key insight: for I/O-bound work, you want concurrency (overlap waiting), not parallelism (multiple CPUs). Python's GIL does not matter here because threads spend 99% of their time waiting, not computing. Use asyncio for thousands of concurrent fetches with a single thread.
These derivations come up constantly in interviews. Practice them until they are automatic.
Parallel for loop over n iterations, each costing O(1):
Two nested parallel for loops (n × n):
Parallel for with O(n) body:
python from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor import numpy as np # Drill 1: Parallel map def parallel_map(func, data, n_workers=4): with ProcessPoolExecutor(max_workers=n_workers) as pool: return list(pool.map(func, data)) # Drill 2: Parallel reduce (tree-based) def parallel_reduce(arr, op): if len(arr) == 1: return arr[0] mid = len(arr) // 2 with ThreadPoolExecutor(max_workers=2) as pool: left = pool.submit(parallel_reduce, arr[:mid], op) right = pool.submit(parallel_reduce, arr[mid:], op) return op(left.result(), right.result()) # Drill 3: Parallel prefix sum (sequential, then parallelize) def prefix_sum(arr): """Sequential prefix sum for reference""" result = [0] * len(arr) result[0] = arr[0] for i in range(1, len(arr)): result[i] = result[i-1] + arr[i] return result # Drill 4: Work-span analysis def analyze(work, span, P): parallelism = work / span tp_bound = max(work / P, span) greedy_bound = work / P + span speedup = work / greedy_bound print(f"Work={work}, Span={span}, P={P}") print(f"Parallelism={parallelism:.0f}") print(f"Lower bound: {tp_bound:.0f}") print(f"Greedy bound: {greedy_bound:.0f}") print(f"Speedup: {speedup:.1f}x") analyze(work=1000000, span=20, P=64) # Work=1000000, Span=20, P=64 # Parallelism=50000 # Lower bound: 15625 # Greedy bound: 15645 # Speedup: 63.9x ← near-linear speedup!
| Mistake | Why It Is Wrong | Correct Thinking |
|---|---|---|
| "Just parallelize the inner loop" | Inner loop iterations may have dependencies (loop-carried deps) | Analyze the dependency graph first. Only parallelize truly independent iterations. |
| "More threads = more speed" | Ignores Amdahl's Law and overhead | Speedup is bounded by serial fraction and limited by T∞ |
| "Use a lock to fix the race" | Locks serialize access, defeating parallelism | Redesign to avoid shared mutable state. Use disjoint writes + reduce. |
| "Fork every element" | Spawn overhead per element dominates for small work | Use coarsening: fork into chunks (e.g., 1024 elements per task). This is the grain size optimization. |
| "Span = depth of recursion tree" | Span includes combine steps, not just recursion | Span = depth × cost-per-level + any serial combine work |
Multithreaded algorithms are one layer in a deep stack of parallel computing. The fork-join model we studied is the cleanest and most analyzable, but real systems combine multiple paradigms.
| Topic | This Chapter | In Practice |
|---|---|---|
| Model | Fork-join (Cilk) | + message passing (MPI), BSP, dataflow, actor model |
| Scheduling | Greedy, work-stealing | + gang scheduling, space sharing, OS-level CFS |
| Hardware | Shared memory multicore | + GPU (SIMT), distributed clusters, TPUs, FPGAs |
| Correctness | Determinacy races | + locks, semaphores, monitors, lock-free algorithms, transactional memory |
| Analysis | Work-span | + cache complexity, communication complexity, PRAM models |
The CLRS model assumes shared memory with uniform access time (any processor can read any memory location equally fast). Real machines have caches, NUMA (Non-Uniform Memory Access), and memory hierarchies. A parallel algorithm that is theoretically optimal can be slow in practice because of cache misses — data is not where the processor needs it. Cache-oblivious algorithms (also in CLRS, Ch 27 problems) address this.
Additionally, the model assumes that spawning a task is free. In practice, each spawn involves allocating a task descriptor, pushing to a deque, and potentially stealing involves inter-core communication. For very fine-grained tasks (e.g., adding two numbers), the spawn overhead dwarfs the computation. The practical fix is coarsening: use sequential execution below a grain size threshold, and only fork for sufficiently large subproblems. Cilk Plus and Rayon handle this automatically with adaptive grain sizes.
The model also ignores communication cost. On distributed systems (multiple machines connected by a network), sending data between nodes takes orders of magnitude longer than a local memory access. Algorithms designed for shared memory do not work well when the "shared" memory is actually a network.
Parallelism exists at every level of the computing stack. Chapter 27 focuses on algorithmic parallelism (the top level), but it helps to know what is below:
| Level | Mechanism | Who Controls It |
|---|---|---|
| Bit-level | 64-bit ALU processes 64 bits at once | Hardware |
| Instruction-level (ILP) | CPU pipeline, out-of-order execution, superscalar | Hardware + compiler |
| SIMD/Vector | AVX-512: one instruction on 16 floats at once | Compiler + intrinsics |
| Thread-level (TLP) | Multiple threads on multiple cores | Programmer (this chapter) |
| Node-level | Multiple machines, MPI, gRPC | Programmer + framework |
A well-optimized matrix multiply exploits ALL five levels simultaneously: SIMD registers process 16 floats per instruction, the CPU pipeline overlaps multiply and add, multiple cores each handle a tile of the output matrix, and across a cluster, each node handles a block of the computation. The theoretical analysis from CLRS (work/span) captures the thread-level parallelism; the other levels are implementation details that affect the constants.
Wafer-scale computing. Cerebras builds a single chip that is an entire 300mm silicon wafer — 850,000 cores, 40 GB of on-chip memory. No off-chip communication. The fork-join model maps naturally to this architecture. The key advantage: span-limited computations (like the layers of a neural network) benefit from zero inter-chip latency. Communication cost, normally the bottleneck in distributed systems, becomes negligible.
Chiplet architectures. AMD's EPYC processors use chiplets — multiple small dies connected on a package. A 128-core EPYC is actually 8 chiplets of 16 cores each. Memory access is non-uniform (NUMA): accessing your own chiplet's memory is fast; accessing another chiplet's memory is 2-3× slower. Algorithms from this chapter assume uniform memory access, but NUMA-aware scheduling can yield significant speedups by keeping data close to the processor that uses it.
Quantum parallelism. A quantum computer with n qubits can represent 2n states simultaneously. Shor's algorithm for factoring exploits this "parallelism" — though it is fundamentally different from classical parallelism (you cannot observe all 2n states; you must cleverly extract the answer through interference). Quantum computing does not replace classical parallelism; it addresses fundamentally different problem classes.
Photonic computing. Startups like Lightmatter and Luminous are building processors that use light instead of electrons. A photonic matrix multiplier computes in the time it takes light to pass through a chip (~0.1 ns), regardless of matrix size. This is constant-span matrix multiply — T∞ = O(1). The catch: precision is limited to about 4-8 bits, making it suitable for inference but not training.
Neural network training. Training GPT-4 required ~25,000 A100 GPUs for 90 days. The parallelism is a mix of three strategies, each mapping directly to concepts from this chapter:
| Strategy | CLRS Concept | How It Works |
|---|---|---|
| Data parallelism | Map pattern | Each GPU processes a different batch of data, computing gradients independently. Gradients are averaged (reduce) across GPUs. Work-efficient: same total compute as serial training. |
| Tensor parallelism | Parallel matrix multiply | Large weight matrices are split across GPUs. Each GPU computes a slice of the matrix multiply. Results are gathered (all-reduce). Span limited by communication latency. |
| Pipeline parallelism | Pipelining (overlap) | Different layers of the network on different GPUs. While GPU 1 processes layer 1 of batch 2, GPU 2 processes layer 2 of batch 1. Reduces idle time (the "bubble") but adds latency per sample. |
The critical path through the training pipeline determines the minimum training time. In practice, communication between GPUs (NVLink at 900 GB/s, InfiniBand at 400 Gb/s) is the main bottleneck — a manifestation of span that the pure fork-join model does not capture.
In the work-span framework: training has work proportional to (model size) × (dataset size) × (epochs). The span is proportional to (layers) × (communication rounds per step). Data parallelism reduces work per GPU but adds communication (all-reduce of gradients). Tensor parallelism reduces work per GPU but adds inter-GPU communication per layer. The art of distributed training is minimizing span while keeping work-efficiency close to 1.0.
python # Rust (Rayon) — the closest to Cilk's model # use rayon::prelude::*; # fn parallel_sum(arr: &[f64]) -> f64 { # arr.par_iter().sum() // parallel reduce # } # Go — goroutines (M:N threading, not fork-join but similar) # func parallelSum(arr []float64) float64 { # ch := make(chan float64, 2) # mid := len(arr) / 2 # go func() { ch <- sum(arr[:mid]) }() // spawn # go func() { ch <- sum(arr[mid:]) }() // spawn # return <-ch + <-ch // sync # } # Java — ForkJoinPool (work-stealing scheduler) # class SumTask extends RecursiveTask<Long> { # protected Long compute() { # if (hi - lo < THRESHOLD) return seqSum(arr, lo, hi); # int mid = (lo + hi) / 2; # SumTask left = new SumTask(arr, lo, mid); # left.fork(); // spawn # long right = new SumTask(arr, mid, hi).compute(); # return left.join() + right; // sync # } # } # Python — concurrent.futures (limited by GIL for CPU work) from concurrent.futures import ProcessPoolExecutor def parallel_sum(arr, n_workers=4): chunk = len(arr) // n_workers chunks = [arr[i*chunk:(i+1)*chunk] for i in range(n_workers)] with ProcessPoolExecutor(n_workers) as pool: return sum(pool.map(sum, chunks))
parallel_for in C++, .par_iter() in Rust, or ForkJoinPool in Java, you are using the exact model from this chapter. Work-stealing is the scheduler behind all of them. The work-span framework is how you reason about their performance. This theory is the foundation of every practical parallel system in use today.| # | Concept | One-Line Summary |
|---|---|---|
| 1 | Fork-Join Model | Spawn parallel tasks, sync when you need results |
| 2 | Work (T1) | Total operations = serial running time |
| 3 | Span (T∞) | Critical path = lower bound with infinite processors |
| 4 | Parallelism | T1/T∞ = max useful processors |
| 5 | Greedy bound | TP ≤ T1/P + T∞ |
| 6 | Work-stealing | Idle processors steal large tasks from busy ones |
| 7 | Parallel merge | Median splitting + binary search reduces merge span to O(log² n) |
| 8 | Determinacy races | Two parallel accesses to same location, one is write = nondeterminism |
| 9 | Amdahl's Law | Serial fraction s caps speedup at 1/s |
| 10 | Scan (prefix sum) | The Swiss army knife: O(n) work, O(log n) span |
"The way to go fast is not to do less work, but to do work at the same time." — Charles Leiserson (Cilk co-inventor, CLRS co-author)