Multithreaded Algorithms — CLRS Chapter 27

Chapter 0: The Problem

For forty years, software engineers had a free lunch. Every 18 months, clock speeds doubled. Your program ran twice as fast without you changing a single line of code. Moore's Law delivered performance on a silver platter. The Intel 4004 (1971) ran at 740 kHz. The Pentium 4 (2004) hit 3.8 GHz — a 5,000× increase in clock speed over 33 years.

Then, around 2005, the lunch ended. Clock speeds hit a wall near 4 GHz. Physics intervened: pushing electrons faster through silicon generates heat proportional to the cube of the frequency. A 4 GHz chip dissipates roughly 100 watts. Doubling to 8 GHz would require 800 watts — enough to melt the chip. This is called the power wall, and it ended the era of single-core speedups.

Intel, AMD, and ARM did not give up on performance — they changed strategy. Instead of making one core faster, they put multiple cores on the same chip. Your laptop today has 8 to 16 cores. A GPU has thousands. A modern data center node might have 128 CPU cores and 8 GPUs with 80,000 CUDA cores total.

But here is the uncomfortable truth: a 16-core machine running a single-threaded program uses exactly one core. The other 15 sit idle, burning power and generating heat while doing absolutely nothing useful. To use them, you must rewrite your algorithm to do multiple things at once. That is parallel computing, and it is no longer optional — it is the only path to faster programs.

The stakes are enormous. Machine learning models like GPT-4 would take over 100 years to train on a single GPU. Weather simulations that predict hurricanes 5 days in advance require trillions of floating-point operations per second. Protein folding, drug discovery, autonomous vehicles — every hard computational problem today is solved by parallel algorithms running on parallel hardware. If you write sequential code, you are voluntarily handicapping yourself to 1/16th (or 1/1000th on a GPU) of your machine's capacity.

The fundamental shift. Sequential speed is capped by physics. The only way forward is parallelism. But parallelism is not free: you must decompose your problem into independent pieces, coordinate their execution, and avoid conflicts when they touch shared data. This chapter teaches you the theory and practice of doing exactly that.

The model we will study is fork-join parallelism, also called the dynamic multithreading model. The idea is simple:

1. Fork (Spawn)

When you encounter independent subproblems, launch them as parallel tasks. A spawn says "start this in the background and keep going." Like splitting a stack of exams between two friends.

↓

2. Work

Each task runs independently on whatever processor is available. No coordination needed while tasks are working on separate data.

↓

3. Join (Sync)

When you need the results of all spawned tasks, you sync — wait for everyone to finish before continuing. Like waiting for both friends to return their sorted halves before you merge.

This is the same pattern as divide-and-conquer, but with a twist: the recursive calls run simultaneously instead of one after the other. Merge sort becomes parallel merge sort by spawning the two recursive calls. The merge step still waits for both halves to finish (sync), then combines them.

Serial consistency guarantee. A correctly written fork-join program always produces the same result as its serial version. If you remove all spawn and sync keywords, you get a valid serial program. This means you can debug by running serially, then add parallelism for speed. This is enormously powerful compared to traditional multithreading, where the parallel and serial versions may behave completely differently.

The model makes a simplifying assumption: we have an ideal scheduler that can assign any ready task to any idle processor in O(1) time, and memory access is uniform (any processor can read any memory location at equal cost). Real hardware violates both assumptions (schedulers are not free; NUMA and caches create non-uniform access), but the model gives tight performance bounds that hold remarkably well in practice.

The simulation below shows the difference. On the left, a sequential program executes tasks one at a time. On the right, a fork-join program overlaps independent tasks. Watch how the total time drops even though the total work stays the same.

Sequential vs. Fork-Join Execution

Left: tasks execute one by one. Right: independent tasks run in parallel. Same total work, less wall-clock time. Adjust processor count to see the effect.

Processors 4

Click Run to see sequential vs parallel

The Cilk Keywords

CLRS uses the Cilk concurrency model from MIT (later commercialized by Intel). Cilk extends a serial language with exactly three keywords:

Keyword	Meaning	Analogy
spawn	Launch a function call as a parallel task. The caller continues immediately without waiting.	"Hey friend, sort this half while I sort mine."
sync	Wait for all previously spawned tasks to complete before continuing.	"Wait for my friend to finish before I merge."
parallel for	Execute all iterations of a loop in parallel. Syntactic sugar for spawn + sync on each iteration.	"Everyone take one exam and grade it."

The beauty of this model is that a Cilk program with all spawn/sync removed is a valid serial program that produces the correct answer. The parallel keywords are performance hints, not correctness requirements. This property — serial equivalence — makes reasoning about correctness much easier than traditional threading.

// Fibonacci: the simplest fork-join example
FIB(n):
  if n ≤ 1: return n
  x = spawn FIB(n - 1) // compute fib(n-1) in parallel
  y = FIB(n - 2)         // compute fib(n-2) serially (continuation)
  sync                    // wait for x to be ready
  return x + y

// Work: T₁ = Θ(φⁿ) — exponential (same as serial fib)
// Span: T∞ = Θ(n) — the longest chain of dependencies
// Parallelism: Θ(φⁿ/n) — enormous for large n

The key questions this chapter answers are: How fast can a parallel algorithm go? What is the theoretical limit? How do we schedule tasks across processors efficiently? And most importantly: What goes wrong when parallel tasks step on each other's toes?

Concept check: You have an 8-core machine running a purely sequential program. How many cores are actually doing useful work?

1 — sequential code uses a single core; the other 7 are idle 8 — the OS automatically distributes work 4 — half the cores handle overhead It depends on the problem size

Chapter 1: Work and Span

To reason about parallel algorithms, we need two numbers that capture everything important about their performance. Forget about the number of processors for a moment — the algorithm itself has intrinsic properties that determine how parallelizable it is.

Work: Total Operations

Work, written T₁, is the total number of operations the algorithm performs. It is exactly the running time on a single processor — the serial time. Think of it as the total amount of "stuff to do." If your algorithm does n additions and n multiplications, the work is 2n regardless of how many processors you have. Work is a property of the algorithm, not the hardware. A 1-core machine takes T₁ time. A 100-core machine still does T₁ total operations — it just spreads them across cores.

Span: The Critical Path

Span, written T_∞, is the longest chain of sequentially dependent operations — the critical path. It is the time the algorithm would take with an infinite number of processors. Even if you had a million cores, you cannot go faster than the span, because the operations along the critical path must happen one after another. The subscript ∞ reminds us: this is the time with infinite processors. In the DAG picture, the span is the length of the longest path from any source to any sink.

The dependency chain analogy. Imagine cooking a meal. Chopping vegetables (5 min) and boiling water (10 min) can happen in parallel. But you cannot put pasta in the pot until the water boils. The critical path is: boil water (10 min) → cook pasta (8 min) = 18 min. Even with 100 sous-chefs, the meal takes at least 18 minutes. The total work (5 + 10 + 8 = 23 min) tells you how long it takes with one cook. The span (18 min) tells you the absolute minimum time regardless of manpower.

Parallelism

The ratio T₁ / T_∞ is called the parallelism. It tells you the maximum useful number of processors. If work = 1000 and span = 10, the parallelism is 100 — adding more than 100 processors gives no further speedup. Beyond 100, extra processors sit idle because there are not enough independent tasks to keep them busy. This is the algorithm's "parallelizability score."

Some algorithms have enormous parallelism. Parallel matrix multiply has parallelism Θ(n³/log n) — for 1000×1000 matrices, that is about 100 million. Other algorithms have pitiful parallelism. Sequential scan (prefix sum, without the Blelloch trick) has parallelism 1 — completely serial. The goal of parallel algorithm design is to maximize parallelism while keeping work efficient.

The Work-Span Bound

The running time on P processors, T_P, is bounded from below by two constraints:

T_P ≥ max(T₁ / P, T_∞)

The first term says you cannot do better than dividing the work equally. The second says you cannot beat the critical path. A good parallel algorithm minimizes both: low span (short critical path) and efficient work (not much more than the serial algorithm).

Speedup

Speedup on P processors is T₁ / T_P. If T_P = T₁ / P, we have linear speedup — the ideal case. If T_P = T₁ · 1 (no improvement), speedup is 1. Perfect parallelism gives speedup equal to P.

The Computation DAG

Every parallel computation can be represented as a directed acyclic graph (DAG). Each node represents a unit of work (an instruction, a function call, a loop body). Each edge represents a dependency: "this must finish before that can start." The DAG captures everything about the parallel structure of the algorithm.

There are three types of edges in a fork-join DAG:

Edge Type	What It Represents	Direction
Continuation	Sequential control flow: one instruction after another	Within the same thread, top to bottom
Spawn	A parent spawns a child task. The child begins; the parent continues.	Parent → child (creating a new branch)
Join (sync)	A task waits for its spawned children to finish	Child → sync point in parent

The work T₁ is the total number of nodes (summing all node weights if nodes have different costs). The span T_∞ is the weight of the longest path from any source node (no predecessors) to any sink node (no successors). Finding the critical path is equivalent to finding the longest path in a DAG, which can be done in O(V + E) time using topological sort.

Why DAG, not tree? A fork-join computation forms a tree of spawn/sync relationships, but the dependency graph is a DAG because sync edges create cross-links. When a parent syncs, it depends on all its children, creating multiple incoming edges at the sync point. These sync edges are what limit parallelism — every sync is a point where independent work must wait for slow siblings to finish.

The simulation below shows a computation DAG (directed acyclic graph). Each node is an operation, each edge is a dependency. The work is the total number of nodes. The span is the length of the longest path from source to sink, highlighted in orange. Adjust the DAG shape and see how work and span change.

Computation DAG: Work and Span

Each circle is one operation. Arrows show dependencies. Orange path = critical path (span). Total nodes = work. Adjust the branching factor to see how parallelism changes.

Branching 3

Depth 3

Work Efficiency

A parallel algorithm is work-efficient if its work matches the best serial algorithm. For example, parallel merge sort does O(n log n) work — the same as serial merge sort. It is work-efficient. Some parallel algorithms trade extra work for shorter span. That tradeoff is acceptable only if the parallelism is high enough to compensate.

Worked Example: Analyzing Parallel Fibonacci

Let us trace through the parallel Fibonacci from Chapter 0 with n = 5.

The computation DAG for FIB(5) has FIB(5) at the root, which spawns FIB(4) and calls FIB(3). FIB(4) spawns FIB(3) and calls FIB(2). And so on, down to the base cases FIB(0) and FIB(1).

// Work analysis: every node in the recursion tree is visited once
T₁(n) = T₁(n-1) + T₁(n-2) + Θ(1)
      = Θ(φⁿ)   // same as serial fib — exponential

// Span analysis: spawn runs in parallel with continuation
T_∞(n) = max(T_∞(n-1), T_∞(n-2)) + Θ(1)
       = T_∞(n-1) + Θ(1)   // max of n-1 and n-2 is n-1
       = Θ(n)

// For n=5: Work = 15 nodes, Span = 5, Parallelism = 3
// For n=30: Work = 2,692,537, Span = 30, Parallelism = 89,751

Span uses max, not sum. The crucial difference between serial and parallel analysis: when two tasks are spawned, the span is the maximum of their spans (they overlap in time), not the sum. This is why the span of FIB(n) is Θ(n) instead of Θ(φⁿ). The max corresponds to the critical path — the slower of the two parallel branches determines when both are done.

The Slackness Condition

When is adding more processors actually useful? When the parallelism T₁/T_∞ is much larger than P. Specifically, if T₁/T_∞ ≫ P (there is sufficient slackness), then the greedy scheduling bound T₁/P + T_∞ is dominated by T₁/P, and we get near-linear speedup:

T_P ≤ T₁/P + T_∞ ≈ T₁/P when T₁/T_∞ ≫ P

Rule of thumb: you want parallelism to be at least 10× the processor count to achieve near-linear speedup. If parallelism is 100 and P is 10, you are in good shape. If parallelism is 10 and P is 10, much of the potential is wasted on critical path overhead.

Measure	Symbol	Meaning	Analogy
Work	T₁	Total operations (serial time)	Total person-hours to build a house
Span	T_∞	Critical path (infinite processors)	Minimum time even with unlimited workers
Parallelism	T₁/T_∞	Max useful processors	How many workers can be productive at once
Speedup	T₁/T_P	Factor of improvement on P processors	How much faster the house gets built

Concept check: An algorithm has work T₁ = 10,000 and span T_∞ = 100. What is the parallelism, and what happens if you run it on 200 processors?

Parallelism 200; 200 processors give perfect speedup Parallelism 100; beyond 100 processors, extra cores sit idle (bounded by span) Parallelism 50; span doubles with more processors Cannot determine without knowing the algorithm

Chapter 2: Parallel Merge Sort

Merge sort is the poster child of divide-and-conquer. It splits the array in half, recursively sorts each half, and merges the results. Making the two recursive calls parallel is trivial — just spawn them. But the merge step is the bottleneck.

Naive Parallel Merge Sort

The obvious approach: spawn the two recursive sorts and sync before merging.

PARALLEL-MERGE-SORT(A, lo, hi):
  // Base case
  if lo ≥ hi: return
  mid = ⌊(lo + hi) / 2⌋
  spawn PARALLEL-MERGE-SORT(A, lo, mid) // left half in parallel
  PARALLEL-MERGE-SORT(A, mid+1, hi)    // right half continues
  sync                                // wait for left to finish
  MERGE(A, lo, mid, hi)               // sequential merge: O(n)

The work is O(n log n) — same as serial merge sort. But what about the span?

The two recursive calls run in parallel, so the span of the recursion is T_∞(n) = T_∞(n/2) + Θ(n). The Θ(n) is the sequential merge. Solving: T_∞ = Θ(n). That is terrible — the span equals the work order, giving parallelism of only O(log n). The merge is the bottleneck.

The merge bottleneck. In naive parallel merge sort, the merge step is sequential: it scans both halves element by element in O(n) time. This single serial operation dominates the span. No matter how many processors you throw at the problem, you cannot merge faster than O(n) with the naive approach. The entire point of parallel merge sort is fixing this bottleneck.

Parallel Merge

The key insight: we can merge two sorted arrays in parallel using a median-splitting strategy. Here is how it works:

1. Find the median of the larger array

Pick the middle element x of the larger sorted array T. This takes O(1).

↓

2. Binary search in the smaller array

Find where x would be inserted in the smaller array S using binary search. This takes O(log n) and splits S into elements ≤ x and elements > x.

↓

3. Place x and recurse

x goes to its correct position in the output. Now spawn two subproblems: merge the left parts (elements ≤ x from both arrays) and merge the right parts (elements > x from both arrays). These are independent — they write to disjoint regions of the output.

Each recursive call handles at most 3n/4 elements (since we split the larger array in half and the smaller array at some point). The span recurrence is:

T_∞(n) = T_∞(3n/4) + Θ(log n)

By the master theorem (or Akra-Bazzi), this solves to T_∞ = Θ(log² n). The work for parallel merge is Θ(n) — same as sequential merge (work-efficient).

Full Parallel Merge Sort Complexity

With the parallel merge, the span of the full merge sort becomes:

T_∞(n) = T_∞(n/2) + Θ(log² n) = Θ(log³ n)

The work remains O(n log n). So the parallelism is:

Parallelism = T₁ / T_∞ = Θ(n log n / log³ n) = Θ(n / log² n)

For n = 10 million, that is about 70,000 — meaning we can productively use tens of thousands of processors. That is real parallelism.

Algorithm	Work T₁	Span T_∞	Parallelism
Serial merge sort	Θ(n log n)	Θ(n log n)	1
Naive parallel MS	Θ(n log n)	Θ(n)	Θ(log n)
Parallel MS (parallel merge)	Θ(n log n)	Θ(log³ n)	Θ(n / log² n)

Parallel Merge Sort DAG

Watch the fork-join structure. Teal = spawn (fork). Orange = sync (join). Red path = critical path. Compare naive merge (sequential) vs parallel merge.

Select a mode to visualize

Why the Parallel Merge Uses 3n/4

This bound is crucial and often confuses students. When we pick the median of the larger array T (size m), we split T exactly in half: m/2 elements on each side. The binary search in the smaller array S (size n-m) splits S at some point, say j elements left and (n-m-j) right.

The larger subproblem gets at most m/2 + (n-m) elements. Since m ≥ n-m (T is larger), m/2 + (n-m) ≤ m/2 + m = 3m/2. Since m ≤ n, this is at most 3n/4. Each recursive call handles at most 3/4 of the total elements. This geometric shrinkage is what gives us the Θ(log² n) span.

Work-efficiency check. The parallel merge does Θ(n) work (same as sequential merge) because the total number of elements across all recursive calls sums to n at each recursion level, and there are O(log n) levels, but each element participates in at most one binary search of cost O(log n). Careful accounting: T₁ = Θ(n) for the merge, not Θ(n log n). The parallel merge sort is work-efficient.

python
import threading

def parallel_merge(T, S, output, t_lo, t_hi, s_lo, s_hi, out_lo):
    """Merge sorted T[t_lo:t_hi] and S[s_lo:s_hi] into output[out_lo:]"""
    n_t = t_hi - t_lo
    n_s = s_hi - s_lo
    if n_t < n_s:  # ensure T is the larger array
        return parallel_merge(S, T, output, s_lo, s_hi, t_lo, t_hi, out_lo)
    if n_t == 0:
        return
    if n_t + n_s <= 64:  # base case: serial merge for small inputs
        serial_merge(T, S, output, t_lo, t_hi, s_lo, s_hi, out_lo)
        return
    mid_t = (t_lo + t_hi) // 2
    x = T[mid_t]
    mid_s = bisect_left(S, x, s_lo, s_hi)  # O(log n) binary search
    out_mid = out_lo + (mid_t - t_lo) + (mid_s - s_lo)
    output[out_mid] = x
    # Spawn two independent merges
    left = threading.Thread(target=parallel_merge,
        args=(T, S, output, t_lo, mid_t, s_lo, mid_s, out_lo))
    left.start()
    parallel_merge(T, S, output, mid_t+1, t_hi, mid_s, s_hi, out_mid+1)
    left.join()  # sync

Concept check: Why does the naive parallel merge sort have span Θ(n) despite parallelizing the recursive calls?

The sequential merge step takes Θ(n) and dominates the span The spawning overhead is Θ(n) The base case is too large The two halves are not truly independent

Chapter 3: Parallel Matrix Multiply

Matrix multiplication is the backbone of scientific computing, machine learning, and graphics. Every time a neural network does a forward pass, it multiplies matrices. The standard algorithm is O(n³) for n × n matrices. How much can parallelism help?

The Standard Algorithm

Recall: C = A × B where C[i][j] = ∑_k A[i][k] · B[k][j]. The triple nested loop:

for i = 1 to n:          // row of A
  for j = 1 to n:        // column of B
    C[i][j] = 0
    for k = 1 to n:      // dot product
      C[i][j] += A[i][k] * B[k][j]

Tensor Shapes and Data Flow

Let us be concrete about what flows where. Matrix A is n × n (say 1024 × 1024 of float32, that is 4 MB). Matrix B is n × n (another 4 MB). Output C is n × n (4 MB, initialized to zeros). The triple loop visits n³ = 1,073,741,824 multiply-add operations. On a modern CPU doing 64 GFLOPS, this takes about 16 ms. On a GPU doing 100 TFLOPS, it takes about 10 μs. The difference is pure parallelism.

// Data flow for C[i][j]:
// Input: row i of A (n floats) + column j of B (n floats)
// Compute: dot product = n multiplies + (n-1) adds
// Output: single float at C[i][j]
//
// Key observation: C[i][j] depends only on row i of A
// and column j of B. No other C entries are read or written.
// Therefore ALL n² entries of C can be computed independently.

Parallelizing the Outer Loops

The simplest parallel approach: the n² output entries C[i][j] are independent of each other. Each one is a dot product of row i of A with column j of B. Spawn all n² dot products in parallel.

Each dot product takes Θ(n) work and Θ(log n) span (using parallel reduction to sum n products). The full analysis:

Work = Θ(n³), Span = Θ(log n), Parallelism = Θ(n³ / log n)

This is spectacularly parallel. For 1000 × 1000 matrices, the parallelism is about 10⁸. The span is only O(log n) because all the dot products run simultaneously, and each dot product's sum is a parallel reduction tree of depth log n.

Why log n for the dot product? Summing n numbers sequentially takes n steps. But with parallel reduction, you pair up the numbers and add each pair simultaneously: n/2 additions in step 1, n/4 in step 2, etc. After log n steps, you have one number. This is the parallel sum pattern — we will see it again in Chapter 6.

Divide-and-Conquer Matrix Multiply

CLRS presents a recursive approach: partition each n × n matrix into four n/2 × n/2 blocks and perform 8 recursive multiplications plus 4 additions. The recurrence is T(n) = 8T(n/2) + Θ(n²), giving T(n) = Θ(n³) serial.

In parallel, spawn all 8 sub-multiplies. Since the additions must wait for the multiplies (sync), the span is:

T_∞(n) = T_∞(n/2) + Θ(log² n) = Θ(log² n · log n) = Θ(log³ n)

Wait — actually, the 8 recursive calls are spawned, but pairs that write to the same output quadrant must be serialized (they add to the same block). So we group them: 4 pairs, each pair sequential. Span becomes T_∞(n) = 2 · T_∞(n/2) + Θ(log n), which solves to Θ(n).

To fix this, we can use a temporary matrix to accumulate partial products, then add in parallel. With care, the span drops to Θ(log² n). Strassen's algorithm reduces the 8 multiplications to 7, giving work Θ(n^log₂7) ≈ Θ(n^2.81) while maintaining excellent parallelism.

Parallel Matrix Multiply: Work Distribution

Each cell C[i][j] is an independent dot product. Watch how P processors divide the n² cells among themselves. Drag the slider to change matrix size.

Matrix size n 8

Processors P 4

Method	Work	Span	Parallelism
Loop parallelism	Θ(n³)	Θ(log n)	Θ(n³/log n)
D&C parallel	Θ(n³)	Θ(log² n)	Θ(n³/log² n)
Strassen parallel	Θ(n^2.81)	Θ(log² n)	Θ(n^2.81/log² n)

python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def parallel_matmul(A, B, n_workers=4):
    """Parallel matrix multiply: each worker computes a block of rows."""
    n = A.shape[0]
    C = np.zeros((n, n))
    chunk = n // n_workers

    def compute_rows(start):
        end = min(start + chunk, n)
        C[start:end] = A[start:end] @ B  # each worker writes to disjoint rows

    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = [pool.submit(compute_rows, i * chunk) for i in range(n_workers)]
        for f in futures:
            f.result()  # sync: wait for all workers
    return C

# No races: each worker writes to disjoint rows of C
# Work: O(n³) same as serial. Span: O(n³/P) per worker = O(n²·n/P)

Why GPUs dominate matrix multiply. An n×n matrix multiply has n³ independent multiply-add operations (one per element of each dot product). With parallelism of n³/log n, a 1000×1000 multiply can use ~10⁸ processors productively. A GPU with thousands of cores achieves a large fraction of this parallelism, which is why NVIDIA GPUs are the engine of modern deep learning. Every forward pass through a neural network is mostly matrix multiplies.

Concept check: You have two 1000 × 1000 matrices to multiply. The loop-parallel approach has span Θ(log n). How many time steps is that approximately?

1000 steps About 10 steps (log₂ 1000 ≈ 10) 10⁹ steps About 100 steps

Chapter 4: Race Conditions

Parallelism gives speed, but it also creates a new class of bugs that do not exist in sequential programs. The most insidious is the race condition: when two parallel tasks access the same memory location, and at least one writes to it, the result depends on who gets there first.

A Simple Example

Suppose two threads both execute x = x + 1 where x starts at 0. You expect x = 2 afterward. But the operation x = x + 1 is not atomic — it involves three steps:

1. Read

Load the current value of x into a register.

↓

2. Compute

Add 1 to the register.

↓

3. Write

Store the register value back to x.

If both threads read x = 0 before either writes, they both compute 0 + 1 = 1, and both write 1. The final value is 1 instead of 2. One increment was lost. This is a determinacy race: the program produces different results depending on the scheduling order.

Let us trace the two possible interleavings explicitly:

// SAFE interleaving (serial order):
Thread A: read x → 0
Thread A: compute 0 + 1 = 1
Thread A: write x ← 1
Thread B: read x → 1   // sees A's write
Thread B: compute 1 + 1 = 2
Thread B: write x ← 2
Final x = 2 ✓

// UNSAFE interleaving (race!):
Thread A: read x → 0
Thread B: read x → 0   // reads BEFORE A writes!
Thread A: compute 0 + 1 = 1
Thread B: compute 0 + 1 = 1
Thread A: write x ← 1
Thread B: write x ← 1   // overwrites A's result!
Final x = 1 ✗   // lost one increment

Races are silent killers. A race condition might produce the correct answer 99% of the time and fail only under specific timing conditions. This makes them nearly impossible to debug by running the program repeatedly. The failure might appear only under heavy load, on a different machine, or during a demo in front of your CEO. The only defense is to prove your program is race-free.

Types of Races

Type	Description	Example
Read-Write race	One thread reads while another writes the same location	Thread A reads x, Thread B writes x
Write-Write race	Two threads write to the same location	Both threads do x = something
Read-Read	Two threads read the same location	Not a race — reads are safe

The Cilk Model

CLRS uses the Cilk concurrency model (from MIT, later adopted by Intel). In Cilk, a program is deterministic if its parallel execution always produces the same result as its serial execution. A program with no determinacy races is guaranteed to be deterministic.

The rule is simple: if a spawned task and its continuation (the code that runs after the spawn) both access the same variable, and at least one writes, that is a race. The fix? Either make sure they access disjoint memory, or put the access after the sync.

// RACE: spawn and continuation both write x
x = 0
spawn x = x + 1 // task A reads and writes x
x = x + 1        // continuation also reads and writes x
sync
print x          // could be 1 or 2!

// FIXED: use disjoint variables, combine after sync
a = 0; b = 0
spawn a = 1       // task A writes only a
b = 1             // continuation writes only b
sync
x = a + b         // combine after sync: always 2

The simulation below shows a race condition in action. Two threads try to increment a shared counter 5 times each. Watch the interleaving — some increments are lost because both threads read the old value before either writes.

Race Condition: Lost Updates

Two threads each increment a shared counter. Watch for red conflicts where both read the same value — one increment is lost. Expected result: 10. Actual result: usually less.

Click Run to see the interleaving

Detecting Races: The Serial-Equivalence Test

How do you know if your parallel program has a race? CLRS gives a clean definition: a program has a determinacy race if two logically parallel instructions access the same memory location and at least one is a write. "Logically parallel" means there is no spawn-sync path ordering them — the scheduler could execute them in either order.

The Cilk runtime includes a race detector called Cilkscreen (analogous to ThreadSanitizer for C++ or -race flag for Go). It runs the program serially but tracks which memory locations are accessed by which logical tasks, flagging any pair that could conflict. The overhead is about 4-6× slowdown, which is acceptable for testing but not production.

Race-free ≠ lock-free. In the Cilk model, you avoid races by not sharing mutable state between logically parallel tasks. Each spawned task operates on its own data; shared state is only accessed after a sync. This is fundamentally different from the locks-and-mutexes approach of traditional threading. Locks are necessary when you must share mutable state; the Cilk model says "redesign your algorithm so you don't have to."

Practical Race Avoidance Patterns

Pattern	Description	Example
Disjoint writes	Each task writes to a separate region	Task i fills output[ichunk..(i+1)chunk]
Reduce after sync	Each task accumulates locally, merge after sync	Local counters merged at sync point
Copy-in / copy-out	Each task gets its own copy of input data	Each merge sort call gets its own buffer
Functional style	No mutation; return new values	Recursive fib returns values, never writes shared state

python
import threading

counter = 0
lock = threading.Lock()

def unsafe_increment():
    global counter
    for _ in range(100000):
        counter += 1  # RACE: read-modify-write is not atomic

def safe_increment():
    global counter
    for _ in range(100000):
        with lock:      # mutex ensures mutual exclusion
            counter += 1

# Unsafe: result varies (often ~150,000 instead of 200,000)
threads = [threading.Thread(target=unsafe_increment) for _ in range(2)]
for t in threads: t.start()
for t in threads: t.join()
print(f"Unsafe: {counter}")  # NOT 200000!

Concept check: Thread A executes spawn x = x + 1 and Thread B (the continuation) executes x = x + 1 before the sync. Both start with x = 0. What are the possible final values of x?

Always 2 Always 1 Either 1 or 2 — depends on which thread reads x first 0, 1, or 2

Chapter 5: Scheduling

You have an algorithm decomposed into a DAG of tasks. You have P processors. The question: how do you assign tasks to processors? A bad schedule wastes processors. A good schedule keeps everyone busy. The best schedule is greedy.

Greedy Scheduling

A greedy scheduler follows one rule: at every time step, if there is a ready task (all its dependencies are satisfied) and an idle processor, assign the task to that processor immediately. Never leave a processor idle when work is available.

This sounds too simple to be optimal. But it is nearly optimal:

T_P ≤ T₁/P + T_∞

This is the greedy scheduling bound. The first term is the perfect parallelism case (divide work equally). The second is the unavoidable critical path. The bound says a greedy scheduler is within a factor of 2 of optimal:

T_P^* ≥ max(T₁/P, T_∞) ≥ (T₁/P + T_∞) / 2

So T_P(greedy) ≤ 2 · T_P^*. A greedy schedule is always at most twice the optimal. In practice, it is much better.

Why greedy works. Think about it: a step in the schedule is either complete (all P processors are busy) or incomplete (some are idle). During complete steps, we are doing P units of work per step — optimal. During incomplete steps, every ready task is assigned, which means all idle processors are waiting on dependencies — we are making progress along the critical path. Complete steps eat into T₁/P. Incomplete steps eat into T_∞. Together: T₁/P + T_∞.

Work-Stealing

Greedy scheduling sounds great in theory, but in practice, a centralized scheduler that tracks all ready tasks is a bottleneck itself. The practical solution is work-stealing.

Each processor maintains a double-ended queue (deque) of tasks. When a processor spawns tasks, it pushes them onto the bottom of its own deque. When it finishes a task, it pops the next one from its deque. When its deque is empty — it has no work — it steals from a random other processor's deque (from the top, to minimize contention).

Own work available

Pop from the bottom of your deque. This is fast — no lock needed (LIFO order matches recursion).

↓

Deque empty

Steal from the top of a random victim's deque. Stealing from the top takes the largest (oldest) tasks, balancing load effectively.

The expected running time of a work-stealing scheduler is:

E[T_P] = O(T₁/P + T_∞)

Same bound as centralized greedy, but fully distributed. This is what Cilk, Intel TBB, Java ForkJoinPool, and Rust's Rayon all use.

Why Work-Stealing Steals from the Top

This is a subtle but important design choice. Each processor's deque is ordered so that the most recently spawned tasks (smallest, deepest in the recursion tree) are at the bottom, and the oldest tasks (largest, shallowest) are at the top.

When a processor works on its own deque, it pops from the bottom (LIFO). This matches the natural recursion order and keeps cache-hot data close. When a thief steals, it takes from the top (FIFO) — grabbing the largest task. Why?

Big steals, fewer steals. If you steal the largest available task, you get the most work per steal. This minimizes the number of steals (each one incurs communication overhead). The expected total number of steals across all processors is O(P · T_∞) — proportional to the number of processors times the critical path length. Since each steal costs O(1) amortized, the total stealing overhead is small.

Proof Sketch of the Greedy Bound

The proof of T_P ≤ T₁/P + T_∞ is elegant. Classify each time step as either:

Complete step

All P processors are busy. At least P units of work are completed. There can be at most T₁/P complete steps (since total work is T₁).

↓

Incomplete step

Some processor is idle. Since the scheduler is greedy, every ready task is running. The idle processor is waiting on dependencies, meaning at least one task on the critical path completes. There can be at most T_∞ incomplete steps.

Every step is either complete or incomplete. So: T_P = (complete steps) + (incomplete steps) ≤ T₁/P + T_∞. QED.

Worked Example: Scheduling a Fork-Join DAG

Consider a parallel computation with 4 independent chains of length 3 (total work = 12, span = 3).

P = 1: T₁ = 12       // all 12 tasks serial
P = 2: T₂ ≤ 12/2 + 3 = 9 // greedy bound
       Actual T₂ = 6     // 2 chains per processor, 3 steps each
P = 4: T₄ ≤ 12/4 + 3 = 6 // greedy bound
       Actual T₄ = 3     // 1 chain per processor, 3 steps
P = 8: T₈ ≤ 12/8 + 3 = 4.5
       Actual T₈ = 3     // bounded by span, 4 procs idle

// Parallelism = 12/3 = 4. Beyond P=4, no improvement.

The simulation below is the showcase. You control the number of processors and watch them execute a parallel task DAG using greedy scheduling with work-stealing. Idle processors steal from busy ones. The speedup graph updates in real time.

Greedy Scheduling with Work-Stealing

Watch P processors execute a task DAG. Teal = working. Gray = idle. Orange = stealing. Bottom graph shows speedup vs. processor count. Adjust P and task count.

Processors P 4

Tasks 48

Adjust parameters, then Run

Concept check: An algorithm has work T₁ = 1000 and span T_∞ = 50. On P = 10 processors, what does the greedy bound guarantee?

T₁₀ = 100 exactly (perfect speedup) T₁₀ ≤ 1000/10 + 50 = 150 (at most 150 time steps) T₁₀ ≤ 50 (bounded by span alone) Cannot determine without knowing the DAG structure

Chapter 6: Parallel Patterns

Certain patterns appear over and over in parallel algorithms. Once you recognize them, you can parallelize almost anything. These are the "design patterns" of parallel computing.

Map

Map applies a function to every element independently. No element depends on any other. This is embarrassingly parallel.

// Map: apply f to every element
parallel for i = 1 to n:
B[i] = f(A[i])

// Work = Θ(n · cost(f)), Span = Θ(cost(f)), Parallelism = n

Examples: applying a filter to every pixel in an image (image processing on GPU — millions of independent pixels). Computing the square of every number in an array. Tokenizing every document in a corpus. Running inference on a batch of inputs through a neural network.

Map is so common that most languages have built-in support: Python's map(), Java streams' .map(), Rust's .par_iter().map(), CUDA's thread-per-element model. When someone says a problem is "embarrassingly parallel," they mean it is a map.

Reduce

Reduce (also called fold) combines all elements into a single value using an associative operator ⊕. The key insight: because ⊕ is associative, we can group operations into a binary tree.

// Sequential reduce: O(n) work, O(n) span
result = A[1]
for i = 2 to n:
result = result ⊕ A[i]

// Parallel reduce: O(n) work, O(log n) span
// Pair up elements, combine, repeat
// Level 0: n elements
// Level 1: n/2 partial results
// Level 2: n/4 partial results
// ...
// Level log n: 1 final result

Work is Θ(n) (same as sequential). Span is Θ(log n) (tree depth). Parallelism is Θ(n / log n).

Scan (Parallel Prefix Sum)

Scan computes all prefix sums: given [a₁, a₂, ..., a_n], produce [a₁, a₁⊕a₂, a₁⊕a₂⊕a₃, ...]. This seems inherently sequential — each output depends on all previous outputs. But there is a brilliant parallel algorithm.

The Blelloch scan uses two phases:

1. Up-sweep (Reduce)

Compute partial sums bottom-up. At each level, pair up adjacent elements and sum them. After log n levels, the root contains the total sum. This is just parallel reduce.

↓

2. Down-sweep

Push partial sums back down the tree. Each node passes its value to its right child and the sum of its value and its left child to its left child. After log n levels, every position has its prefix sum.

Work = Θ(n), Span = Θ(log n), Parallelism = Θ(n / log n)

Scan is the Swiss army knife of parallel computing. It is used in: parallel sorting (radix sort), stream compaction (filtering), polynomial evaluation, solving tridiagonal systems, and even building parallel data structures.

Worked Example: Prefix Sum of [3, 1, 4, 1, 5, 9, 2, 6]

The expected output (inclusive scan) is [3, 4, 8, 9, 14, 23, 25, 31]. Let us trace the Blelloch algorithm step by step.

// Initial: [3, 1, 4, 1, 5, 9, 2, 6]

// UP-SWEEP (reduce phase)
Level 0: pair (0,1), (2,3), (4,5), (6,7)
  [3, 4, 4, 5, 5, 14, 2, 8]
Level 1: pair (1,3), (5,7)
  [3, 4, 4, 9, 5, 14, 2, 22]
Level 2: pair (3,7)
  [3, 4, 4, 9, 5, 14, 2, 31]

// Root now has total sum (31). Set to 0 for exclusive scan:
  [3, 4, 4, 9, 5, 14, 2, 0]

// DOWN-SWEEP (distribute phase)
Level 2: swap and add at (3,7)
  [3, 4, 4, 0, 5, 14, 2, 9]
Level 1: swap and add at (1,3), (5,7)
  [3, 0, 4, 4, 5, 9, 2, 23]
Level 0: swap and add at (0,1), (2,3), (4,5), (6,7)
  [0, 3, 4, 8, 9, 14, 23, 25]

// Exclusive prefix sums! Add original for inclusive.

Why scan matters for GPU computing. Nearly every GPU algorithm uses parallel scan as a building block. Need to compact an array (remove elements where mask=0)? Scan the mask, use the result as output indices. Need to sort on GPU? Radix sort uses scan per bit. Need to allocate variable-length output? Scan the lengths to get starting offsets. Understanding scan is the key to understanding GPU programming.

Parallel Prefix Sum (Blelloch Scan)

Watch the up-sweep build partial sums, then the down-sweep distribute prefix sums. Orange = active at current step. Teal = completed.

Click Run or Step to watch prefix sum

Fork-Join

The fork-join pattern is what we have been studying all along: recursively split a problem, solve subproblems in parallel, combine results. It is the general case that subsumes map and reduce.

// Fork-join pseudocode
PARALLEL-SUM(A, lo, hi):
  if hi - lo ≤ GRAIN_SIZE:
    return sequential_sum(A, lo, hi)
  mid = (lo + hi) / 2
  left = spawn PARALLEL-SUM(A, lo, mid)
  right = PARALLEL-SUM(A, mid, hi)
  sync
  return left + right

// Work = Θ(n), Span = Θ(log n)
// GRAIN_SIZE is crucial for performance!
// Too small (1): spawn overhead dominates
// Too large (n): no parallelism
// Sweet spot: ~1000-10000 elements per task

Grain size matters in practice. CLRS abstracts away spawn/sync overhead as O(1). In reality, spawning a task takes ~100-1000 ns (deque push, possible cache miss). For fine-grained work like adding two numbers (~1 ns), spawning per element is 100-1000x slower than sequential execution! The solution: set a grain size (also called cutoff) below which you switch to sequential. Cilk Plus, TBB, and Rayon all support this. The optimal grain size is typically the point where the sequential work per task equals about 10,000 nanoseconds.

Map-Reduce

Combine map and reduce and you get MapReduce — Google's framework for processing petabytes of data across thousands of machines. The map phase processes data in parallel. A shuffle phase groups results by key. The reduce phase combines each group.

python
from functools import reduce
from concurrent.futures import ProcessPoolExecutor

def word_count_map(document):
    """Map: emit (word, 1) for each word"""
    counts = {}
    for word in document.split():
        counts[word] = counts.get(word, 0) + 1
    return counts

def word_count_reduce(a, b):
    """Reduce: merge two count dicts"""
    result = dict(a)
    for k, v in b.items():
        result[k] = result.get(k, 0) + v
    return result

documents = ["the cat sat", "the dog sat", "the cat ran"]
with ProcessPoolExecutor() as pool:
    mapped = list(pool.map(word_count_map, documents))  # parallel map
total = reduce(word_count_reduce, mapped)               # reduce
# {'the': 3, 'cat': 2, 'sat': 2, 'dog': 1, 'ran': 1}

Pattern	Input → Output	Work	Span	Example
Map	[a₁..a_n] → [f(a₁)..f(a_n)]	Θ(n)	Θ(1)	Image filter
Reduce	[a₁..a_n] → a₁⊕...⊕a_n	Θ(n)	Θ(log n)	Sum, max
Scan	[a₁..a_n] → [prefix sums]	Θ(n)	Θ(log n)	Prefix sum, rank
Map-Reduce	Collection → aggregation	varies	varies	Word count

Concept check: You need to sum 1 million numbers in parallel. Using parallel reduce (binary tree), what is the span?

O(1,000,000) — must add them all sequentially O(log 1,000,000) ≈ O(20) — tree depth is logarithmic O(√1,000,000) = O(1000) O(1) with enough processors

Chapter 7: Modern Parallel Computing

The fork-join model from CLRS is the theoretical foundation. But in practice, parallel computing spans a spectrum of hardware and programming models. Understanding where each one fits is essential for choosing the right tool.

CPU Parallelism: Frameworks and Runtimes

The fork-join model has been implemented in many languages and frameworks. Each one makes slightly different trade-offs between ease of use, overhead, and expressiveness. Here is the landscape:

Technology	Model	Typical P	Best For
OpenMP	Pragma annotations on loops	4-128 cores	Scientific computing, legacy C/Fortran
Intel TBB	Task-based, work-stealing	4-128 cores	C++ applications, dynamic tasks
Rust Rayon	Fork-join with data parallelism	4-128 cores	Safe parallelism (no races by construction)
Go goroutines	M:N green threads	4-128 cores	I/O-bound concurrency, servers
Java ForkJoinPool	Work-stealing	4-128 cores	JVM applications

GPU Parallelism (CUDA / SIMT)

Among these, Rust Rayon deserves special mention. Rust's ownership system guarantees at compile time that parallel tasks cannot have data races. If your code compiles, it is race-free. This is the Cilk dream realized through a type system: the compiler enforces disjoint writes without runtime overhead. Changing .iter() to .par_iter() parallelizes a loop, and if it compiles, it is safe.

GPU Parallelism (CUDA / SIMT)

A GPU is a massively parallel processor with thousands of simple cores. NVIDIA's CUDA programming model uses SIMT (Single Instruction, Multiple Threads): groups of 32 threads (a warp) execute the same instruction simultaneously on different data.

GPUs excel at the map pattern: apply the same function to millions of data points. Matrix multiplication, convolution, and transformer attention are all map-heavy, which is why GPUs dominate deep learning.

A typical NVIDIA H100 GPU has 16,896 CUDA cores and 528 Tensor Cores. Peak throughput: 3,958 TFLOPS for FP8. Compare to a 64-core AMD EPYC CPU at ~10 TFLOPS. That is a 400× difference for data-parallel work. The tradeoff: each GPU core is simple (no branch prediction, small cache), so complex branchy code runs slower on GPU than CPU.

python
# CUDA kernel (conceptual — runs on GPU)
# Each thread handles one output element
# This is the MAP pattern: same function, different data

# Launch config: 256 threads per block, n/256 blocks
# Total threads = n (one per element)

def vector_add_kernel(a, b, c, n):
    idx = blockIdx.x * blockDim.x + threadIdx.x
    if idx < n:
        c[idx] = a[idx] + b[idx]  # each thread does ONE add

# Work = n, Span = 1 (all threads execute in parallel)
# Parallelism = n (embarrassingly parallel)

CPU vs GPU mental model. A CPU is like 16 PhDs who can each solve complex problems independently. A GPU is like 10,000 assembly line workers who all do the same task in lockstep. For complex branchy code, CPUs win. For massive data-parallel work (same operation, millions of elements), GPUs win by 100× or more.

Amdahl's Law

In real programs, some fraction s of the work is inherently sequential (initialization, I/O, synchronization). Amdahl's Law says the maximum speedup is:

Speedup(P) ≤ 1 / (s + (1 - s) / P)

As P → ∞, the speedup approaches 1/s. If 5% of your program is sequential (s = 0.05), the maximum speedup is 20× no matter how many processors you add. This is sobering: a tiny serial fraction caps your speedup. Even with a million processors, you can never go more than 20× faster. The implication: to scale to large P, you must relentlessly minimize the serial fraction.

Amdahl's Law is about software, not hardware. The serial fraction s is a property of your algorithm, not your hardware. Buying more cores does not change s. The only way to increase speedup is to redesign the algorithm to reduce s. This is why parallel algorithm design (this chapter) matters more than parallel hardware: hardware gives you more P, but only algorithmic redesign can lower s.

Gustafson's Law

Amdahl's Law assumes a fixed problem size. Gustafson's Law (1988) takes a different perspective: as you add processors, you also increase the problem size (bigger models, more data). The serial fraction stays constant in absolute time, but the parallel fraction grows. Scaled speedup is:

Speedup(P) = s + P · (1 - s)

This is linear in P. In practice, Gustafson's Law explains why GPU clusters with 10,000 GPUs are useful: you train bigger models, not the same model faster.

Worked Example: Amdahl vs Gustafson

Suppose you have a program that takes 100 seconds on 1 processor: 10 seconds serial, 90 seconds parallelizable. So s = 0.1.

// Amdahl (fixed problem size):
P = 1: Speedup = 1/(0.1 + 0.9/1) = 1.0x (100s)
P = 10: Speedup = 1/(0.1 + 0.9/10) = 5.3x (19s)
P = 100: Speedup = 1/(0.1 + 0.9/100) = 9.2x (10.9s)
P = ∞: Speedup = 1/0.1 = 10.0x (10s)

// Gustafson (scale problem with processors):
// With 100 processors, we solve a 100x bigger problem
// Serial part: still 10s. Parallel part: 90s (on 100 procs)
// On 1 proc, parallel part would take 9000s
P = 100: Speedup = 0.1 + 100 * 0.9 = 90.1x

// Same time, 90x bigger problem = 90x more science done!

The Amdahl vs Gustafson mindset. Amdahl asks: "How fast can I solve THIS problem?" Gustafson asks: "How much bigger a problem can I solve in the SAME time?" In high-performance computing, the Gustafson perspective dominates. Climate scientists do not want the same weather model faster — they want a higher-resolution model in the same 6-hour window. Deep learning researchers do not train GPT-3 faster on more GPUs — they train GPT-4 (a bigger model) in the same wall-clock time.

Amdahl's Law Calculator

Adjust the serial fraction to see how it caps maximum speedup. Left: speedup vs processors. The dashed line shows ideal (linear) speedup for comparison.

Serial fraction s 0.10

Max processors 64

python
def amdahl_speedup(s, P):
    """Amdahl's Law: s = serial fraction, P = processors"""
    return 1.0 / (s + (1 - s) / P)

def gustafson_speedup(s, P):
    """Gustafson's Law: scaled speedup"""
    return s + P * (1 - s)

# Example: 10% serial fraction
for P in [1, 4, 16, 64, 256, 1024]:
    print(f"P={P:4d}: Amdahl={amdahl_speedup(0.1, P):.1f}x, Gustafson={gustafson_speedup(0.1, P):.1f}x")
# P=   1: Amdahl=1.0x, Gustafson=1.0x
# P=   4: Amdahl=3.1x, Gustafson=3.7x
# P=  16: Amdahl=6.4x, Gustafson=14.5x
# P=  64: Amdahl=8.8x, Gustafson=57.7x
# P= 256: Amdahl=9.7x, Gustafson=230.5x
# P=1024: Amdahl=9.9x, Gustafson=921.7x  ← Amdahl caps at 10x!

Concept check: Your program has a serial fraction of 2% (s = 0.02). By Amdahl's Law, what is the maximum possible speedup with unlimited processors?

2× 50× — limit is 1/s = 1/0.02 = 50 Unlimited — more processors always helps 100×

Chapter 8: Interview Arsenal

Cheat Sheet: Work and Span of Common Algorithms

Algorithm	Work T₁	Span T_∞	Parallelism
Parallel sum (reduce)	Θ(n)	Θ(log n)	Θ(n/log n)
Parallel prefix sum (scan)	Θ(n)	Θ(log n)	Θ(n/log n)
Parallel merge sort	Θ(n log n)	Θ(log³ n)	Θ(n/log² n)
Parallel matrix multiply	Θ(n³)	Θ(log n)	Θ(n³/log n)
Parallel BFS	Θ(V+E)	Θ(D · log V)	Θ((V+E)/(D log V))
Parallel map	Θ(n)	Θ(1)	Θ(n)
Parallel for loop	Θ(n · body)	Θ(log n + body)	Θ(n/log n)

Design Questions

"Parallelize this algorithm" — the most common parallel computing interview question. The approach is always the same: (1) identify the work (total operations), (2) draw the dependency graph, (3) find the critical path (span), (4) choose spawn/sync points to expose parallelism, (5) check for races.

Q: Given an n-node binary tree, compute the size of every subtree in parallel.

A: This is a post-order traversal. For each node, subtree_size = 1 + left_size + right_size. The two recursive calls (left and right children) are independent — spawn them. After sync, add 1 + left + right. Work: Θ(n). Span: Θ(height). For a balanced tree, span is Θ(log n), giving parallelism Θ(n / log n). For a degenerate (linked-list) tree, span is Θ(n) and parallelism is 1 — no speedup possible.

PARALLEL-SUBTREE-SIZE(node):
  if node = NIL: return 0
  left = spawn PARALLEL-SUBTREE-SIZE(node.left)
  right = PARALLEL-SUBTREE-SIZE(node.right)
  sync
  node.size = 1 + left + right
  return node.size

// Work: T₁(n) = Θ(n)
// Span: T∞(n) = T∞(max child) + O(1) = Θ(height)
// No races: each node writes only to its own .size field

Q: Parallelize counting sort for n integers in range [0, k].

A: The four-phase approach uses the parallel patterns from Chapter 6:

Phase 1: Local histogram (Map)

Partition input into P chunks. Each processor builds a local histogram of its chunk: count occurrences of each value 0..k. Time: Θ(n/P). No races — each processor reads its own chunk and writes its own histogram.

↓

Phase 2: Merge histograms (Reduce)

Reduce the P local histograms into one global histogram using parallel reduce on each of the k+1 buckets. Work: Θ(k · P). Span: Θ(k · log P).

↓

Phase 3: Prefix sum (Scan)

Parallel prefix sum on the global histogram to compute output positions. Work: Θ(k). Span: Θ(log k). Each bucket's starting index in the output is the sum of all smaller buckets.

↓

Phase 4: Scatter (Map)

Each processor writes its elements to the correct output positions using the prefix sums as offsets. No races — elements with different values go to disjoint output regions.

Total work: Θ(n + k · P). Span: Θ(n/P + k · log P + log k). For k = O(n) and P = O(n/log n), this is efficient.

Q: You have a function that processes web pages. Each page takes 50-200ms to fetch and 1ms to parse. How do you parallelize?

Q: You need to compute the product of all elements in an array. Can you parallelize it?

A: Yes! Multiplication is associative and commutative, so it is a reduce operation. Build a binary reduction tree: pair up elements, multiply pairs in parallel, repeat. Work: Θ(n). Span: Θ(log n). Same as parallel sum but with × instead of +. This generalizes to any associative operation: min, max, AND, OR, XOR, GCD, matrix multiplication (associative but not commutative), string concatenation. The only requirement is associativity: (a ⊕ b) ⊕ c = a ⊕ (b ⊕ c). Commutativity is not needed.

Q: Can you parallelize Dijkstra's shortest-path algorithm?

A: This is a tricky case. Dijkstra's algorithm processes vertices in order of distance, and each iteration depends on the previous (the closest unvisited vertex changes after each relaxation). The algorithm is inherently sequential in structure. However, the relaxation step (updating neighbors) can be parallelized: if the current vertex has d outgoing edges, we can relax all d neighbors in parallel. Work: O((V + E) log V). Span: O(V · (log V + Δ)), where Δ is the max degree. Parallelism is only O(E / (V · Δ)) — modest. For better parallelism, use Δ-stepping or parallel BFS-based approaches.

Q: You have a function that processes web pages. Each page takes 50-200ms to fetch and 1ms to parse. How do you parallelize?

A: This is I/O-bound, not compute-bound. The serial fraction is tiny (parsing is negligible). Use asynchronous I/O (asyncio in Python, goroutines in Go) rather than threads. Launch all fetches concurrently. Amdahl's Law says speedup ≈ P for I/O-bound work because s ≈ 0. With 100 concurrent fetches, you approach 100× speedup. Key insight: for I/O-bound work, you want concurrency (overlap waiting), not parallelism (multiple CPUs). Python's GIL does not matter here because threads spend 99% of their time waiting, not computing. Use asyncio for thousands of concurrent fetches with a single thread.

Quick Work-Span Derivations

These derivations come up constantly in interviews. Practice them until they are automatic.

Parallel for loop over n iterations, each costing O(1):

// parallel for i = 1 to n: A[i] = f(i)
Work = n · O(1) = Θ(n)
Span = O(log n) + O(1) = Θ(log n)
// log n comes from the recursive splitting of the loop range
// split [1,n] into [1,n/2] and [n/2+1,n], recurse, sync
Parallelism = Θ(n / log n)

Two nested parallel for loops (n × n):

// parallel for i = 1 to n:
// parallel for j = 1 to n:
// C[i][j] = compute(i, j) — O(1) per cell
Work = n² · O(1) = Θ(n²)
Span = O(log n) + O(log n) + O(1) = Θ(log n)
  // outer loop split: log n levels, inner loop split: log n levels
  // but inner runs WITHIN each outer iteration — they nest
  // span of outer = log n (split) + span of body
  // span of body = log n (inner split) + O(1)
  // total = log n + log n = O(log n)
Parallelism = Θ(n² / log n)

Parallel for with O(n) body:

// parallel for i = 1 to n:
// B[i] = sum(A[i][1..n]) — O(n) sequential sum
Work = n · O(n) = Θ(n²)
Span = O(log n) + O(n) = Θ(n) // body dominates!
Parallelism = Θ(n)

// FIX: parallelize the inner sum too (reduce)
// parallel for i = 1 to n:
// B[i] = parallel_sum(A[i][1..n]) — O(n) work, O(log n) span
Span = O(log n) + O(log n) = Θ(log n)
Parallelism = Θ(n² / log n) // much better!

Coding Drills

python
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
import numpy as np

# Drill 1: Parallel map
def parallel_map(func, data, n_workers=4):
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(func, data))

# Drill 2: Parallel reduce (tree-based)
def parallel_reduce(arr, op):
    if len(arr) == 1: return arr[0]
    mid = len(arr) // 2
    with ThreadPoolExecutor(max_workers=2) as pool:
        left = pool.submit(parallel_reduce, arr[:mid], op)
        right = pool.submit(parallel_reduce, arr[mid:], op)
        return op(left.result(), right.result())

# Drill 3: Parallel prefix sum (sequential, then parallelize)
def prefix_sum(arr):
    """Sequential prefix sum for reference"""
    result = [0] * len(arr)
    result[0] = arr[0]
    for i in range(1, len(arr)):
        result[i] = result[i-1] + arr[i]
    return result

# Drill 4: Work-span analysis
def analyze(work, span, P):
    parallelism = work / span
    tp_bound = max(work / P, span)
    greedy_bound = work / P + span
    speedup = work / greedy_bound
    print(f"Work={work}, Span={span}, P={P}")
    print(f"Parallelism={parallelism:.0f}")
    print(f"Lower bound: {tp_bound:.0f}")
    print(f"Greedy bound: {greedy_bound:.0f}")
    print(f"Speedup: {speedup:.1f}x")

analyze(work=1000000, span=20, P=64)
# Work=1000000, Span=20, P=64
# Parallelism=50000
# Lower bound: 15625
# Greedy bound: 15645
# Speedup: 63.9x  ← near-linear speedup!

Common Interview Mistakes

Mistake	Why It Is Wrong	Correct Thinking
"Just parallelize the inner loop"	Inner loop iterations may have dependencies (loop-carried deps)	Analyze the dependency graph first. Only parallelize truly independent iterations.
"More threads = more speed"	Ignores Amdahl's Law and overhead	Speedup is bounded by serial fraction and limited by T_∞
"Use a lock to fix the race"	Locks serialize access, defeating parallelism	Redesign to avoid shared mutable state. Use disjoint writes + reduce.
"Fork every element"	Spawn overhead per element dominates for small work	Use coarsening: fork into chunks (e.g., 1024 elements per task). This is the grain size optimization.
"Span = depth of recursion tree"	Span includes combine steps, not just recursion	Span = depth × cost-per-level + any serial combine work

The Five-Step Parallelization Recipe

1. Serial algorithm

Write the best serial algorithm first. Verify correctness. Measure its work T₁.

↓

2. Dependency graph

Draw the DAG. Which operations depend on which? Look for independent subproblems.

↓

3. Spawn & sync

Insert spawn where independent work exists. Add sync where results are needed. Check for races.

↓

4. Analyze work & span

Compute T₁ (should match serial). Compute T_∞ using max for parallel branches. Calculate parallelism.

↓

5. Optimize

If parallelism is low, look for a better parallel algorithm (e.g., parallel merge instead of sequential merge). If work increased, check work-efficiency.

Key Formulas to Memorize

T_P ≥ max(T₁/P, T_∞) lower bound

T_P ≤ T₁/P + T_∞ greedy bound

Speedup(P) ≤ 1/(s + (1-s)/P) Amdahl's Law

Speedup(P) = s + P(1-s) Gustafson's Law

Design question: You need to find the maximum element in an array of 1 million numbers on 16 processors. What is the work, span, and expected speedup?

Work Θ(n), span Θ(log n) ≈ 20, speedup ≈ 16× (near-linear since parallelism ≈ 50,000 ≫ 16) Work Θ(n), span Θ(n), speedup ≈ 1× Work Θ(n log n), span Θ(n), speedup ≈ log n Cannot parallelize — must scan sequentially

Chapter 9: Beyond & Connections

Multithreaded algorithms are one layer in a deep stack of parallel computing. The fork-join model we studied is the cleanest and most analyzable, but real systems combine multiple paradigms.

What We Covered vs. What Exists

Topic	This Chapter	In Practice
Model	Fork-join (Cilk)	+ message passing (MPI), BSP, dataflow, actor model
Scheduling	Greedy, work-stealing	+ gang scheduling, space sharing, OS-level CFS
Hardware	Shared memory multicore	+ GPU (SIMT), distributed clusters, TPUs, FPGAs
Correctness	Determinacy races	+ locks, semaphores, monitors, lock-free algorithms, transactional memory
Analysis	Work-span	+ cache complexity, communication complexity, PRAM models

Limitations of the Fork-Join Model

The CLRS model assumes shared memory with uniform access time (any processor can read any memory location equally fast). Real machines have caches, NUMA (Non-Uniform Memory Access), and memory hierarchies. A parallel algorithm that is theoretically optimal can be slow in practice because of cache misses — data is not where the processor needs it. Cache-oblivious algorithms (also in CLRS, Ch 27 problems) address this.

Additionally, the model assumes that spawning a task is free. In practice, each spawn involves allocating a task descriptor, pushing to a deque, and potentially stealing involves inter-core communication. For very fine-grained tasks (e.g., adding two numbers), the spawn overhead dwarfs the computation. The practical fix is coarsening: use sequential execution below a grain size threshold, and only fork for sufficiently large subproblems. Cilk Plus and Rayon handle this automatically with adaptive grain sizes.

The model also ignores communication cost. On distributed systems (multiple machines connected by a network), sending data between nodes takes orders of magnitude longer than a local memory access. Algorithms designed for shared memory do not work well when the "shared" memory is actually a network.

Connections to Other Chapters

Ch 2: Getting Started

Merge sort is the starting point for parallel merge sort. The sequential algorithm's structure — divide in half, merge — maps directly to fork-join.

↓

Ch 4: Divide & Conquer

Every D&C algorithm is a candidate for parallelism. The recursive calls that D&C makes on independent subproblems can be spawned. The Master Theorem analyzes work; adding span analysis gives the full parallel picture.

↓

Ch 28: Matrix Operations

Matrix multiplication is the most important computational kernel in scientific computing and deep learning. Parallel matrix multiply (Ch 27) builds on the sequential methods of Ch 28.

The Parallel Computing Stack

Parallelism exists at every level of the computing stack. Chapter 27 focuses on algorithmic parallelism (the top level), but it helps to know what is below:

Level	Mechanism	Who Controls It
Bit-level	64-bit ALU processes 64 bits at once	Hardware
Instruction-level (ILP)	CPU pipeline, out-of-order execution, superscalar	Hardware + compiler
SIMD/Vector	AVX-512: one instruction on 16 floats at once	Compiler + intrinsics
Thread-level (TLP)	Multiple threads on multiple cores	Programmer (this chapter)
Node-level	Multiple machines, MPI, gRPC	Programmer + framework

A well-optimized matrix multiply exploits ALL five levels simultaneously: SIMD registers process 16 floats per instruction, the CPU pipeline overlaps multiply and add, multiple cores each handle a tile of the output matrix, and across a cluster, each node handles a block of the computation. The theoretical analysis from CLRS (work/span) captures the thread-level parallelism; the other levels are implementation details that affect the constants.

Where Parallel Computing Goes Next

Wafer-scale computing. Cerebras builds a single chip that is an entire 300mm silicon wafer — 850,000 cores, 40 GB of on-chip memory. No off-chip communication. The fork-join model maps naturally to this architecture. The key advantage: span-limited computations (like the layers of a neural network) benefit from zero inter-chip latency. Communication cost, normally the bottleneck in distributed systems, becomes negligible.

Chiplet architectures. AMD's EPYC processors use chiplets — multiple small dies connected on a package. A 128-core EPYC is actually 8 chiplets of 16 cores each. Memory access is non-uniform (NUMA): accessing your own chiplet's memory is fast; accessing another chiplet's memory is 2-3× slower. Algorithms from this chapter assume uniform memory access, but NUMA-aware scheduling can yield significant speedups by keeping data close to the processor that uses it.

Quantum parallelism. A quantum computer with n qubits can represent 2ⁿ states simultaneously. Shor's algorithm for factoring exploits this "parallelism" — though it is fundamentally different from classical parallelism (you cannot observe all 2ⁿ states; you must cleverly extract the answer through interference). Quantum computing does not replace classical parallelism; it addresses fundamentally different problem classes.

Photonic computing. Startups like Lightmatter and Luminous are building processors that use light instead of electrons. A photonic matrix multiplier computes in the time it takes light to pass through a chip (~0.1 ns), regardless of matrix size. This is constant-span matrix multiply — T_∞ = O(1). The catch: precision is limited to about 4-8 bits, making it suitable for inference but not training.

Neural network training. Training GPT-4 required ~25,000 A100 GPUs for 90 days. The parallelism is a mix of three strategies, each mapping directly to concepts from this chapter:

Strategy	CLRS Concept	How It Works
Data parallelism	Map pattern	Each GPU processes a different batch of data, computing gradients independently. Gradients are averaged (reduce) across GPUs. Work-efficient: same total compute as serial training.
Tensor parallelism	Parallel matrix multiply	Large weight matrices are split across GPUs. Each GPU computes a slice of the matrix multiply. Results are gathered (all-reduce). Span limited by communication latency.
Pipeline parallelism	Pipelining (overlap)	Different layers of the network on different GPUs. While GPU 1 processes layer 1 of batch 2, GPU 2 processes layer 2 of batch 1. Reduces idle time (the "bubble") but adds latency per sample.

The critical path through the training pipeline determines the minimum training time. In practice, communication between GPUs (NVLink at 900 GB/s, InfiniBand at 400 Gb/s) is the main bottleneck — a manifestation of span that the pure fork-join model does not capture.

In the work-span framework: training has work proportional to (model size) × (dataset size) × (epochs). The span is proportional to (layers) × (communication rounds per step). Data parallelism reduces work per GPU but adds communication (all-reduce of gradients). Tensor parallelism reduces work per GPU but adds inter-GPU communication per layer. The art of distributed training is minimizing span while keeping work-efficiency close to 1.0.

The Fork-Join Model in Real Languages

python
# Rust (Rayon) — the closest to Cilk's model
# use rayon::prelude::*;
# fn parallel_sum(arr: &[f64]) -> f64 {
#     arr.par_iter().sum()  // parallel reduce
# }

# Go — goroutines (M:N threading, not fork-join but similar)
# func parallelSum(arr []float64) float64 {
#     ch := make(chan float64, 2)
#     mid := len(arr) / 2
#     go func() { ch <- sum(arr[:mid]) }()  // spawn
#     go func() { ch <- sum(arr[mid:]) }()  // spawn
#     return <-ch + <-ch                     // sync
# }

# Java — ForkJoinPool (work-stealing scheduler)
# class SumTask extends RecursiveTask<Long> {
#     protected Long compute() {
#         if (hi - lo < THRESHOLD) return seqSum(arr, lo, hi);
#         int mid = (lo + hi) / 2;
#         SumTask left = new SumTask(arr, lo, mid);
#         left.fork();                    // spawn
#         long right = new SumTask(arr, mid, hi).compute();
#         return left.join() + right;     // sync
#     }
# }

# Python — concurrent.futures (limited by GIL for CPU work)
from concurrent.futures import ProcessPoolExecutor
def parallel_sum(arr, n_workers=4):
    chunk = len(arr) // n_workers
    chunks = [arr[i*chunk:(i+1)*chunk] for i in range(n_workers)]
    with ProcessPoolExecutor(n_workers) as pool:
        return sum(pool.map(sum, chunks))

The legacy of Chapter 27. Fork-join parallelism is not a theoretical curiosity. Every time you use parallel_for in C++, .par_iter() in Rust, or ForkJoinPool in Java, you are using the exact model from this chapter. Work-stealing is the scheduler behind all of them. The work-span framework is how you reason about their performance. This theory is the foundation of every practical parallel system in use today.

Summary of Key Ideas

#	Concept	One-Line Summary
1	Fork-Join Model	Spawn parallel tasks, sync when you need results
2	Work (T₁)	Total operations = serial running time
3	Span (T_∞)	Critical path = lower bound with infinite processors
4	Parallelism	T₁/T_∞ = max useful processors
5	Greedy bound	T_P ≤ T₁/P + T_∞
6	Work-stealing	Idle processors steal large tasks from busy ones
7	Parallel merge	Median splitting + binary search reduces merge span to O(log² n)
8	Determinacy races	Two parallel accesses to same location, one is write = nondeterminism
9	Amdahl's Law	Serial fraction s caps speedup at 1/s
10	Scan (prefix sum)	The Swiss army knife: O(n) work, O(log n) span

"The way to go fast is not to do less work, but to do work at the same time." — Charles Leiserson (Cilk co-inventor, CLRS co-author)