Introduction to Algorithms (CLRS) — Chapter 23

Minimum Spanning Trees

Kruskal, Prim, Union-Find — connecting everything at minimum cost.

Prerequisites: Graphs + Greedy intuition. That's it.
10
Chapters
9+
Simulations
5
Interview Dimensions

Chapter 0: The Problem

You are an engineer for a telecommunications company. Five towns need to be connected with fiber-optic cable. You have surveyed every possible route between every pair of towns and know the exact cost of laying cable along each route. Your budget is tight. You need every town to be reachable from every other town, but you want to spend as little money as possible on cable.

There are potentially ten routes between five towns (every pair). You do not need all ten. In fact, you know from graph theory that connecting five towns requires exactly four cables — one fewer than the number of towns. Any fewer and some town is disconnected. Any more and you are paying for a redundant connection.

But which four? There are many possible sets of four edges that keep all towns connected. Some cost much more than others. You want the cheapest set. This is the minimum spanning tree problem.

Key Definitions

A spanning tree of a connected graph G = (V, E) is a subgraph that includes every vertex and is a tree — meaning it is connected and has no cycles. A spanning tree always has exactly |V| - 1 edges. Think of it as the skeleton of the graph: the minimum wiring to keep everything connected.

A minimum spanning tree (MST) is a spanning tree whose total edge weight is as small as possible. If all edge weights are distinct, the MST is unique. If some weights tie, there may be multiple MSTs, but they all have the same total weight.

The core insight. An MST connects all vertices with the minimum total cost. It is NOT a shortest-path tree — the path between two vertices in the MST may not be the shortest path in the original graph. MST minimizes total wire. Shortest-path trees minimize individual routes. Different goals, different algorithms, different answers.

See It: Cities and Cable

The canvas below shows five cities with all possible connections. Each edge has a cost. Click "Find MST" to watch the minimum spanning tree emerge. Notice: we use only 4 of the 10 possible edges, and their total cost is the smallest possible.

City Network — Find the MST

Click "Find MST" to highlight the cheapest set of edges connecting all cities. Click "Reset" to try again.

Total possible edges: 10. MST needs exactly 4.

Why Not Just Pick the Cheapest Edges?

A tempting first idea: sort all edges by weight and pick the four cheapest. But this can fail. The four cheapest edges might all connect the same three towns, leaving two towns stranded. Or the four cheapest might form a cycle, wasting a connection.

Let us see this concretely. Suppose five cities have these edge costs: A-B=1, A-C=1, B-C=2, C-D=5, D-E=6, A-D=7, B-E=8, A-E=9, C-E=10. The four cheapest edges are A-B(1), A-C(1), B-C(2), C-D(5). But A-B, A-C, and B-C form a triangle — a cycle. Adding all three wastes one edge on a redundant connection. D and E both end up disconnected or poorly served.

We need a strategy that is greedy — always picking cheap edges — but also careful — never creating a cycle, never leaving a town disconnected. This is exactly what Kruskal's and Prim's algorithms do, and the theoretical foundation for why they work is the cut property, which we will prove in Chapter 1.

How Many Spanning Trees Exist?

Cayley's formula tells us a complete graph on n vertices has nn-2 spanning trees. For 5 vertices: 53 = 125 spanning trees. For 10 vertices: 108 = 100 million. For 20 vertices: 2018 ≈ 2.6 × 1023. Brute force is impossible even for small graphs. We need an algorithm that finds the minimum without examining all candidates. Greedy does this in O(E log E) time.

The three constraints. A valid MST must satisfy: (1) includes every vertex, (2) is connected (one piece), (3) has no cycles. Together, these force exactly V - 1 edges. The "minimum" part means the total weight of these V - 1 edges is as small as possible.

Tree Properties: A Quick Refresher

A tree on n vertices is a connected acyclic graph. The following statements are ALL equivalent (any one implies all the others):

#Property
1T is connected and has no cycles
2T is connected and has exactly n - 1 edges
3T has no cycles and has exactly n - 1 edges
4There is exactly one path between every pair of vertices
5T is connected, but removing any edge disconnects it
6T has no cycles, but adding any edge creates exactly one cycle

Property 6 is crucial for MST algorithms. When Kruskal considers adding an edge and both endpoints are in the same component, that edge would create a cycle (property 6). When we add an edge to an MST for the exchange argument, it creates exactly one cycle (property 6 again). These properties are the hidden scaffolding behind every MST proof.

Weighted Graphs

An MST only makes sense for weighted graphs — graphs where each edge has a numerical weight (cost, distance, bandwidth, etc.). For unweighted graphs, all spanning trees have the same "cost" (V - 1 edges), so any spanning tree is an MST. The interesting case is when weights differ.

Weights can be negative. The MST still makes sense — we minimize total weight, so negative edges are "free money" (they reduce cost). All MST algorithms work correctly with negative weights. This is different from shortest paths, where negative weights cause problems for Dijkstra (but not Bellman-Ford).

MST Is Not Shortest-Path Tree

This is the single most common misconception, so let us nail it down with a concrete example before we go any further.

Consider three cities: A, B, C. Direct routes: A-B costs 1, B-C costs 1, A-C costs 3.

QuestionAnswerEdges UsedTotal Cost
MST (cheapest to connect all)Use A-B and B-CA-B(1), B-C(1)2
Shortest path A to CGo A-B-C via MSTPath weight: 1+1 = 22
Shortest-path tree from AA-B(1), A-C(3) gives shortest individual pathsA-B(1), A-C(3)4

The MST total cost is 2. The shortest-path tree from A total cost is 4 — worse! But the shortest-path tree guarantees each individual path (A-to-B, A-to-C) is optimal. The MST does not make this guarantee. In this example, the MST path from A to C (A→B→C = 2) happens to be shorter than the direct route (A-C = 3). But that is a coincidence. With different weights, the MST path can be much longer than the shortest path.

Change the weights: A-B = 10, B-C = 10, A-C = 3. Now the MST is A-C(3) + B-C(10) = 13. The MST path from A to B is A→C→B = 13. But the shortest path from A to B is the direct edge A-B = 10. The MST path is 30% longer. MST minimizes total infrastructure cost. Shortest-path trees minimize individual route lengths. Different optimization objectives, different answers.

Concept check: A connected graph has 8 vertices and 15 edges. How many edges does ANY spanning tree of this graph have?

Chapter 1: The Cut Property

Before we learn any MST algorithm, we need to understand WHY greedy works here. Not all optimization problems yield to greedy strategies — knapsack, for instance, does not. The reason MST does is a beautiful theorem called the cut property.

What Is a Cut?

A cut of a graph G = (V, E) is a partition of the vertices into two non-empty sets S and V \ S. Think of drawing a line across the graph, putting some vertices on one side and the rest on the other. Every edge with one endpoint in S and the other in V \ S is a crossing edge.

A cut respects a set of edges A if no edge in A crosses the cut. A crossing edge (u, v) is a light edge if its weight is the minimum among all crossing edges.

The Theorem

Let A be a subset of edges contained in some MST. Let (S, V \ S) be any cut that respects A. Let (u, v) be a light edge crossing this cut. Then (u, v) is safe for A — meaning A ∪ {(u, v)} is still contained in some MST.

In plain English: if you draw a line across the graph and no edge you have already chosen crosses that line, then the cheapest edge crossing that line is always a safe choice for the MST.

The Generic MST Algorithm

CLRS presents a beautiful abstraction: the generic MST algorithm. Both Kruskal and Prim are special cases of this single template.

GENERIC-MST(G, w):
  A = {}
  while A does not form a spanning tree:
    find a cut (S, V\S) that respects A
    let (u, v) be a light edge crossing the cut
    A = A ∪ {(u, v)}
  return A

The theorem guarantees that at every step, there EXISTS such a cut and such a light edge. The specific algorithms differ only in HOW they choose the cut:

AlgorithmHow It Chooses the Cut
KruskalS = the component containing one endpoint of the lightest remaining edge
PrimS = the vertices already in the growing tree
BoruvkaS = each component independently (all cuts simultaneously)

This unification is elegant because it separates the correctness proof (the cut property) from the efficiency analysis (which depends on data structures). Any algorithm that follows the generic template and uses the cut property correctly will produce an MST. The only question is how fast it can find the right cut and the light edge.

Why this matters. The cut property is the theoretical engine behind BOTH Kruskal's and Prim's algorithms. Kruskal exploits it by treating each connected component as one side of a cut. Prim exploits it by treating the growing tree as one side and everything else as the other side. Both are just different instantiations of the same theorem.

Proof by Exchange Argument

The proof is elegant and uses a technique you will see again and again in algorithm design: the exchange argument. Here is how it works.

Assume A ⊆ T for some MST T. Let (u, v) be a light edge crossing cut (S, V \ S) that respects A. We want to show A ∪ {(u, v)} is contained in some MST.

Case 1: (u, v) ∈ T. Then A ∪ {(u, v)} ⊆ T, and we are done.

Case 2: (u, v) ∉ T. Adding (u, v) to T creates exactly one cycle (because T is a tree, so adding any edge creates exactly one cycle). This cycle must contain another edge (x, y) that crosses the same cut (S, V \ S), because the cycle enters S through one crossing edge and must leave through another.

Since (u, v) is a light edge, w(u, v) ≤ w(x, y). Remove (x, y) from T and add (u, v). Call the result T'. Then:

w(T') = w(T) - w(x, y) + w(u, v) ≤ w(T)

Since T was an MST, w(T') ≥ w(T). Combined with w(T') ≤ w(T), we get w(T') = w(T). So T' is also an MST, and A ∪ {(u, v)} ⊆ T'. Done.

The exchange trick. We did not prove that (u, v) is in every MST. We proved that if it is not in a particular MST, we can swap it in by removing another crossing edge without increasing the total weight. This exchange argument is the gold standard for proving greedy correctness.

The Corollary: Unique Lightest Crossing Edge

If the light edge crossing a cut is unique (no ties), then it is in EVERY MST. The proof is the same exchange argument: if some MST T does not contain it, we swap it in and get a strictly lighter tree (w(u,v) < w(x,y) since there is no tie), contradicting T being an MST. This is why distinct edge weights guarantee a unique MST.

The Cycle Property (Dual of Cut Property)

There is a dual result: the cycle property. For any cycle C in the graph, the heaviest edge in C is NOT in any MST (assuming distinct weights). Proof: suppose the heaviest edge e = (u,v) in cycle C is in some MST T. Remove e from T, splitting it into two components S and V\S. The cycle C must contain another edge e' crossing this same cut (the cycle enters and leaves S). Since e is the heaviest edge in C, w(e') < w(e). But then T - {e} + {e'} is a lighter spanning tree, contradicting T being an MST.

Cut property + Cycle property = complete MST toolkit. The cut property tells you which edges MUST be in the MST (lightest crossing a cut). The cycle property tells you which edges CANNOT be in the MST (heaviest in a cycle). Together, they provide both inclusion and exclusion rules. Kruskal uses the cut property to include edges. When Kruskal skips an edge because it would create a cycle, it is implicitly applying the cycle property — that edge is the heaviest in the cycle it would create (since all lighter edges in the cycle were already added).

Worked Example: Using the Cut Property

Consider a graph with vertices {A, B, C, D} and edges: A-B(2), A-C(3), B-C(1), B-D(4), C-D(5).

Start with A = {} (no MST edges yet). Consider the cut ({A}, {B, C, D}). It respects A (trivially, since A is empty). Crossing edges: A-B(2), A-C(3). The light edge is A-B(2). By the cut property, A-B is safe. Add it: A = {A-B}.

Now consider the cut ({A, B}, {C, D}). Does it respect A? The only edge in A is A-B, which has both endpoints in {A, B}, so yes. Crossing edges: A-C(3), B-C(1), B-D(4), C-D(5). Wait — C-D has both endpoints in {C, D}, so it does not cross. Crossing edges: A-C(3), B-C(1), B-D(4). Light edge: B-C(1). Add it: A = {A-B, B-C}.

Now consider the cut ({A, B, C}, {D}). Respects A? A-B has both ends in the left set. B-C has both ends in the left set. Yes. Crossing edges: B-D(4), C-D(5). Light edge: B-D(4). Add it: A = {A-B, B-C, B-D}.

We have 3 = V-1 edges. MST complete. Total weight = 2+1+4 = 7. Each step was justified by the cut property applied to a different cut.

See It: Cuts and Crossing Edges

The canvas below shows a graph with a cut (orange dashed line). Vertices on each side are colored differently. All crossing edges are highlighted. The lightest crossing edge is marked. Try different cuts with the buttons to see how the light edge changes.

Cut Property Visualizer

Click "Next Cut" to cycle through different cuts. The lightest crossing edge (safe to add) is shown in green.

Cut 1 of 4
Concept check: You have an MST-in-progress with edges A. You find a cut that respects A. There are three crossing edges with weights 5, 3, and 7. Which edge is safe to add?

Chapter 2: Kruskal's Algorithm

Kruskal's algorithm is beautifully simple. Sort all edges by weight. Go through them from lightest to heaviest. For each edge, if it connects two different components (does not create a cycle), add it to the MST. Otherwise, skip it. Stop when you have V - 1 edges.

The Algorithm Step by Step

1. Sort
Sort all E edges by weight in non-decreasing order. Cost: O(E log E).
2. Initialize
Create a disjoint-set (Union-Find) with each vertex in its own set. V components total.
3. Iterate
For each edge (u, v) in sorted order: if Find(u) ≠ Find(v), add (u, v) to MST and Union(u, v). Otherwise skip — they are already connected.
4. Done
After V - 1 edges added, the MST is complete.

Why It Works (via Cut Property)

When Kruskal considers edge (u, v), suppose u and v are in different components Cu and Cv. Define the cut as (Cu, V \ Cu). This cut respects all previously chosen MST edges (because no chosen edge crosses component boundaries — each component is a tree of chosen edges). Edge (u, v) crosses this cut. Because edges are sorted by weight, (u, v) is the lightest edge we have seen that crosses this cut. By the cut property, it is safe to add.

Complexity Analysis

// Sorting dominates
T(n) = O(E log E) + O(E · α(V)) + O(V)

// Since E ≤ V², we have log E ≤ 2 log V, so O(E log E) = O(E log V)
// α(V) is the inverse Ackermann function — effectively constant
// Total: O(E log E) = O(E log V)

The bottleneck is sorting. Let us count operations precisely:

OperationCountCost EachTotal
Sort edges1O(E log E)O(E log E)
MakeSetVO(1)O(V)
Find (cycle check)2E (two per edge)O(α(V))O(E α(V))
UnionV - 1O(α(V))O(V α(V))
TotalO(E log E)

Since E ≤ V2 for simple graphs, log E ≤ 2 log V, so O(E log E) = O(E log V). Since E ≥ V - 1 for connected graphs, the sorting dominates all other costs.

Can we do better? If edge weights are integers in range [0, W], we can use radix sort in O(E + W) instead of comparison sort in O(E log E). This makes Kruskal O(E α(V)) — nearly linear. For floating-point weights, comparison sort is necessary and O(E log E) is a lower bound.

Kruskal with Pre-Sorted Edges

Sometimes the edges come pre-sorted (from a Delaunay triangulation, for instance, or from reading a sorted file). In that case, Kruskal's sorting step is free, and the total runtime is O(E α(V)) — essentially linear. This makes Kruskal the fastest known algorithm for pre-sorted edge lists.

Correctness Argument: Why Kruskal Never Makes a Mistake

Claim: every edge Kruskal adds is in some MST. Every edge Kruskal skips is in no MST.

Edge added: When Kruskal adds edge (u, v), vertices u and v are in different components Cu and Cv. The cut (Cu, V \ Cu) respects all previously chosen edges (all chosen edges have both endpoints within the same component). Edge (u, v) is the lightest crossing edge (if there were a lighter crossing edge, Kruskal would have processed it first, since edges are sorted). By the cut property, (u, v) is safe.

Edge skipped: When Kruskal skips edge (u, v), both u and v are in the same component. The previously added edges form a tree within this component. Adding (u, v) would create a cycle. The cycle contains (u, v) and some subset of already-added edges. Since all previously added edges have weight ≤ w(u, v) (Kruskal processes in sorted order), (u, v) is the heaviest edge in the cycle. By the cycle property, (u, v) is in no MST.

This is a complete correctness proof: every edge is either added (correctly) or skipped (correctly). No edge is mislabeled.

Worked Example

Consider this graph with 6 vertices (A-F) and 9 edges:

EdgeWeightActionComponents After
A-B1Add (A≠B){A,B} {C} {D} {E} {F}
C-E2Add (C≠E){A,B} {C,E} {D} {F}
D-F2Add (D≠F){A,B} {C,E} {D,F}
A-C3Add (A≠C){A,B,C,E} {D,F}
B-C4Skip (B=C, same component){A,B,C,E} {D,F}
E-F4Add (E≠F){A,B,C,E,D,F}
Stop — 5 edges added (V-1 = 5). MST weight = 1+2+2+3+4 = 12.

See It: Kruskal Step-Through

Click "Step" to add the next edge (if valid) or skip it (if it creates a cycle). Watch the Union-Find state below the graph.

Kruskal's Algorithm — Step by Step

Each step considers the next lightest edge. Green = added, red = skipped (would create cycle).

Ready. 0 / 5 edges in MST.

Implementation

python
class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))
        self.rank = [0] * n

    def find(self, x):
        if self.parent[x] != x:
            self.parent[x] = self.find(self.parent[x])  # path compression
        return self.parent[x]

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry: return False
        if self.rank[rx] < self.rank[ry]: rx, ry = ry, rx
        self.parent[ry] = rx  # union by rank
        if self.rank[rx] == self.rank[ry]: self.rank[rx] += 1
        return True

def kruskal(n, edges):
    """n = number of vertices, edges = [(w, u, v), ...]"""
    edges.sort()  # sort by weight
    uf = UnionFind(n)
    mst = []
    for w, u, v in edges:
        if uf.union(u, v):
            mst.append((w, u, v))
            if len(mst) == n - 1:
                break
    return mst
Data flow. Input: n (int), edges list of (weight, u, v) tuples. Output: list of (weight, u, v) tuples forming the MST. Total weight = sum of first elements. The Union-Find tracks component membership. Each find() call returns the root of the component tree. Each union() call merges two component trees.

What If the Graph Is Disconnected?

If the graph is not connected, no spanning tree exists (you cannot span all vertices with a tree if some are unreachable). Kruskal will add fewer than V-1 edges and stop. You can detect this: if len(mst) < n - 1 after processing all edges, the graph is disconnected. The edges Kruskal DID add form a minimum spanning forest — a collection of MSTs, one per connected component.

Handling Duplicate Weights

When multiple edges share the same weight, the sort order among them is arbitrary. Different orderings can produce different MSTs, but ALL will have the same total weight. This is because the multiset of edge weights in ANY MST of a graph is unique (even though the specific edges may differ). You can prove this using the matroid intersection theorem, but the intuition is: if you could get a lighter multiset, you would have a lighter MST.

Concept check: In Kruskal's algorithm, you are about to add edge (u, v) with weight 5. Find(u) returns 3 and Find(v) returns 3. What do you do?

Chapter 3: Prim's Algorithm

While Kruskal thinks about edges ("which edge is cheapest globally?"), Prim thinks about vertices ("which vertex is cheapest to connect next?"). Prim grows a single tree from a starting vertex, always adding the cheapest edge that connects a tree-vertex to a non-tree-vertex.

The Algorithm Step by Step

1. Initialize
Pick any starting vertex s. Set key[s] = 0, key[v] = ∞ for all others. Insert all vertices into a min-priority queue keyed by key[v]. Set parent[v] = NIL for all v.
2. Extract-Min
Remove the vertex u with the smallest key from the priority queue. Vertex u joins the MST tree.
3. Update Neighbors
For each neighbor v of u still in the queue: if w(u, v) < key[v], update key[v] = w(u, v) and parent[v] = u. (This is called "relaxation.")
↻ repeat 2-3 until queue is empty

Why It Works (via Cut Property)

At each step, the vertices already extracted form a set S. The vertices still in the queue form V \ S. This is a cut. The edges from S to V \ S have their cheapest tracked in the priority queue via key[v]. When we extract the minimum, we are choosing the light edge crossing the cut (S, V \ S). The cut respects the MST edges chosen so far (all chosen edges connect vertices within S). By the cut property, this light edge is safe.

Complexity Analysis

// With a binary heap:
T(n) = O(V · log V) + O(E · log V)
= O((V + E) log V)

// V Extract-Min operations: O(V log V)
// Up to E Decrease-Key operations: O(E log V)

// With a Fibonacci heap:
T(n) = O(V log V + E)
// Decrease-Key becomes O(1) amortized

Let us count operations for Prim with a binary heap:

OperationCountBinary Heap CostFibonacci Heap Cost
InsertVO(log V)O(1)
Extract-MinVO(log V)O(log V) amortized
Decrease-KeyEO(log V)O(1) amortized
TotalO((V+E) log V)O(E + V log V)

For dense graphs where E = Θ(V2), Prim with a Fibonacci heap gives O(V2 + V log V) = O(V2), which beats Kruskal's O(V2 log V). For sparse graphs where E = O(V), Kruskal's O(V log V) matches Prim's O(V log V). In practice, binary heap Prim and Kruskal perform similarly because Fibonacci heaps have high constant factors.

The adjacency-matrix trick. For very dense graphs, Prim can be implemented without any priority queue: maintain a key[] array and find the minimum by linear scan in O(V) time. Total: O(V2). No heap needed. This is simpler and faster than heap-based Prim when E = Θ(V2). It is equivalent to Dijkstra's O(V2) implementation for dense graphs.

Worked Example

Same graph as Kruskal (6 vertices A-F). Start from vertex A:

StepExtractKeyEdge AddedQueue Updates
1A (key=0)0— (start)B:1, C:3 updated
2B (key=1)1A-B (w=1)C: min(3,4)=3, D:6 updated
3C (key=3)3A-C (w=3)E:2 updated, D: min(6,5)=5
4E (key=2)2C-E (w=2)F:4 updated
5F (key=4)4E-F (w=4)D: min(5,2)=2
6D (key=2)2F-D (w=2)Queue empty

MST edges: A-B(1), A-C(3), C-E(2), E-F(4), F-D(2). Total = 12. Same total weight as Kruskal, but possibly different edges if weights tied.

Understanding key[] vs distance[]

This is the conceptual core of Prim. The key[v] array stores the cheapest single edge connecting v to any vertex already in the tree. It does NOT store the total path cost from the start vertex to v. This is the fundamental difference from Dijkstra.

When vertex u is extracted and we examine its neighbor v, we ask: "Is the direct edge (u,v) cheaper than v's current best connection to the tree?" If yes, update key[v] = w(u,v) and remember parent[v] = u. This is a comparison of single edge weights, not accumulated distances.

This means key[v] can decrease when we find a better single-edge connection, but it has nothing to do with how far v is from the starting vertex. A vertex at the "opposite end" of the graph might have key = 1 if there is a cheap edge connecting it to the tree.

See It: Prim Step-Through

Click "Step" to extract the next vertex and grow the tree. The frontier edges (candidates for next addition) are shown as dashed lines. The cheapest frontier edge is highlighted.

Prim's Algorithm — Step by Step

The tree grows from vertex A. Frontier edges shown dashed. Cheapest frontier edge in green.

Ready. Tree: {A}. 0 edges.

Implementation

python
import heapq

def prim(n, adj):
    """n = vertices, adj = {u: [(w, v), ...]} adjacency list"""
    in_mst = [False] * n
    key = [float('inf')] * n
    parent = [-1] * n
    key[0] = 0
    pq = [(0, 0)]  # (key, vertex)
    mst_edges = []

    while pq:
        k, u = heapq.heappop(pq)
        if in_mst[u]: continue  # already extracted
        in_mst[u] = True
        if parent[u] != -1:
            mst_edges.append((k, parent[u], u))

        for w, v in adj[u]:
            if not in_mst[v] and w < key[v]:
                key[v] = w
                parent[v] = u
                heapq.heappush(pq, (w, v))  # lazy decrease-key

    return mst_edges
Lazy vs eager decrease-key. Python's heapq does not support decrease-key. Instead we push duplicates and skip stale entries (the if in_mst[u]: continue line). This is the "lazy" approach. It uses O(E) space in the heap but keeps the code simple. The time complexity remains O(E log E) = O(E log V). In production C++ code, you would use an indexed priority queue for true O(E log V).

Prim for Dense Graphs (No Heap)

python
def prim_dense(n, weight):
    """weight[u][v] = edge weight (inf if no edge). O(V^2)."""
    in_mst = [False] * n
    key = [float('inf')] * n
    parent = [-1] * n
    key[0] = 0

    for _ in range(n):
        # Find min-key vertex not yet in MST (linear scan)
        u = -1
        for v in range(n):
            if not in_mst[v] and (u == -1 or key[v] < key[u]):
                u = v
        in_mst[u] = True

        # Relax all neighbors
        for v in range(n):
            if not in_mst[v] and weight[u][v] < key[v]:
                key[v] = weight[u][v]
                parent[v] = u

    # Reconstruct MST edges
    mst = [(key[v], parent[v], v) for v in range(1, n)]
    return mst
When to use this. LeetCode 1584 (Min Cost to Connect All Points) gives n points and you must connect them all with minimum total Manhattan distance. The complete graph has n2/2 edges. With n = 1000, Kruskal needs to sort 500,000 edges. Prim-dense needs 10002 = 1,000,000 operations with no sorting. Prim-dense is 5x faster for this specific problem. This is why knowing both algorithms matters — each has its niche.

Prim vs Dijkstra: Same Shape, Different Meaning

If you have already studied Dijkstra's shortest-path algorithm, Prim will look eerily familiar. Both algorithms:

1. Init
Set source to 0, all others to ∞. Insert into priority queue.
2. Extract-Min
Pop the vertex with smallest key/distance.
3. Relax Neighbors
Update neighbors still in the queue if a better connection is found.
↻ repeat

The difference is in step 3. Prim asks: "Is the edge weight w(u,v) less than key[v]?" Dijkstra asks: "Is the path distance d[u] + w(u,v) less than d[v]?" Prim cares about the cheapest single edge. Dijkstra cares about the cheapest total path. One builds an MST. The other builds a shortest-path tree. Same skeleton, different bone density.

Starting Vertex Does Not Matter

Prim starts from an arbitrary vertex. Different starting vertices may produce different MSTs (when weights tie), but they always produce the same total weight. If all edge weights are distinct, the MST is unique regardless of starting vertex. This is because the MST is a property of the graph, not the algorithm.

Prim's Algorithm Correctness

At each step, Prim maintains a tree T rooted at the start vertex. The cut (V(T), V \ V(T)) respects T (all edges of T have both endpoints in V(T)). The vertex extracted from the priority queue is connected to V(T) by the lightest crossing edge (that is what key[v] stores). By the cut property, this edge is safe.

The loop invariant: at the start of each iteration, the set of edges connecting extracted vertices forms a tree that is a subset of some MST. Since we add V-1 edges total and each is safe, the final result is an MST. This is the same invariant as the generic MST algorithm, just instantiated with the "growing tree" cut.

Space Complexity Comparison

AlgorithmGraph StorageExtra SpaceTotal
KruskalO(E) edge listO(V) Union-FindO(V + E)
Prim (heap)O(V + E) adj listO(V) key/parent + O(E) lazy heapO(V + E)
Prim (dense)O(V2) adj matrixO(V) key/parentO(V2)

Kruskal requires the edges as a list (for sorting). Prim requires an adjacency list or matrix (for neighbor iteration). If you are given one format and need the other, the conversion is O(V + E). In interviews, mention which format your algorithm expects and whether the input needs conversion.

Concept check: In Prim's algorithm, vertex u is extracted from the priority queue with key[u] = 7. Its neighbor v is still in the queue with key[v] = 4. The edge weight w(u, v) = 3. What happens?

Chapter 4: Kruskal vs Prim

Both algorithms produce optimal MSTs. They differ in perspective, data structures, and performance characteristics. Knowing when to use which is an interview favorite.

Side-by-Side Comparison

PropertyKruskalPrim
PerspectiveEdge-centric ("cheapest edge globally")Vertex-centric ("cheapest to connect next")
Growth patternForest of trees that mergeSingle tree that grows
Data structureUnion-FindMin-priority queue
Time (binary heap)O(E log E)O((V+E) log V)
Time (Fibonacci heap)O(E log E)O(E + V log V)
Best for sparseYes — sort dominates, E is smallEqual
Best for denseSlower — E ≈ V2, sorting V2 edgesYes — with Fib heap, O(V2)
Needs edge listYes (natural input format)No (adjacency list)
ParallelizableSomewhat — sort in parallelHarder — sequential extract-min
ImplementationSimpler (Union-Find is short)Slightly more complex
Interview rule of thumb. Default to Kruskal for interviews — the code is shorter and Union-Find is easier to implement from scratch. Use Prim when you are given an adjacency list or the problem is about dense graphs. If the interviewer asks you to optimize, mention Fibonacci heaps for Prim and note that Kruskal cannot benefit from them.

See It: Both Algorithms Side by Side

The canvas below runs both algorithms on the same graph simultaneously. Watch how Kruskal processes edges globally while Prim grows a tree locally. Same result, different journey.

Kruskal vs Prim — Side by Side

Click "Step Both" to advance both algorithms. Left = Kruskal, Right = Prim.

Ready.

When MSTs Differ

If all edge weights are distinct, both algorithms produce the same unique MST. When weights tie, they may produce different MSTs (same total weight, different edge sets). Kruskal's tie-breaking depends on sort stability and input order. Prim's depends on starting vertex and tie-breaking in the priority queue.

Subtle point: The MST itself may not be unique, but the sorted sequence of edge weights in the MST is always the same. If your MST has edges with weights [1, 2, 3, 4, 5], every other MST of the same graph also has weights [1, 2, 3, 4, 5] (counted with multiplicity). This is the MST weight uniqueness theorem, provable via the matroid theory generalization of MSTs.

Processing Order Comparison

To see the fundamental difference, let us trace both algorithms on the same graph with 4 vertices {A,B,C,D} and edges AB(1), AC(4), AD(3), BC(2), BD(5), CD(6).

Kruskal's order (by edge weight): AB(1)→BC(2)→AD(3)→AC(4)→BD(5)→CD(6)

EdgeWeightFind(u), Find(v)ActionComponents
A-B1A, BAdd{A,B} {C} {D}
B-C2A, C (A is root of B)Add{A,B,C} {D}
A-D3A, DAdd{A,B,C,D}
Done. MST = {AB(1), BC(2), AD(3)}. Weight = 6.

Prim's order (starting from A, by key value):

ExtractKeyEdge AddedQueue After Updates
A0B:1, C:4, D:3
B1A-B(1)C: min(4,2)=2, D: min(3,5)=3
C2B-C(2)D: min(3,6)=3
D3A-D(3)(empty)

Same MST! Both found {AB(1), BC(2), AD(3)} with weight 6. But look at the processing order: Kruskal processed edges globally sorted. Prim grew from A outward, processing vertices in order of their connection cost.

Numerical Comparison

Let us compare the two algorithms on concrete graph sizes to build intuition for when each is faster:

Graph TypeVEKruskal O(E log E)Prim-heap O((V+E)log V)Prim-dense O(V2)Winner
Sparse tree-like10002000~22K~30K1MKruskal
Medium density100050K~780K~510K1MPrim-heap
Dense (complete)1000500K~9.5M~5M1MPrim-dense
Huge sparse1M3M~65M~60M1012Kruskal

The pattern: Kruskal wins for sparse graphs. Prim-heap wins for medium density. Prim-dense (no heap) wins for complete or near-complete graphs. Knowing all three gives you the right tool for every graph density.

Concept check: You have a connected graph with 1000 vertices and 2000 edges. Which algorithm is likely faster in practice?

Chapter 5: Union-Find (Disjoint Sets)

Kruskal needs to answer one question fast: "Are u and v in the same component?" If they are, the edge creates a cycle and we skip it. If not, we add the edge and merge the components. The data structure that does this nearly instantaneously is Union-Find (also called the disjoint-set data structure).

Three Operations

MakeSet(x): Create a new set containing just x. Each element starts as its own singleton. Think: every town is its own island.

Find(x): Return the representative (root) of the set containing x. Two elements are in the same set if and only if they have the same representative. Think: "Who is the mayor of x's island?"

Union(x, y): Merge the sets containing x and y into one. Think: build a bridge between two islands, making them one.

Naive Implementation (and Why It's Slow)

The simplest version: each element has a parent pointer. The root of each tree is the representative (its own parent). Find follows parent pointers to the root. Union makes one root point to the other.

Problem: without care, the trees become tall chains. If you Union 1→2→3→4→...→n, then Find(1) takes O(n) time. After m operations, total cost is O(mn). Terrible.

Optimization 1: Union by Rank

Keep a rank for each root (roughly, the height of the tree). When unioning, make the shorter tree point to the taller tree. This guarantees tree height is at most O(log n).

// Union by rank rule:
if rank[root_x] < rank[root_y]: parent[root_x] = root_y
elif rank[root_x] > rank[root_y]: parent[root_y] = root_x
else: parent[root_y] = root_x; rank[root_x] += 1

// With union by rank alone: Find is O(log n)

Optimization 2: Path Compression

During Find(x), after we walk up to the root, make every node on the path point directly to the root. Next time we call Find on any of those nodes, it returns in O(1).

// Path compression (recursive):
def find(x):
  if parent[x] != x:
    parent[x] = find(parent[x]) # flatten the tree
  return parent[x]

Combined: Nearly O(1)

With both union by rank and path compression, any sequence of m operations on n elements takes O(m · α(n)) time, where α is the inverse Ackermann function. This function grows so slowly that α(n) ≤ 4 for any n up to 265536. For all practical purposes, each operation is O(1).

The Ackermann function explodes. A(4, 2) has 19,729 digits. A(5, 1) is incomprehensibly large. The inverse Ackermann function α(n) asks: "how many times must we apply A before reaching n?" The answer is at most 4 for any number we could ever store in any computer. This is why Union-Find is effectively constant-time per operation — the theoretical log factor exists mathematically but is unobservable in practice.

Why Both Optimizations Together?

Union by rank alone gives O(log n) per operation. Path compression alone (without union by rank) gives O(log n) amortized per operation. But together they achieve O(α(n)) amortized — better than either alone. The reason is subtle: path compression flattens the tree, making future finds faster, while union by rank prevents the tree from getting deep in the first place. Rank limits the "damage" and compression repairs it.

OptimizationsFind Timem operations on n elements
NeitherO(n) worst caseO(mn)
Union by rank onlyO(log n)O(m log n)
Path compression onlyO(log n) amortizedO(m log n)
BothO(α(n)) amortizedO(m α(n))

Rank vs Size

An alternative to union by rank is union by size: always attach the smaller tree under the larger tree. Both give O(log n) height guarantees, and both combine with path compression to give O(α(n)). Union by size is sometimes preferred because the size is always accurate (ranks become stale after path compression, though this does not affect correctness). In interviews, either is fine.

python
# Union by size variant
class UnionFindSize:
    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n   # size instead of rank

    def find(self, x):
        if self.parent[x] != x:
            self.parent[x] = self.find(self.parent[x])
        return self.parent[x]

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry: return False
        if self.size[rx] < self.size[ry]:
            rx, ry = ry, rx
        self.parent[ry] = rx
        self.size[rx] += self.size[ry]  # size is always accurate
        return True

See It: Interactive Union-Find

Click two nodes to Union them. Watch the trees merge. Notice how path compression flattens the tree after each Find.

Union-Find Playground

Click a node to select it (yellow). Click another to Union them. Press "Find" to run Find on the selected node and see path compression.

Click a node to select it.

Full Implementation

python
class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))  # each element is its own root
        self.rank = [0] * n            # rank starts at 0
        self.count = n                   # number of disjoint sets

    def find(self, x):
        # Path compression: make every node point to root
        if self.parent[x] != x:
            self.parent[x] = self.find(self.parent[x])
        return self.parent[x]

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return False  # already in same set
        # Union by rank: attach shorter tree under taller
        if self.rank[rx] < self.rank[ry]:
            rx, ry = ry, rx
        self.parent[ry] = rx
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1
        self.count -= 1
        return True

    def connected(self, x, y):
        return self.find(x) == self.find(y)
Interview pattern. Union-Find appears in: number of connected components, redundant connections, accounts merge, earliest time everyone becomes friends, graph valid tree, number of islands (union variant). The template is always: (1) init UF with n elements, (2) for each edge, union the endpoints, (3) answer = uf.count or some query on the structure.

Weighted Union-Find (for MST Variants)

Sometimes you need not just "are x and y connected?" but "what is the weight/distance/relation between x and y within the component?" This is the weighted Union-Find, where each node stores a weight relative to its parent. Find returns both the root and the accumulated weight from x to root. Union adjusts weights so the invariant is maintained.

python
class WeightedUnionFind:
    def __init__(self, n):
        self.parent = list(range(n))
        self.rank = [0] * n
        self.weight = [0] * n  # weight[x] = weight from x to parent[x]

    def find(self, x):
        if self.parent[x] == x:
            return x, 0
        root, w = self.find(self.parent[x])
        self.parent[x] = root  # path compression
        self.weight[x] += w     # accumulate weight to root
        return root, self.weight[x]

    def union(self, x, y, w):
        """Assert weight(y) - weight(x) = w"""
        rx, wx = self.find(x)
        ry, wy = self.find(y)
        if rx == ry:
            return wy - wx == w  # consistency check
        # weight[ry] = wx + w - wy
        self.weight[ry] = wx + w - wy
        self.parent[ry] = rx
        return True

This appears in problems like "evaluate division" (LeetCode 399), where edges have multiplicative weights and you need to compute the product along paths.

Iterative Path Compression (Stack-safe)

The recursive Find can overflow the stack for very deep trees (before compression flattens them). Here is the iterative version:

python
def find_iterative(self, x):
    # Phase 1: walk to root
    root = x
    while self.parent[root] != root:
        root = self.parent[root]

    # Phase 2: compress — point everything to root
    while self.parent[x] != root:
        nxt = self.parent[x]
        self.parent[x] = root
        x = nxt

    return root

Same asymptotic complexity. Two passes instead of one. No recursion. Use this in production code or when n can exceed the recursion limit (Python defaults to 1000).

Half-Path Compression (Simpler Alternative)

Instead of full path compression (all nodes point to root), path splitting makes each node point to its grandparent during Find. Simpler, no second pass, and still achieves O(α(n)) amortized:

python
def find_splitting(self, x):
    while self.parent[x] != x:
        self.parent[x] = self.parent[self.parent[x]]  # skip to grandparent
        x = self.parent[x]
    return x

This is the version used in many competitive programming implementations because it is one line of compression with no need for a separate root-finding pass.

Persistent Union-Find

What if you need to answer queries about component structure at different points in time? For example: "After adding the first k edges, how many components are there?" You can use offline processing: sort queries by k, process edges in order, answer queries as you go. For fully persistent Union-Find (query any past state), use a persistent array with O(log n) overhead per operation.

Union-Find Complexity Proof Sketch

Why does union by rank guarantee O(log n) height? Each time a node's depth increases by 1, its subtree size at least doubles (because it was unioned with a subtree of equal or larger size). A subtree of n elements can double at most log2(n) times before it contains all elements. Therefore, maximum height = log2(n).

The proof that path compression + union by rank gives O(α(n)) amortized is one of the most beautiful results in algorithm analysis. It uses a potential function that assigns a "bank account" to each node based on its rank relative to its parent's rank. Each Find pays for the compression work by spending saved potential. The inverse Ackermann function α emerges from counting how many "levels" of rank compression are needed. The full proof is in CLRS Chapter 21 and is worth reading for anyone interested in amortized analysis.

Concept check: You perform these operations: MakeSet(0..7), Union(0,1), Union(2,3), Union(4,5), Union(6,7), Union(0,2), Union(4,6), Union(0,4). How many Find operations does Find(7) take to reach the root WITHOUT path compression, if trees were built with union-by-rank?

Chapter 6: MST Showcase

This is the payoff. Below is a fully interactive graph editor. Place vertices, draw weighted edges, then run Kruskal or Prim with step-by-step animation. Modify the graph and see how the MST changes.

Interactive MST Builder

Build: Click empty space to place a vertex. Click vertex then vertex to add an edge (enter weight in prompt). Run: Choose algorithm and click "Run" to animate MST construction. Delete: Right-click a vertex or edge to remove it.

Place vertices by clicking. Connect them by clicking two vertices.
MST Weight: —
Experiments to try. (1) Build a triangle with weights 1, 2, 3. The MST uses edges 1 and 2. (2) Add a vertex in the middle connected to all three — watch which edges get replaced. (3) Make a complete graph on 5 vertices with random weights, run Kruskal, then run Prim from different start vertices — same total weight every time. (4) Make a graph where the cheapest edge creates a cycle — Kruskal skips it.

Understanding the Showcase Controls

Adding vertices: Click on empty space. Each vertex gets a sequential number (0, 1, 2, ...). You can place up to 15 vertices.

Adding edges: Click on a vertex to select it (turns yellow). Click on another vertex. A prompt appears asking for the edge weight. Enter a number between 1 and 99. The edge appears between the two vertices with the weight label.

Deleting: Right-click on a vertex to remove it and all its edges. Right-click near an edge to remove just that edge. This lets you experiment with how the MST changes when the graph structure changes.

Running algorithms: Select Kruskal or Prim from the dropdown. Click "Run" for automatic animation or "Step" to advance one step at a time. Watch how the algorithms process edges differently — Kruskal considers edges globally by weight, while Prim grows a tree from vertex 0.

"Random Graph" generates 6-8 vertices with a connected spanning tree plus extra random edges. The edge weights are proportional to Euclidean distance. This gives you a realistic graph to test on without manual construction.

What to Observe

Try building a graph where Kruskal and Prim process edges in very different orders but end up with the same MST. Then try building one where they produce different MSTs (same total weight). The key: tie-breaking. With distinct weights, both always produce the identical MST.

Build a "star" graph (one central vertex connected to all others) and observe that the MST is the star itself — every edge is a bridge (removing it disconnects the graph), so every edge must be in any spanning tree. Build a "ring" graph (cycle) and observe that the MST removes the single heaviest edge — the cycle property in action.

Build a complete graph on 4 vertices (K4) with weights 1, 2, 3, 4, 5, 6 and predict which 3 edges the MST will use before running the algorithm. The answer: the three lightest edges (1, 2, 3) IF they do not form a cycle. K4 has vertices {0,1,2,3} and 6 edges. Assign weight 1 to edge 0-1, weight 2 to edge 0-2, weight 3 to edge 1-2, weight 4 to edge 0-3, weight 5 to edge 1-3, weight 6 to edge 2-3. Kruskal adds 0-1(1), 0-2(2), then skips 1-2(3) because it forms a cycle, then adds 0-3(4). MST weight = 1+2+4 = 7.

Edge Cases to Test

Test CaseExpected BehaviorWhat It Tests
Single vertex, no edgesEmpty MST, weight 0Base case handling
Two vertices, one edgeThat edge is the MSTMinimal case
Disconnected graphMST impossible (or forest)Connectivity check
All equal weightsAny spanning tree is MSTTie-breaking
Negative weightsNegative edges always in MSTNegative weight handling
Self-loopsAlways skipped (cannot be in tree)Input validation
Parallel edges (multigraph)Keep only lightest per pairDuplicate handling

Chapter 7: MST Applications

MSTs are not just textbook exercises. They appear in network design, clustering, approximation algorithms, and even image segmentation. Here are the most important applications.

1. Network Design

The original motivation: connecting cities with minimum cable. This generalizes to any infrastructure network — roads, power lines, water pipes, computer networks. The vertices are locations, edges are possible connections, weights are construction costs. The MST gives the cheapest network that connects everything.

In practice, real networks add redundancy (you do not want a single cable failure to partition your network). But the MST is the starting point: it tells you the minimum cost of connectivity, and any additional edges you add for redundancy cost extra.

2. Clustering via MST

Here is a beautiful idea: to partition n points into k clusters, compute the MST and delete the k-1 heaviest edges. The remaining k connected components are your clusters.

Why does this work? The MST connects nearby points with cheap edges and distant points with expensive edges. Deleting the most expensive edges breaks the graph at its natural "seams" — the largest gaps between clusters.

Single-link clustering. This MST-based clustering is equivalent to single-linkage hierarchical clustering from statistics. The MST gives you the entire hierarchy for free: sort MST edges by weight, and each weight threshold defines a different number of clusters. No need to choose k in advance.

Clustering Implementation

python
def mst_cluster(points, k):
    """Cluster n points into k groups using MST."""
    n = len(points)
    # Build complete graph edges
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            dx = points[i][0] - points[j][0]
            dy = points[i][1] - points[j][1]
            edges.append(((dx*dx + dy*dy) ** 0.5, i, j))

    # Find MST via Kruskal
    mst_edges = kruskal(n, edges)

    # Sort MST edges by weight (descending)
    mst_edges.sort(reverse=True)

    # Remove k-1 heaviest edges
    uf = UnionFind(n)
    for w, u, v in mst_edges[k - 1:]:
        uf.union(u, v)

    # Return cluster labels
    labels = [uf.find(i) for i in range(n)]
    return labels

The spacing of a k-clustering is the minimum distance between any two points in different clusters. The MST-based algorithm maximizes spacing: the k-1 edges we remove are the heaviest in the MST, so the minimum inter-cluster distance (= the (k-1)-th heaviest MST edge) is as large as possible. CLRS proves this optimality in Exercise 23.2-8.

Why MST Clustering Works: The Proof

We want to show: the MST-based k-clustering maximizes spacing over all possible k-clusterings.

Let C* be the MST-based clustering (remove k-1 heaviest MST edges). Let d* be its spacing (the weight of the (k-1)-th heaviest MST edge). Let C be any other k-clustering. We will show spacing(C) ≤ d*.

Since C* and C both have k clusters but the underlying vertex set is the same, there must be two points p, q that are in the same cluster in C* but in different clusters in C. Consider the MST path from p to q. All edges on this path have weight ≤ d* (because the path is within a single component of C*, and we only removed edges with weight > d*).

Since p and q are in different clusters in C, the MST path from p to q must cross a cluster boundary in C at some edge (u, v). This edge has weight ≤ d*. Therefore, points u and v are in different clusters in C with distance ≤ d*. So spacing(C) ≤ d*. QED.

The MST clustering is provably optimal for maximizing minimum inter-cluster distance. This is one of the few clustering objectives with a polynomial-time exact solution. K-means, for instance, is NP-hard to solve optimally.

See It: MST Clustering

The canvas shows scattered points. The MST connects them all. Use the slider to remove the heaviest edges and watch clusters emerge.

MST Clustering

Slide to remove heavy edges. Colors show clusters.

Clusters (k) 1

3. Approximate TSP

The Travelling Salesman Problem (TSP) is NP-hard: find the shortest tour visiting every city exactly once. MSTs give a 2-approximation: compute the MST, do a DFS preorder walk, and the resulting tour is at most twice the optimal. This works because the MST weight is a lower bound on the optimal tour (any tour minus one edge is a spanning tree).

// TSP 2-approximation via MST:
1. Compute MST of the complete graph
2. Double every edge (Eulerian multigraph)
3. Find an Euler tour
4. Shortcut repeated vertices

// Result: tour weight ≤ 2 · MST weight ≤ 2 · OPT

4. Image Segmentation

Felzenszwalb's algorithm (2004) uses a variant of MST for image segmentation. Treat each pixel as a vertex and neighboring pixels as edges weighted by color difference. Run a modified Kruskal: merge two components if the cheapest edge between them is "internal" (its weight is less than the minimum internal difference of either component). The result is a set of segments that correspond to visually coherent regions.

// Felzenszwalb merge criterion:
merge(C1, C2) if w(e) ≤ min( Int(C1) + τ/|C1|, Int(C2) + τ/|C2| )

// Int(C) = max weight edge in MST of component C
// τ = threshold parameter controlling segment size
// Small τ = many small segments, large τ = few large segments

5. Boruvka's Algorithm (Bonus)

There is a third MST algorithm we have not covered: Boruvka's algorithm (1926, predates both Kruskal and Prim). Each component simultaneously picks its cheapest outgoing edge. All these edges are added in one round. Repeat until one component remains. Runtime: O(E log V). Boruvka is highly parallelizable — each round can be done in parallel — and is the basis for modern parallel MST algorithms.

6. Maze Generation

A popular way to generate random mazes: create a grid graph where each cell is a vertex and walls are edges. Assign random weights to all walls. Compute the MST. The MST edges become passages (walls removed). The result is a perfect maze — exactly one path between any two cells, no loops, no unreachable areas. Different random weights produce different mazes. This works because the MST is a spanning tree: connected (you can reach any cell) and acyclic (no loops, so every path is unique).

MST = minimum-cost infrastructure + maximum structure. Notice the pattern: MST gives you the cheapest connected structure with no redundancy. In networks, this means minimum cable. In clustering, it means the natural hierarchy of distances. In mazes, it means the unique path property. In image segmentation, it means following natural boundaries. The same mathematical object — a minimum-weight spanning tree — appears in wildly different domains because the underlying problem is the same: connect everything at minimum cost.

MST as Approximation: Christofides' Algorithm

The 2-approximation via MST can be improved. Christofides' algorithm (1976) gives a 1.5-approximation for metric TSP:

1. MST
Compute MST of the complete graph.
2. Odd Vertices
Find vertices with odd degree in the MST. By the handshaking lemma, there is an even number of them.
3. Matching
Find a minimum-weight perfect matching on the odd-degree vertices.
4. Euler Tour
Combine MST + matching edges. Every vertex now has even degree. Find an Euler tour.
5. Shortcut
Shortcut repeated vertices to get a Hamiltonian tour. Result ≤ 1.5 × OPT.

This remained the best polynomial-time approximation for metric TSP for 45 years until Karlin, Klein, and Gharan (2020) improved it by an epsilon. MST is the foundation of both algorithms.

MST Lower Bound for TSP: A Worked Example

Consider 4 cities in a square: A(0,0), B(1,0), C(1,1), D(0,1). All six edge weights are the Euclidean distances: AB=1, BC=1, CD=1, DA=1, AC=√2≈1.41, BD=√2≈1.41.

MST: pick edges AB(1), BC(1), CD(1). Total = 3. This is the cheapest way to connect all cities.

Optimal TSP tour: A→B→C→D→A. Total = 1+1+1+1 = 4.

The MST-based 2-approximation: double the MST edges (A-B, B-C, C-D, C-B, B-A, A-B... wait, that creates a multigraph). Walk the doubled MST in DFS preorder: A→B→C→D. Shortcut back: D→A. Tour = 1+1+1+1 = 4. In this case, the approximation gives the optimal tour.

The guarantee: MST weight (3) ≤ OPT (4) ≤ 2 × MST weight (6). The actual tour is 4, well within the bound. For some pathological cases the tour can approach 2 × MST, but never exceed it.

Concept check: You have 100 points and want to cluster them into 5 groups using MST-based clustering. How many MST edges do you delete?

Chapter 8: Interview Arsenal

MST problems are interview favorites because they test graph fundamentals, greedy reasoning, and Union-Find implementation all at once. Here is your cheat sheet and drill set.

Complexity Cheat Sheet

AlgorithmTimeSpaceData StructureBest For
KruskalO(E log E)O(V + E)Union-FindSparse graphs, edge lists
Prim (binary)O((V+E) log V)O(V + E)Binary heapDense with adj list
Prim (Fibonacci)O(E + V log V)O(V + E)Fibonacci heapVery dense graphs
BoruvkaO(E log V)O(V + E)Union-FindParallel MST

Coding Drill 1: Min Cost to Connect All Points

LeetCode 1584. Given n points in 2D, connect all points with minimum total Manhattan distance. Every pair is a potential edge.

python
def minCostConnectPoints(points):
    n = len(points)
    # Build edge list: O(n^2) edges
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            d = abs(points[i][0] - points[j][0]) + abs(points[i][1] - points[j][1])
            edges.append((d, i, j))
    edges.sort()

    # Kruskal
    uf = UnionFind(n)
    total = 0
    count = 0
    for w, u, v in edges:
        if uf.union(u, v):
            total += w
            count += 1
            if count == n - 1: break
    return total
Optimization note. With n = 1000, there are ~500,000 edges. Sorting takes ~10M comparisons. For this problem, Prim is actually faster because you avoid building the full edge list — just check all neighbors of each extracted vertex. Prim: O(V2) = 1,000,000. Kruskal: O(E log E) ≈ 10,000,000.

Coding Drill 2: Critical and Pseudo-Critical Edges

LeetCode 1489. An edge is critical if removing it increases MST weight. An edge is pseudo-critical if it appears in some (but not all) MSTs.

python
def findCriticalAndPseudoCritical(n, edges):
    # Add original index to each edge
    edges_idx = [(w, u, v, i) for i, (u, v, w) in enumerate(edges)]
    edges_idx.sort()
    base_weight = kruskal_weight(n, edges_idx)

    critical, pseudo = [], []
    for i in range(len(edges_idx)):
        # Test critical: remove edge i, check if MST weight increases
        w_without = kruskal_weight(n, edges_idx[:i] + edges_idx[i+1:])
        if w_without > base_weight:
            critical.append(edges_idx[i][3])
        else:
            # Test pseudo-critical: force-include edge i, check if MST weight equals base
            w_with = kruskal_weight_force(n, edges_idx, i)
            if w_with == base_weight:
                pseudo.append(edges_idx[i][3])
    return [critical, pseudo]

Coding Drill 3: MST with Constraints

Common variant: "Find MST but edge (a, b) must be included." Force-include the edge first (Union a and b), then run Kruskal on the rest. If the result is disconnected, it is impossible.

python
def mst_with_forced_edge(n, edges, forced_u, forced_v, forced_w):
    """MST that must include edge (forced_u, forced_v)."""
    uf = UnionFind(n)
    uf.union(forced_u, forced_v)  # force-include this edge
    mst = [(forced_w, forced_u, forced_v)]

    edges.sort()
    for w, u, v in edges:
        if (u == forced_u and v == forced_v) or \
           (u == forced_v and v == forced_u):
            continue  # skip the forced edge (already added)
        if uf.union(u, v):
            mst.append((w, u, v))
            if len(mst) == n - 1: break
    return mst

Coding Drill 4: Redundant Connection

LeetCode 684. Given a tree with one extra edge (n vertices, n edges), find the edge that, when removed, leaves a tree. This is the edge that creates the cycle.

python
def findRedundantConnection(edges):
    """edges = [[u, v], ...], 1-indexed"""
    n = len(edges)
    uf = UnionFind(n + 1)  # 1-indexed
    for u, v in edges:
        if not uf.union(u, v):
            return [u, v]  # this edge closes a cycle
Key insight: Process edges in order. The FIRST edge that fails to union (because both endpoints are already connected) is the answer. This is Kruskal in disguise — we are building a forest and detecting the first cycle-creating edge.

Coding Drill 5: Minimum Cost to Make Graph Connected

LeetCode 1319. Given n computers and a list of connections, find the minimum number of operations to connect all computers. An operation moves one cable from an existing connection to a new one.

python
def makeConnected(n, connections):
    if len(connections) < n - 1:
        return -1  # not enough cables
    uf = UnionFind(n)
    for u, v in connections:
        uf.union(u, v)
    return uf.count - 1  # need (components - 1) operations

Coding Drill 6: Accounts Merge

LeetCode 721. Each account is a name + list of emails. Two accounts belong to the same person if they share any email. Merge all accounts by person. This is a Union-Find problem in disguise: emails are vertices, accounts define edges (all emails in one account are connected).

python
from collections import defaultdict

def accountsMerge(accounts):
    uf = UnionFind(len(accounts) * 10)  # generous upper bound
    email_to_id = {}
    email_to_name = {}
    uid = 0

    for account in accounts:
        name = account[0]
        for email in account[1:]:
            if email not in email_to_id:
                email_to_id[email] = uid
                uid += 1
            email_to_name[email] = name
            uf.union(email_to_id[account[1]], email_to_id[email])

    # Group emails by root
    groups = defaultdict(list)
    for email, eid in email_to_id.items():
        groups[uf.find(eid)].append(email)

    return [[email_to_name[emails[0]]] + sorted(emails)
            for emails in groups.values()]

Coding Drill 7: Boruvka's Algorithm

For completeness, here is Boruvka implemented. It is rarely asked in interviews but understanding it deepens your MST knowledge and it has the best parallel complexity.

python
def boruvka(n, edges):
    """n = vertices, edges = [(w, u, v), ...]"""
    uf = UnionFind(n)
    mst = []
    num_components = n

    while num_components > 1:
        # For each component, find cheapest outgoing edge
        cheapest = [None] * n  # cheapest[root] = (w, u, v)

        for w, u, v in edges:
            ru, rv = uf.find(u), uf.find(v)
            if ru == rv: continue  # same component
            if cheapest[ru] is None or w < cheapest[ru][0]:
                cheapest[ru] = (w, u, v)
            if cheapest[rv] is None or w < cheapest[rv][0]:
                cheapest[rv] = (w, u, v)

        # Add all cheapest edges (avoid duplicates)
        for i in range(n):
            if cheapest[i] is not None:
                w, u, v = cheapest[i]
                if uf.union(u, v):
                    mst.append((w, u, v))
                    num_components -= 1

    return mst
Boruvka's key property: Each round halves the number of components (each component merges with at least one other). So there are at most O(log V) rounds. Each round scans all E edges. Total: O(E log V). The parallelism: in each round, all components can find their cheapest edge simultaneously.

Common Interview Mistakes

MistakeWhy It's WrongHow to Avoid
Confusing MST with shortest-path treeMST minimizes total weight, not individual pathsTrace both on a small example; show they differ
Forgetting path compression in Union-FindCorrect but slow — O(n log n) instead of O(n α(n))Always write the two-line path compression in Find
Using DFS for cycle detection instead of Union-FindWorks but O(V+E) per edge check = O(VE) totalUnion-Find gives O(α(n)) per check = O(E α(n)) total
Assuming MST is uniqueOnly unique when all edge weights are distinctNote: same total weight always, but edges may differ
Running Prim on disconnected graphPrim finds MST of one component onlyCheck if all vertices were extracted; if not, run Prim again from an unvisited vertex
Off-by-one: V edges instead of V-1Trees have V-1 edges, not VStop Kruskal at V-1 edges; check count before returning

MST Problem Variants in Interviews

Beyond the standard MST, interviewers love these twists:

Degree-constrained MST: No vertex can have more than d edges. This is NP-hard for d ≥ 2. In interviews, mention it as a limitation of standard MST.

Euclidean MST: Points in 2D/3D, edges are Euclidean distances. The complete graph has O(n2) edges, but the Euclidean MST can be found in O(n log n) using the Delaunay triangulation (MST is a subgraph of the Delaunay triangulation, which has O(n) edges). This is the optimal approach for LeetCode 1584 when n is large.

Online MST: Edges arrive one at a time. Maintain MST dynamically. When a new edge arrives, add it; if it creates a cycle, remove the heaviest edge in the cycle. This maintains the MST invariant. With a link-cut tree, each update is O(log n).

Quick Reference: MST Properties

These are properties you should be able to state and justify in an interview:

PropertyStatementProof Idea
UniquenessDistinct weights ⇒ unique MSTCut property: light edge is unique, so every safe edge is forced
Weight multisetAll MSTs have the same sorted weight sequenceMatroid theory / exchange argument
Cut propertyLight edge crossing a respecting cut is safeExchange: swap it in, swap out a heavier crossing edge
Cycle propertyHeaviest edge in a cycle is not in MST (distinct weights)Exchange: remove it, add a lighter cycle edge
BottleneckMST minimizes max-weight edge on pathsAny non-MST edge is heaviest in its fundamental cycle
V-1 edgesEvery MST has exactly V-1 edgesIt is a tree (connected + acyclic)
Adding non-MST edgeCreates exactly one cycleTree + one edge = exactly one cycle (tree property)
Max-weight edgeThe max-weight MST edge = (k-1)th heaviest for k=2 clustersClustering interpretation: MST gives optimal 2-clustering

Debugging Checklist for MST Implementations

When your MST code gives the wrong answer, check these in order:

1. Graph Connected?
If not, MST does not exist. Kruskal returns < V-1 edges. Prim misses vertices.
2. Edge Directions?
MST is for undirected graphs. If edges are directed, add both (u,v) and (v,u) to adjacency list.
3. 0-indexed vs 1-indexed?
Union-Find with n=5 but vertices labeled 1-5 needs parent array of size 6 (or adjust to 0-indexed).
4. Path Compression?
Missing path compression: correct but TLE. Missing union-by-rank: correct but TLE.
5. Stale PQ Entries?
In lazy Prim, forgetting if in_mst[u]: continue causes double-counting.
6. Edge Count?
MST has V-1 edges. If you return V edges, you have a cycle. If V-2, you have two components.

Pattern Recognition Guide

Signal in ProblemAlgorithm / Approach
"Connect all nodes with minimum cost"Direct MST (Kruskal or Prim)
"Minimum cost to make graph connected"MST on components
"Redundant connection" / "find the extra edge"Union-Find: edge that closes a cycle
"Number of connected components"Union-Find (or BFS/DFS)
"Group into k clusters minimizing max inter-cluster distance"MST + remove k-1 heaviest edges
"Can we connect all with cost ≤ budget?"MST weight ≤ budget
"Which edges are critical / must-use?"Try removing each edge, check MST weight

Complexity Decision Tree

When should you use which algorithm? Here is a decision tree for interviews:

Is the graph dense (E ≈ V2)?
YES → Prim with adjacency matrix, O(V2). NO → continue.
Do you have an edge list or adjacency list?
Edge list → Kruskal. Adjacency list → Prim with binary heap.
Need parallelism?
YES → Boruvka. NO → Kruskal (simpler to code).
Edge weights are small integers?
Consider radix sort + Kruskal for O(E + V) via counting sort.

See It: MST Weight Explorer

This canvas shows a fixed graph. Drag the slider to change one edge's weight and watch the MST change in real time. See how a single weight change can restructure the entire MST.

MST Edge Weight Sensitivity

Slide to change the weight of edge A-D. Watch which edges are in the MST.

w(A-D) 7

The Five Interview Dimensions for MST

DimensionWhat They AskStaff-Level Answer
Concept"Prove Kruskal is correct"Cut property + exchange argument. Show which cut Kruskal uses at each step.
Design"Design a network optimizer for 1M nodes"Kruskal with external sort + batched Union-Find. Discuss I/O complexity.
Code"Implement Kruskal from scratch"Union-Find with both optimizations, clean edge list, handle disconnected graphs.
Debug"MST returns wrong weight"Check: graph connected? Weights correct? Union-Find find() has path compression? Off-by-one in edge count?
Frontier"How would you MST on a billion-node graph?"Boruvka for parallelism. External-memory MST. Approximate MST via random sampling (Karger-Klein-Tarjan).
Staff-level question: You implement Kruskal but forget path compression in Union-Find. Everything else is correct. Does the MST come out wrong?

Chapter 9: Connections

MSTs sit at a crossroads of graph theory, greedy algorithms, and data structures. Here is how this chapter connects to the broader algorithmic landscape.

Where MST Fits

TopicRelationship to MSTLink
Graph Algorithms (Ch 22)MST requires graph representation (adjacency lists) and graph traversal (BFS/DFS for connectivity). MST IS a graph algorithm.Ch 22 lesson
Greedy (Ch 16)MST algorithms are the canonical examples of greedy correctness. The cut property is the greedy-choice property for MST. Activity selection is to scheduling as Kruskal is to MST.Ch 16 lesson
Shortest Paths (Ch 24)Prim looks like Dijkstra with one key difference: Prim's key[v] = min edge weight to tree, Dijkstra's d[v] = total distance from source. Same algorithm structure, different relaxation. MST ≠ shortest-path tree.Ch 24 lesson
Heaps (Ch 6)Prim's efficiency depends on the priority queue. Binary heap gives O((V+E) log V). Fibonacci heap gives O(E + V log V). Understanding heaps is prerequisite to understanding Prim's complexity.Ch 6 lesson

MST vs Shortest Paths: The Critical Distinction

This is the number one source of confusion and the most common interview follow-up. Consider a triangle with vertices A, B, C and edges A-B(1), B-C(1), A-C(3). The MST uses A-B(1) and B-C(1), total weight 2. The shortest path from A to C in the original graph is A-C(3), weight 3. But in the MST, the path from A to C goes A→B→C with weight 2. Sometimes the MST path is shorter, sometimes longer. The MST minimizes total wire, not individual paths.

Here is an important property that IS true: the MST path between any two vertices minimizes the bottleneck — the maximum edge weight along the path. The MST is a minimum bottleneck spanning tree. So while the MST path from u to v might be longer than the shortest path in total weight, the heaviest single edge on the MST path is as light as possible. This is why MSTs are useful in communication networks where the bottleneck link determines throughput.

Prim ≈ Dijkstra. The algorithmic structure is identical: extract-min from priority queue, update neighbors. The difference is ONE line. Prim: if w(u,v) < key[v] (compare edge weight). Dijkstra: if d[u] + w(u,v) < d[v] (compare path length). This single difference — edge weight vs cumulative distance — produces fundamentally different trees.

What We Did Not Cover

Matroid theory: MST can be generalized using matroids — a mathematical structure that captures exactly when greedy works. A matroid is a pair (S, I) where S is a finite set and I is a collection of "independent" subsets satisfying: (1) the empty set is independent, (2) subsets of independent sets are independent, (3) if A and B are independent and |A| < |B|, then some element of B can be added to A to keep it independent. The graphic matroid (where S = edges and independent sets = forests) explains MST. The greedy algorithm for matroids is exactly Kruskal's algorithm generalized.

Steiner trees: MST connects ALL vertices. What if you only need to connect a subset? This is the Steiner tree problem, and it is NP-hard. MST gives a 2-approximation for Steiner trees in general graphs. The problem arises in VLSI chip design (connect specific pins), network design (connect specific offices), and phylogenetics (connect specific species).

Dynamic MST: What if edges are added or deleted over time? Maintaining the MST dynamically is an active research area. The best known algorithms achieve O(log4 n) per update (Holm et al., 2001). For the insertion-only case (edges only added, never deleted), you can maintain the MST in O(α(n)) amortized per edge using the cycle property: when adding edge e, if it connects two components, add it. If it creates a cycle, remove the heaviest edge in the cycle (which might be e itself).

Randomized MST in O(V + E): Karger, Klein, and Tarjan (1995) showed that MST can be computed in expected O(V + E) time using random sampling. The algorithm: (1) sample edges with probability 1/2, (2) recursively find MST of the sample, (3) use the sample MST to identify "heavy" edges in the full graph that cannot be in the MST (an edge is heavy if it is the heaviest edge in its fundamental cycle with respect to the sample MST), (4) remove heavy edges and recurse. This is theoretically optimal but rarely used in practice due to high constant factors.

Minimum bottleneck spanning tree: Minimize the maximum-weight edge (rather than total weight). It turns out every MST is a minimum bottleneck spanning tree (but not vice versa). So computing the MST solves the bottleneck problem for free. This arises in network design where you want to minimize the worst link, not the total cost.

Second-best MST: Find the spanning tree with the smallest total weight that is NOT the MST. The algorithm: for each non-MST edge (u,v), find the heaviest edge on the MST path from u to v (using LCA queries in O(log V) after O(V) preprocessing). The second-best MST is obtained by swapping in the non-MST edge that gives the smallest increase: min over all non-MST edges of (w(u,v) - max-weight-on-MST-path(u,v)). Total time: O(V2) or O(E log V) with sparse table LCA.

MST in the Wild: Real Systems

DomainVerticesEdgesWeightScale
Internet backboneRoutersPhysical linksLatency or cost~100K vertices
Power gridSubstationsPossible transmission linesConstruction cost~50K vertices
VLSI routingPins/padsAll pairsWire length~10M vertices
PhylogeneticsSpecies/genesAll pairsGenetic distance~100K vertices
Social network clusteringUsersConnections1 / similarity~1B vertices

At billion-vertex scale, no single-machine algorithm suffices. Distributed MST uses Boruvka rounds across machines (each machine handles a partition of edges, communicates cheapest crossing edges). MapReduce MST and GraphX (Spark) implement this pattern.

Historical Timeline

YearWhoWhatComplexity
1926BoruvkaFirst MST algorithm (for electrical network design)O(E log V)
1930JarníkPrim's algorithm (independently discovered)O(V2)
1956KruskalSort-and-merge approachO(E log E)
1957PrimRediscovery of Jarník's algorithmO(V2)
1975TarjanUnion-Find with path compression + union by rankO(m α(n))
1987Fredman & TarjanFibonacci heaps → improved PrimO(E + V log V)
1995Karger, Klein, TarjanRandomized linear-time MSTO(V + E) expected
2000ChazelleDeterministic near-linear MSTO(E α(V))

The MST problem has driven fundamental advances in algorithm design for nearly a century. Each decade brought new data structures (Union-Find, Fibonacci heaps, soft heaps) motivated by the quest for faster MST algorithms.

Open Problem

Is there a deterministic O(V + E) algorithm for MST? Karger-Klein-Tarjan achieves this in expected time (randomized). Chazelle's deterministic algorithm achieves O(E α(V)), which is NEARLY linear but not quite. Closing this gap — or proving a lower bound showing O(V + E) is impossible deterministically — remains open. It is one of the most elegant unsolved problems in algorithms.

Key Relationships Diagram

Greedy (Ch 16)
Cut property proves greedy works for MST. Activity selection uses similar exchange argument.
MST (Ch 23) ← You Are Here
Kruskal (edge-centric), Prim (vertex-centric), Boruvka (parallel). All use cut property.
Shortest Paths (Ch 24)
Prim ≈ Dijkstra structurally. Different relaxation: edge weight vs cumulative distance.
Max Flow / Min Cut (Ch 26)
Cuts appear in both MST and flow. MST cut property is about minimum-weight crossing edges. Max-flow min-cut is about maximum capacity.
"The MST problem is the poster child for the greedy paradigm." — Jeff Erickson, Algorithms.

Summary: The MST Toolkit

ConceptOne-Sentence Summary
Spanning treeConnected, acyclic subgraph hitting all vertices — exactly V-1 edges.
Cut propertyThe lightest edge crossing a cut respecting A is safe to add to the MST.
Cycle propertyThe heaviest edge in any cycle is NOT in the MST (distinct weights).
KruskalSort edges, add if no cycle (Union-Find). O(E log E). Edge-centric.
PrimGrow tree, add cheapest frontier edge (priority queue). O((V+E) log V). Vertex-centric.
BoruvkaAll components simultaneously add cheapest outgoing edge. O(E log V). Parallelizable.
Union-FindTrack components with path compression + union by rank. O(α(n)) per op.
Final concept check: Prim's algorithm and Dijkstra's algorithm have almost identical code. What is the ONE critical difference?