CS 229s — Systems for Machine Learning

Cluster Scheduling

A thousand GPUs, a thousand jobs. FIFO wastes half your cluster. Heterogeneity-aware scheduling, placement sensitivity, and fairness — how to keep every GPU busy.

Prerequisites: Basic systems concepts. That's it.
9
Chapters
6+
Simulations
0
Assumed Knowledge

Chapter 0: Why Cluster Scheduling?

You work at a company training ML models. Your team has 8 GPUs. This morning, five researchers submit jobs: Alice wants to train a large language model (needs 4 GPUs, takes 6 hours). Bob needs to fine-tune a vision model (2 GPUs, 1 hour). Carol is running a hyperparameter sweep (1 GPU each, 30 minutes per run). Dave has a critical demo tomorrow (2 GPUs, 3 hours). Eve is testing a quick experiment (1 GPU, 15 minutes).

Who goes first? If you run them in the order they arrive (FIFO), Eve's 15-minute experiment might wait 6 hours behind Alice's giant job. If you prioritize short jobs, Alice's important LLM training never finishes. If you give everyone equal resources, nobody finishes on time.

This is the cluster scheduling problem: given a queue of jobs with different sizes, durations, and resource needs, decide which jobs run when and on which GPUs. Get it wrong and you waste millions of dollars in idle hardware. Get it right and the same cluster serves twice as many researchers.

The core tension: Every scheduling decision is a tradeoff. Minimize average wait time? Short jobs love you, long jobs starve. Maximize GPU utilization? You pack the cluster tight but nobody gets dedicated resources. Maximize fairness? Everyone waits equally long, but total throughput drops. There is no free lunch.
GPU Cluster Queue Simulator

Watch jobs arrive and compete for 4 GPUs. Orange bars = running. Gray bars = waiting in queue. Notice how long some jobs wait.

At scale, this problem is enormous. The Alibaba ML cluster analyzed in this lecture has 6,700+ GPUs across 1,800 servers handling 1.2 million jobs. Microsoft's Philly cluster processes thousands of deep learning jobs daily. Every minute a GPU sits idle costs real money. Every minute a researcher waits is lost productivity.

Check: What makes cluster scheduling for ML particularly hard?

Chapter 1: Scheduling Fundamentals

Before we optimize anything, let's nail down the basic architecture. Every cluster scheduler has three components: a job queue, a scheduler, and a GPU cluster.

Job Queue
Users submit training jobs. Each job specifies: model type, number of GPUs needed, estimated duration, priority.
Scheduler
Follows a policy (algorithm) to decide which job runs next and on which GPUs. This is the brain.
GPU Cluster
Physical hardware: servers with 2-8 GPUs each, connected by high-speed networks. Jobs execute here.

The scheduler's job is to optimize some objective. Here are the metrics we care about:

MetricDefinitionWhat it measures
Job Completion Time (JCT)finish_time − arrival_timeTotal time from submission to completion, including queue wait
Makespanlast_finish − first_arrivalTotal wall-clock time to complete a batch of jobs
Queueing Delaystart_time − arrival_timeHow long a job waits before it starts running
GPU Utilizationbusy_gpu_hours / total_gpu_hoursFraction of GPUs actually doing useful work
Throughputjobs_completed / timeHow many jobs finish per unit time
Key insight: JCT includes queueing delay. A job that takes 1 hour to run but waits 5 hours in the queue has a JCT of 6 hours. This is what users actually experience — they don't care whether they're waiting in queue or waiting for computation. They just want results.

The scheduler also faces constraints. A job requesting 4 GPUs can't run on a 2-GPU node. A distributed training job may need all its GPUs on the same physical server for fast communication. Some jobs have deadlines. Some have higher priority. The scheduler must satisfy all constraints while optimizing the objective.

What makes ML scheduling special?

Traditional job schedulers (like in HPC or cloud computing) handle generic workloads. ML training adds three unique wrinkles:

1. Long-running jobs. A single LLM training run can take days or weeks. You can't just wait for it to finish — other jobs pile up behind it.
2. GPU heterogeneity. Your cluster might have V100s, A100s, and H100s. A ResNet trains 2x faster on an A100 than a V100, but a small BERT model barely benefits. The "best" GPU depends on the job.
3. Large checkpoints. ML models have gigabytes of state (weights, optimizer state). Preempting a job means saving all that state to disk and reloading it later. This isn't free.
Check: What does Job Completion Time (JCT) measure?

Chapter 2: Scheduling Policies

A scheduling policy is the algorithm the scheduler uses to decide which job runs next. Let's walk through three fundamental policies with a concrete example. We have 2 GPUs (one job per GPU) and four jobs:

JobDurationArrival TimePriority
Job 03003 (low)
Job 11001 (high)
Job 22512 (medium)
Job 34044 (lowest)

Policy 1: First-Come First-Served (FCFS / FIFO)

The simplest policy: run jobs in the order they arrive. No preemption — once a job starts, it runs to completion. This is what most people do by default.

Let's trace through it step by step. At time 0, Jobs 0 and 1 arrive. Both GPUs are free, so both start immediately. At time 1, Job 2 arrives but both GPUs are busy — it waits. At time 10, Job 1 finishes (JCT = 10). Job 2 starts on the freed GPU. At time 30, Job 0 finishes (JCT = 30). Job 3 starts. At time 35, Job 2 finishes (JCT = 35 − 1 = 34). At time 70, Job 3 finishes (JCT = 70 − 4 = 66).

FIFO result: Average JCT = (30 + 10 + 34 + 66) / 4 = 35. Makespan = 70. Job 3 waited 26 time units in the queue before even starting. That's queueing delay — time wasted doing nothing.

Policy 2: Shortest Job First (SJF)

When a GPU becomes free, schedule the job with the shortest remaining duration. SJF can also preempt: if a shorter job arrives while a longer one is running, pause the longer job and run the shorter one first.

At time 0, Jobs 0 and 1 start (both GPUs free). At time 1, Job 2 arrives (duration 25). Job 0 has 29 remaining, which is longer than 25, so Job 2 preempts Job 0. At time 10, Job 1 finishes (JCT = 10). Job 0 resumes. At time 26, Job 2 finishes (JCT = 25). At time 39, Job 0 finishes (JCT = 39). At time 66, Job 3 finishes (JCT = 62).

SJF result: Average JCT = (39 + 10 + 25 + 62) / 4 = 34. Makespan = 66. Both improve over FIFO. SJF is provably optimal for minimizing average JCT when job durations are known. The catch? In practice, you rarely know exact durations upfront.

Policy 3: Priority-Based

Each job has a priority. When a GPU opens up (or a higher-priority job arrives), run the highest-priority job. This is useful when some jobs are more important than others — a production model retrain might trump an experimental notebook.

With our priority assignments (Job 1 = highest priority), the priority-based policy preempts aggressively. The result: Average JCT = 43.5, Makespan = 53.

Policy Comparison Simulator

Select a scheduling policy and watch the timeline unfold. Compare average JCT and makespan across policies.

PolicyAvg JCTMakespanProsCons
FIFO3570Simple, fair in arrival orderShort jobs wait behind long ones
SJF3466Optimal avg JCTRequires knowing job duration; long jobs starve
Priority43.553Important jobs finish fastLow-priority jobs may never run
Check: Why does SJF improve average JCT over FIFO?

Chapter 3: Production Traces

Theory is nice, but what do real ML clusters look like? Let's study a trace from Alibaba's production ML cluster, analyzed by Weng et al. (NSDI 2022). This cluster has 6,700+ GPUs (P100, V100, T4) across 1,800 servers, serving 1.2 million training and inference jobs.

What real jobs look like

The first surprise: job durations are heavy-tailed. Most jobs are short (minutes), but a small fraction run for days. This is exactly the scenario where FIFO fails badly — one long job blocks hundreds of short ones.

StatisticValue
Total jobs1.2 million
GPU typesP100, V100, T4
Servers1,800 (2 or 8 GPUs each)
Gang-scheduled jobs85%
Median job duration~minutes
99th percentile duration~days
Gang scheduling: 85% of jobs are gang-scheduled — all GPUs for a distributed training job must be allocated simultaneously. You can't give a 4-GPU job 2 GPUs now and 2 later. Either all 4 are available, or the job waits. This makes bin-packing much harder.

Queueing delay patterns

Two critical findings from the trace analysis:

Finding 1: Queueing delay increases as the resource requirement increases. A job requesting 1 GPU gets scheduled quickly. A job requesting 8 GPUs might wait hours because it needs an entire server to be free.

Finding 2: Queueing delay increases for more powerful GPU types. Everyone wants V100s, so the V100 queue is longer. P100s and T4s often sit idle even though they could run many jobs perfectly well.

Queueing Delay vs Resource Request

This chart shows how queueing delay grows with the number of GPUs requested. More GPUs = longer wait. Hover to see values.

Space sharing helps

Many ML jobs don't fully utilize their GPU. A small model might only use 30% of GPU memory and 40% of compute. Space sharing — packing multiple jobs on the same GPU — can dramatically improve utilization. The Alibaba trace showed that space sharing improved GPU utilization from ~50% to ~80%.

SJF works — if you can estimate durations

The researchers built a simple regression model to estimate job durations using features like the user tag, group tag, and number of GPUs requested. Even this rough estimate was enough for SJF to significantly reduce average JCT compared to FIFO. The lesson: you don't need perfect duration estimates — approximate ones already help.

The practical takeaway: Real clusters have heavy-tailed job distributions, gang scheduling constraints, and heterogeneous hardware. Simple FIFO is provably bad. Even a rough SJF (with estimated durations) beats FIFO significantly. But the real unlock comes from handling GPU heterogeneity — which is next.
Check: Why do jobs requesting more GPUs experience longer queueing delays?

Chapter 4: The Heterogeneity Problem

Here's an insight that changes everything: not every job benefits equally from faster GPUs. This sounds obvious, but most schedulers completely ignore it.

Consider two jobs: a large Transformer (LLM) and a small ResNet (image classifier). Benchmark them across three GPU generations:

JobK80 (old)P100 (mid)V100 (new)Speedup K80→V100
Large Transformer10 iter/s30 iter/s80 iter/s8x
Small ResNet50 iter/s65 iter/s70 iter/s1.4x

The Transformer gets an 8x speedup on a V100 vs a K80. The ResNet gets only a 1.4x speedup. If you have one V100 and one K80, a naive scheduler might assign both jobs randomly. But the smart move is obvious: put the Transformer on the V100 (where it benefits massively) and the ResNet on the K80 (where it barely suffers).

The heterogeneity insight: Throughput is not portable across GPU types. A job that runs at 80 iter/s on a V100 might run at 10 iter/s on a K80 — or it might run at 65 iter/s. The ratio depends entirely on the model architecture. Schedulers that ignore this leave massive performance on the table.

Why does this happen?

Different GPU generations improve different things. The V100 added Tensor Cores (huge for matrix multiplications in Transformers), more memory bandwidth, and faster interconnects. A Transformer is bottlenecked by matrix multiply speed, so it benefits enormously. A small ResNet is bottlenecked by memory transfer and small kernel launches — things that didn't improve as much.

More concretely, the throughput matrix T captures this. Row i is job i, column j is GPU type j, and Tij is the throughput (iterations/second) of job i on GPU type j:

T =
  [ Ttransformer, K80   Ttransformer, P100   Ttransformer, V100 ]
  [ Tresnet, K80       Tresnet, P100       Tresnet, V100 ]

Plugging in our numbers: T = [[10, 30, 80], [50, 65, 70]]. The shape of each row tells the story. The Transformer row grows steeply — it's highly sensitive to GPU type. The ResNet row is nearly flat — it's insensitive.

Throughput Sensitivity Explorer

Each bar group shows a job's throughput across GPU types. Tall spread = heterogeneity-sensitive. Flat = insensitive. The sensitive job should get the fast GPU.

Transformer V100 throughput80
ResNet V100 throughput70

This is the foundation of Gavel, the heterogeneity-aware scheduler we'll study next. Instead of treating all GPUs as interchangeable, Gavel uses the throughput matrix to make smarter assignments.

Check: You have one V100 and one K80. A Transformer gets 80 iter/s on V100 and 10 on K80. A ResNet gets 70 on V100 and 50 on K80. Which assignment maximizes total throughput?

Chapter 5: Gavel — Heterogeneity-Aware Scheduling

Gavel (Narayanan et al., OSDI 2020) is a cluster scheduler that makes all existing scheduling policies heterogeneity-aware. The key idea: express scheduling as an optimization problem over an allocation matrix, using the throughput matrix to make GPU-type-aware decisions.

The allocation matrix

Suppose we have 3 GPU types (K80, P100, V100) and n jobs. The allocation matrix X is n × 3, where Xij is the fraction of time job i spends on GPU type j. Each row sums to at most 1 (a job can't use more than 100% of its time). Each column's total can't exceed the number of available GPUs of that type.

Xij = fraction of time job i spends on GPU type j

Constraints:   ∑j Xij ≤ 1,   ∑i Xij · scalei ≤ Nj

The scale factor accounts for multi-GPU jobs: if a job needs 4 GPUs, each unit of allocation consumes 4 GPUs worth of cluster capacity.

Effective throughput

Given the throughput matrix T and allocation matrix X, the effective throughput of job m is:

effective_throughput(m) = ∑j Tmj · Xmj

This is the throughput-weighted allocation — the total iterations/second job m achieves across all GPU types it uses. A job spending 60% of its time on V100s and 40% on K80s gets 0.6 × TV100 + 0.4 × TK80.

Maximizing total throughput

The simplest Gavel objective: maximize total effective throughput across all jobs.

maximize ∑mj Tmj · Xmj

subject to:   0 ≤ Xij ≤ 1,   ∑j Xij ≤ 1,   ∑i Xij · scalei ≤ Nj

This is a linear program — solvable in milliseconds with off-the-shelf solvers like cvxpy. The throughput matrix T is measured by profiling each job on each GPU type (or estimated from prior runs).

Other objectives as linear programs

The power of Gavel is that any scheduling policy can be expressed this way:

ObjectiveOptimization
Max throughputmaximize ∑m,j Tmj · Xmj
Max-min fairnessmaximize minm (effective_throughput(m) / equal_share_throughput(m))
Minimize SJFminimize runtime of job with shortest remaining iterations / effective_throughput
Minimize makespanminimize runtime of job with longest remaining time
The Gavel result: Heterogeneity-aware scheduling improves average JCT by up to 3.5x compared to heterogeneity-agnostic baselines. This is not a small improvement — it's the difference between your job finishing in 2 hours vs 7 hours.
Gavel Assignment Simulator

Drag jobs to GPU types and see total effective throughput change. Try to maximize it! The optimal assignment is shown as a target. Teal = your current throughput. Orange line = optimal.

Worked example

Suppose we have 2 K80s, 1 P100, and 1 V100. Three jobs, each needing 1 GPU:

K80P100V100
Job A (Transformer)103080
Job B (ResNet)506570
Job C (GAN)205560

Naive assignment (round-robin): A→K80 (10), B→P100 (65), C→V100 (60). Total = 135.

Gavel-optimal: A→V100 (80), B→K80 (50), C→P100 (55). Total = 185. That's a 37% improvement just by being smart about assignment.

Why? The Transformer benefits enormously from V100 (8x over K80). The ResNet barely cares (1.4x). So give the V100 to the Transformer and the cheap K80 to the ResNet.

Check: What does Gavel's throughput matrix T capture?

Chapter 6: Placement Sensitivity

GPU scheduling isn't just about which type of GPU — it's also about where those GPUs are physically located. This is placement sensitivity.

Why placement matters

A distributed training job with 4 GPUs needs those GPUs to communicate during every training step (gradient all-reduce). If all 4 GPUs are on the same physical server, they communicate over NVLink — 600 GB/s of bandwidth. If they're split across two servers, they communicate over the network — maybe 25 GB/s (InfiniBand) or 10 GB/s (ethernet). That's a 24-60x difference.

PlacementInterconnectBandwidthGradient sync time (1 GB)
Same server (consolidated)NVLink~600 GB/s~1.7 ms
Same rack (distributed)InfiniBand~25 GB/s~40 ms
Cross-rackEthernet~10 GB/s~100 ms
The placement tradeoff: Consolidated placement (all GPUs on one server) is faster for communication-heavy jobs. But it's harder to find — you need a server with enough free GPUs. Distributed placement is easier to find but slower. The scheduler must decide: wait for consolidated placement, or start now with distributed?

Communication-bound vs compute-bound

Not every job cares about placement equally:

Communication-heavy jobs (large models with big gradient syncs): placement matters a LOT. A Transformer with 1B parameters has ~4 GB of gradients to sync every step. On NVLink, that's 6.7 ms. Cross-rack, that's 400 ms. Training time could double or triple with bad placement.

Compute-heavy jobs (small models, large batch sizes): placement barely matters. If each step takes 500 ms of computation and only 5 ms of communication, even a 10x slowdown in communication only adds 45 ms — a 9% increase.

Placement Impact Calculator

Adjust compute time and gradient size to see how placement affects total step time. Communication-heavy jobs (large gradient, small compute) suffer most from bad placement.

Compute time (ms)100
Gradient size (GB)1.0

Gavel incorporates placement

Gavel can extend the throughput matrix to include placement information. Instead of just Tij = throughput of job i on GPU type j, we can have Tijk = throughput of job i on GPU type j with placement type k (consolidated vs distributed). This lets the optimizer make placement-aware decisions automatically.

Check: A job has 200ms compute per step and syncs 0.5 GB of gradients. Would it benefit significantly from consolidated placement (NVLink at 600 GB/s vs ethernet at 10 GB/s)?

Chapter 7: Fairness & Objectives

So far we've focused on throughput and JCT. But what about fairness? In a shared cluster, multiple teams submit jobs. If the scheduler always favors Team A's jobs (because they're shorter or higher-priority), Team B's jobs starve. This breeds resentment and politics.

Max-min fairness

The standard fairness notion in scheduling is max-min fairness: maximize the minimum allocation any job receives. In other words, make the worst-off job as well-off as possible.

In a homogeneous cluster, this is simple: give every job an equal share of GPUs. With 100 GPUs and 10 jobs, each gets 10. But in a heterogeneous cluster, "equal share" is ambiguous — equal share of what? GPU-hours? Throughput? Iterations?

Gavel's fairness formulation: Gavel normalizes each job's effective throughput by what it would get with an equal share of all resources. This makes throughputs comparable across jobs with different computational characteristics. The objective is: maximize minm (effective_throughput(m) / equal_share_throughput(m)).

Finish-time fairness

An alternative: finish-time fairness ensures all jobs finish at roughly the same time (proportional to their total work). This is different from throughput fairness — a job with 2x the work should take 2x as long, not get 2x the resources.

The utilization-fairness tradeoff

Here's the fundamental tension: maximizing utilization and maximizing fairness often conflict.

Example: You have 2 V100s and 2 K80s. Job A is a Transformer (V100: 80 iter/s, K80: 10). Job B is a ResNet (V100: 70, K80: 50). Maximum throughput assigns both V100s to A and both K80s to B. Total: 160+100 = 260. But Job B gets only K80s while Job A gets the premium hardware — that's unfair.

Fair assignment: give each job one V100 and one K80. Job A: 80+10 = 90. Job B: 70+50 = 120. Total: 210. We sacrificed 19% throughput for fairness.

Fairness vs Throughput Explorer

The slider moves between max-throughput (left) and max-fairness (right). Watch how total throughput drops as fairness improves. The sweet spot depends on your organization's values.

Fairness weight0

Hierarchical scheduling

Large organizations often have a hierarchy: company → division → team. Each level may have different objectives. The company wants utilization. Each division wants fairness among its teams. Each team wants shortest JCT for its jobs. Gavel supports this by composing objective functions at each level of the hierarchy.

Check: Why does maximizing throughput conflict with fairness in a heterogeneous cluster?

Chapter 8: Connections

Cluster scheduling is a living field. The techniques we've covered — FIFO, SJF, heterogeneity-aware allocation, placement sensitivity, fairness — are the foundation. Here's where the frontier is heading.

Advanced topics

TopicKey IdeaPaper
Elastic schedulingDynamically scale GPUs per job up/down based on cluster load. Pollux (OSDI '21) does this heterogeneity-agnostic; Sia (SOSP '23) combines elasticity with heterogeneity.Sia, SOSP 2023
Space sharingPack multiple jobs on one GPU. TGS intercepts CUDA kernels to manage contention.TGS, NSDI 2023
Transparent preemptionSingularity uses a device proxy at the OS level to checkpoint/migrate jobs without user code changes.Singularity, 2022
Gang schedulingAll GPUs for a distributed job must be allocated at once. Makes bin-packing harder but required for synchronous training.Alibaba trace, NSDI 2022
Spot/preemptible instancesUse cheap preemptible GPUs for fault-tolerant jobs. Requires robust checkpointing.Various cloud providers

Cheat sheet

ConceptOne-liner
JCTfinish_time − arrival_time (includes queue wait)
FIFOSimple, but short jobs wait behind long ones
SJFOptimal avg JCT, but needs duration estimates
Throughput matrix TTij = iter/s of job i on GPU type j
Allocation matrix XXij = fraction of time job i on GPU type j
Effective throughputj Tij · Xij
GavelExpress any policy as LP over X using T
PlacementCo-located GPUs → fast NVLink; split GPUs → slow network
Max-min fairnessMaximize the minimum normalized throughput
Gang schedulingAll GPUs allocated simultaneously or not at all
The big picture: Cluster scheduling for ML is a multi-objective optimization problem over a heterogeneous, topology-aware resource graph. Simple policies (FIFO) leave up to 3.5x performance on the table. The key insight from Gavel is that by expressing policies as optimization problems over allocation matrices, you can make any policy heterogeneity-aware with a straightforward LP. The field is now moving toward combining heterogeneity, elasticity, and topology awareness simultaneously.

Key papers

1. Narayanan et al. "Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads." OSDI 2020.
2. Weng et al. "MLaaS in the Wild: Workload Analysis and Scheduling in Large-Scale Heterogeneous GPU Clusters." NSDI 2022.
3. Subramanya et al. "Sia: Heterogeneity-aware, goodput-optimized ML-cluster scheduling." SOSP 2023.
4. Wu and Zhang et al. "Transparent GPU Sharing in Container Clouds for Deep Learning Workloads." NSDI 2023.
5. Shukla et al. "Singularity: Planet-Scale, Preemptive and Elastic Scheduling of AI Workloads." 2022.

Check: What is the main advantage of Gavel's LP-based approach to scheduling?