Cluster Scheduling for ML Workloads

Chapter 0: Why Cluster Scheduling?

You work at a company training ML models. Your team has 8 GPUs. This morning, five researchers submit jobs: Alice wants to train a large language model (needs 4 GPUs, takes 6 hours). Bob needs to fine-tune a vision model (2 GPUs, 1 hour). Carol is running a hyperparameter sweep (1 GPU each, 30 minutes per run). Dave has a critical demo tomorrow (2 GPUs, 3 hours). Eve is testing a quick experiment (1 GPU, 15 minutes).

Who goes first? If you run them in the order they arrive (FIFO), Eve's 15-minute experiment might wait 6 hours behind Alice's giant job. If you prioritize short jobs, Alice's important LLM training never finishes. If you give everyone equal resources, nobody finishes on time.

This is the cluster scheduling problem: given a queue of jobs with different sizes, durations, and resource needs, decide which jobs run when and on which GPUs. Get it wrong and you waste millions of dollars in idle hardware. Get it right and the same cluster serves twice as many researchers.

The core tension: Every scheduling decision is a tradeoff. Minimize average wait time? Short jobs love you, long jobs starve. Maximize GPU utilization? You pack the cluster tight but nobody gets dedicated resources. Maximize fairness? Everyone waits equally long, but total throughput drops. There is no free lunch.

GPU Cluster Queue Simulator

Watch jobs arrive and compete for 4 GPUs. Orange bars = running. Gray bars = waiting in queue. Notice how long some jobs wait.

At scale, this problem is enormous. The Alibaba ML cluster analyzed in this lecture has 6,700+ GPUs across 1,800 servers handling 1.2 million jobs. Microsoft's Philly cluster processes thousands of deep learning jobs daily. Every minute a GPU sits idle costs real money. Every minute a researcher waits is lost productivity.

Check: What makes cluster scheduling for ML particularly hard?

ML jobs are all the same size GPUs are too fast for scheduling to matter Jobs have wildly different sizes, durations, and resource needs, forcing hard tradeoffs

Chapter 1: Scheduling Fundamentals

Before we optimize anything, let's nail down the basic architecture. Every cluster scheduler has three components: a job queue, a scheduler, and a GPU cluster.

Job Queue

Users submit training jobs. Each job specifies: model type, number of GPUs needed, estimated duration, priority.

↓

Scheduler

Follows a policy (algorithm) to decide which job runs next and on which GPUs. This is the brain.

↓

GPU Cluster

Physical hardware: servers with 2-8 GPUs each, connected by high-speed networks. Jobs execute here.

The scheduler's job is to optimize some objective. Here are the metrics we care about:

Metric	Definition	What it measures
Job Completion Time (JCT)	finish_time − arrival_time	Total time from submission to completion, including queue wait
Makespan	last_finish − first_arrival	Total wall-clock time to complete a batch of jobs
Queueing Delay	start_time − arrival_time	How long a job waits before it starts running
GPU Utilization	busy_gpu_hours / total_gpu_hours	Fraction of GPUs actually doing useful work
Throughput	jobs_completed / time	How many jobs finish per unit time

Key insight: JCT includes queueing delay. A job that takes 1 hour to run but waits 5 hours in the queue has a JCT of 6 hours. This is what users actually experience — they don't care whether they're waiting in queue or waiting for computation. They just want results.

The scheduler also faces constraints. A job requesting 4 GPUs can't run on a 2-GPU node. A distributed training job may need all its GPUs on the same physical server for fast communication. Some jobs have deadlines. Some have higher priority. The scheduler must satisfy all constraints while optimizing the objective.

What makes ML scheduling special?

Traditional job schedulers (like in HPC or cloud computing) handle generic workloads. ML training adds three unique wrinkles:

1. Long-running jobs. A single LLM training run can take days or weeks. You can't just wait for it to finish — other jobs pile up behind it.

2. GPU heterogeneity. Your cluster might have V100s, A100s, and H100s. A ResNet trains 2x faster on an A100 than a V100, but a small BERT model barely benefits. The "best" GPU depends on the job.

3. Large checkpoints. ML models have gigabytes of state (weights, optimizer state). Preempting a job means saving all that state to disk and reloading it later. This isn't free.

Check: What does Job Completion Time (JCT) measure?

Only the time the job spends actually running on GPUs Total time from job submission to completion, including queue wait The time it takes to upload the model to the cluster

Chapter 2: Scheduling Policies

A scheduling policy is the algorithm the scheduler uses to decide which job runs next. Let's walk through three fundamental policies with a concrete example. We have 2 GPUs (one job per GPU) and four jobs:

Job	Duration	Arrival Time	Priority
Job 0	30	0	3 (low)
Job 1	10	0	1 (high)
Job 2	25	1	2 (medium)
Job 3	40	4	4 (lowest)

Policy 1: First-Come First-Served (FCFS / FIFO)

The simplest policy: run jobs in the order they arrive. No preemption — once a job starts, it runs to completion. This is what most people do by default.

Let's trace through it step by step. At time 0, Jobs 0 and 1 arrive. Both GPUs are free, so both start immediately. At time 1, Job 2 arrives but both GPUs are busy — it waits. At time 10, Job 1 finishes (JCT = 10). Job 2 starts on the freed GPU. At time 30, Job 0 finishes (JCT = 30). Job 3 starts. At time 35, Job 2 finishes (JCT = 35 − 1 = 34). At time 70, Job 3 finishes (JCT = 70 − 4 = 66).

FIFO result: Average JCT = (30 + 10 + 34 + 66) / 4 = 35. Makespan = 70. Job 3 waited 26 time units in the queue before even starting. That's queueing delay — time wasted doing nothing.

Policy 2: Shortest Job First (SJF)

When a GPU becomes free, schedule the job with the shortest remaining duration. SJF can also preempt: if a shorter job arrives while a longer one is running, pause the longer job and run the shorter one first.

At time 0, Jobs 0 and 1 start (both GPUs free). At time 1, Job 2 arrives (duration 25). Job 0 has 29 remaining, which is longer than 25, so Job 2 preempts Job 0. At time 10, Job 1 finishes (JCT = 10). Job 0 resumes. At time 26, Job 2 finishes (JCT = 25). At time 39, Job 0 finishes (JCT = 39). At time 66, Job 3 finishes (JCT = 62).

SJF result: Average JCT = (39 + 10 + 25 + 62) / 4 = 34. Makespan = 66. Both improve over FIFO. SJF is provably optimal for minimizing average JCT when job durations are known. The catch? In practice, you rarely know exact durations upfront.

Policy 3: Priority-Based

Each job has a priority. When a GPU opens up (or a higher-priority job arrives), run the highest-priority job. This is useful when some jobs are more important than others — a production model retrain might trump an experimental notebook.

With our priority assignments (Job 1 = highest priority), the priority-based policy preempts aggressively. The result: Average JCT = 43.5, Makespan = 53.

Policy Comparison Simulator

Select a scheduling policy and watch the timeline unfold. Compare average JCT and makespan across policies.

Policy	Avg JCT	Makespan	Pros	Cons
FIFO	35	70	Simple, fair in arrival order	Short jobs wait behind long ones
SJF	34	66	Optimal avg JCT	Requires knowing job duration; long jobs starve
Priority	43.5	53	Important jobs finish fast	Low-priority jobs may never run

Check: Why does SJF improve average JCT over FIFO?

Short jobs finish quickly, reducing the average wait for everyone behind them It uses more GPUs It makes jobs run faster on the hardware

Chapter 3: Production Traces

Theory is nice, but what do real ML clusters look like? Let's study a trace from Alibaba's production ML cluster, analyzed by Weng et al. (NSDI 2022). This cluster has 6,700+ GPUs (P100, V100, T4) across 1,800 servers, serving 1.2 million training and inference jobs.

What real jobs look like

The first surprise: job durations are heavy-tailed. Most jobs are short (minutes), but a small fraction run for days. This is exactly the scenario where FIFO fails badly — one long job blocks hundreds of short ones.

Statistic	Value
Total jobs	1.2 million
GPU types	P100, V100, T4
Servers	1,800 (2 or 8 GPUs each)
Gang-scheduled jobs	85%
Median job duration	~minutes
99th percentile duration	~days

Gang scheduling: 85% of jobs are gang-scheduled — all GPUs for a distributed training job must be allocated simultaneously. You can't give a 4-GPU job 2 GPUs now and 2 later. Either all 4 are available, or the job waits. This makes bin-packing much harder.

Queueing delay patterns

Two critical findings from the trace analysis:

Finding 1: Queueing delay increases as the resource requirement increases. A job requesting 1 GPU gets scheduled quickly. A job requesting 8 GPUs might wait hours because it needs an entire server to be free.

Finding 2: Queueing delay increases for more powerful GPU types. Everyone wants V100s, so the V100 queue is longer. P100s and T4s often sit idle even though they could run many jobs perfectly well.

Queueing Delay vs Resource Request

This chart shows how queueing delay grows with the number of GPUs requested. More GPUs = longer wait. Hover to see values.

Space sharing helps

Many ML jobs don't fully utilize their GPU. A small model might only use 30% of GPU memory and 40% of compute. Space sharing — packing multiple jobs on the same GPU — can dramatically improve utilization. The Alibaba trace showed that space sharing improved GPU utilization from ~50% to ~80%.

SJF works — if you can estimate durations

The researchers built a simple regression model to estimate job durations using features like the user tag, group tag, and number of GPUs requested. Even this rough estimate was enough for SJF to significantly reduce average JCT compared to FIFO. The lesson: you don't need perfect duration estimates — approximate ones already help.

The practical takeaway: Real clusters have heavy-tailed job distributions, gang scheduling constraints, and heterogeneous hardware. Simple FIFO is provably bad. Even a rough SJF (with estimated durations) beats FIFO significantly. But the real unlock comes from handling GPU heterogeneity — which is next.

Check: Why do jobs requesting more GPUs experience longer queueing delays?

More GPUs make the job run slower The scheduler must find enough free GPUs simultaneously, which is harder for larger requests Large jobs are always low priority

Chapter 4: The Heterogeneity Problem

Here's an insight that changes everything: not every job benefits equally from faster GPUs. This sounds obvious, but most schedulers completely ignore it.

Consider two jobs: a large Transformer (LLM) and a small ResNet (image classifier). Benchmark them across three GPU generations:

Job	K80 (old)	P100 (mid)	V100 (new)	Speedup K80→V100
Large Transformer	10 iter/s	30 iter/s	80 iter/s	8x
Small ResNet	50 iter/s	65 iter/s	70 iter/s	1.4x

The Transformer gets an 8x speedup on a V100 vs a K80. The ResNet gets only a 1.4x speedup. If you have one V100 and one K80, a naive scheduler might assign both jobs randomly. But the smart move is obvious: put the Transformer on the V100 (where it benefits massively) and the ResNet on the K80 (where it barely suffers).

The heterogeneity insight: Throughput is not portable across GPU types. A job that runs at 80 iter/s on a V100 might run at 10 iter/s on a K80 — or it might run at 65 iter/s. The ratio depends entirely on the model architecture. Schedulers that ignore this leave massive performance on the table.

Why does this happen?

Different GPU generations improve different things. The V100 added Tensor Cores (huge for matrix multiplications in Transformers), more memory bandwidth, and faster interconnects. A Transformer is bottlenecked by matrix multiply speed, so it benefits enormously. A small ResNet is bottlenecked by memory transfer and small kernel launches — things that didn't improve as much.

More concretely, the throughput matrix T captures this. Row i is job i, column j is GPU type j, and T_ij is the throughput (iterations/second) of job i on GPU type j:

T =
[ T_{transformer, K80} T_{transformer, P100} T_{transformer, V100} ]
[ T_{resnet, K80} T_{resnet, P100} T_{resnet, V100} ]

Plugging in our numbers: T = [[10, 30, 80], [50, 65, 70]]. The shape of each row tells the story. The Transformer row grows steeply — it's highly sensitive to GPU type. The ResNet row is nearly flat — it's insensitive.

Throughput Sensitivity Explorer

Each bar group shows a job's throughput across GPU types. Tall spread = heterogeneity-sensitive. Flat = insensitive. The sensitive job should get the fast GPU.

Transformer V100 throughput80

ResNet V100 throughput70

This is the foundation of Gavel, the heterogeneity-aware scheduler we'll study next. Instead of treating all GPUs as interchangeable, Gavel uses the throughput matrix to make smarter assignments.

Check: You have one V100 and one K80. A Transformer gets 80 iter/s on V100 and 10 on K80. A ResNet gets 70 on V100 and 50 on K80. Which assignment maximizes total throughput?

Transformer on V100, ResNet on K80 (total: 80 + 50 = 130) Transformer on K80, ResNet on V100 (total: 10 + 70 = 80) It doesn't matter — total throughput is the same

Chapter 5: Gavel — Heterogeneity-Aware Scheduling

Gavel (Narayanan et al., OSDI 2020) is a cluster scheduler that makes all existing scheduling policies heterogeneity-aware. The key idea: express scheduling as an optimization problem over an allocation matrix, using the throughput matrix to make GPU-type-aware decisions.

The allocation matrix

Suppose we have 3 GPU types (K80, P100, V100) and n jobs. The allocation matrix X is n × 3, where X_ij is the fraction of time job i spends on GPU type j. Each row sums to at most 1 (a job can't use more than 100% of its time). Each column's total can't exceed the number of available GPUs of that type.

X_ij = fraction of time job i spends on GPU type j

Constraints: ∑_j X_ij ≤ 1, ∑_i X_ij · scale_i ≤ N_j

The scale factor accounts for multi-GPU jobs: if a job needs 4 GPUs, each unit of allocation consumes 4 GPUs worth of cluster capacity.

Effective throughput

Given the throughput matrix T and allocation matrix X, the effective throughput of job m is:

effective_throughput(m) = ∑_j T_mj · X_mj

This is the throughput-weighted allocation — the total iterations/second job m achieves across all GPU types it uses. A job spending 60% of its time on V100s and 40% on K80s gets 0.6 × T_V100 + 0.4 × T_K80.

Maximizing total throughput

The simplest Gavel objective: maximize total effective throughput across all jobs.

maximize ∑_m ∑_j T_mj · X_mj

subject to: 0 ≤ X_ij ≤ 1, ∑_j X_ij ≤ 1, ∑_i X_ij · scale_i ≤ N_j

This is a linear program — solvable in milliseconds with off-the-shelf solvers like cvxpy. The throughput matrix T is measured by profiling each job on each GPU type (or estimated from prior runs).

Other objectives as linear programs

The power of Gavel is that any scheduling policy can be expressed this way:

Objective	Optimization
Max throughput	maximize ∑_m,j T_mj · X_mj
Max-min fairness	maximize min_m (effective_throughput(m) / equal_share_throughput(m))
Minimize SJF	minimize runtime of job with shortest remaining iterations / effective_throughput
Minimize makespan	minimize runtime of job with longest remaining time

The Gavel result: Heterogeneity-aware scheduling improves average JCT by up to 3.5x compared to heterogeneity-agnostic baselines. This is not a small improvement — it's the difference between your job finishing in 2 hours vs 7 hours.

Gavel Assignment Simulator

Drag jobs to GPU types and see total effective throughput change. Try to maximize it! The optimal assignment is shown as a target. Teal = your current throughput. Orange line = optimal.

Worked example

Suppose we have 2 K80s, 1 P100, and 1 V100. Three jobs, each needing 1 GPU:

	K80	P100	V100
Job A (Transformer)	10	30	80
Job B (ResNet)	50	65	70
Job C (GAN)	20	55	60

Naive assignment (round-robin): A→K80 (10), B→P100 (65), C→V100 (60). Total = 135.

Gavel-optimal: A→V100 (80), B→K80 (50), C→P100 (55). Total = 185. That's a 37% improvement just by being smart about assignment.

Why? The Transformer benefits enormously from V100 (8x over K80). The ResNet barely cares (1.4x). So give the V100 to the Transformer and the cheap K80 to the ResNet.

Check: What does Gavel's throughput matrix T capture?

The cost of each GPU type The throughput (iter/s) of each job on each GPU type — capturing heterogeneity The number of GPUs available of each type

Chapter 6: Placement Sensitivity

GPU scheduling isn't just about which type of GPU — it's also about where those GPUs are physically located. This is placement sensitivity.

Why placement matters

A distributed training job with 4 GPUs needs those GPUs to communicate during every training step (gradient all-reduce). If all 4 GPUs are on the same physical server, they communicate over NVLink — 600 GB/s of bandwidth. If they're split across two servers, they communicate over the network — maybe 25 GB/s (InfiniBand) or 10 GB/s (ethernet). That's a 24-60x difference.

Placement	Interconnect	Bandwidth	Gradient sync time (1 GB)
Same server (consolidated)	NVLink	~600 GB/s	~1.7 ms
Same rack (distributed)	InfiniBand	~25 GB/s	~40 ms
Cross-rack	Ethernet	~10 GB/s	~100 ms

The placement tradeoff: Consolidated placement (all GPUs on one server) is faster for communication-heavy jobs. But it's harder to find — you need a server with enough free GPUs. Distributed placement is easier to find but slower. The scheduler must decide: wait for consolidated placement, or start now with distributed?

Communication-bound vs compute-bound

Not every job cares about placement equally:

Communication-heavy jobs (large models with big gradient syncs): placement matters a LOT. A Transformer with 1B parameters has ~4 GB of gradients to sync every step. On NVLink, that's 6.7 ms. Cross-rack, that's 400 ms. Training time could double or triple with bad placement.

Compute-heavy jobs (small models, large batch sizes): placement barely matters. If each step takes 500 ms of computation and only 5 ms of communication, even a 10x slowdown in communication only adds 45 ms — a 9% increase.

Placement Impact Calculator

Adjust compute time and gradient size to see how placement affects total step time. Communication-heavy jobs (large gradient, small compute) suffer most from bad placement.

Compute time (ms)100

Gradient size (GB)1.0

Gavel incorporates placement

Gavel can extend the throughput matrix to include placement information. Instead of just T_ij = throughput of job i on GPU type j, we can have T_ijk = throughput of job i on GPU type j with placement type k (consolidated vs distributed). This lets the optimizer make placement-aware decisions automatically.

Check: A job has 200ms compute per step and syncs 0.5 GB of gradients. Would it benefit significantly from consolidated placement (NVLink at 600 GB/s vs ethernet at 10 GB/s)?

Yes — communication dominates No — 0.5/600=0.8ms vs 0.5/10=50ms. Compute is 200ms, so the slowdown is modest (200.8ms vs 250ms, ~25%) Placement never matters for gradients under 1 GB

Chapter 7: Fairness & Objectives

So far we've focused on throughput and JCT. But what about fairness? In a shared cluster, multiple teams submit jobs. If the scheduler always favors Team A's jobs (because they're shorter or higher-priority), Team B's jobs starve. This breeds resentment and politics.

Max-min fairness

The standard fairness notion in scheduling is max-min fairness: maximize the minimum allocation any job receives. In other words, make the worst-off job as well-off as possible.

In a homogeneous cluster, this is simple: give every job an equal share of GPUs. With 100 GPUs and 10 jobs, each gets 10. But in a heterogeneous cluster, "equal share" is ambiguous — equal share of what? GPU-hours? Throughput? Iterations?

Gavel's fairness formulation: Gavel normalizes each job's effective throughput by what it would get with an equal share of all resources. This makes throughputs comparable across jobs with different computational characteristics. The objective is: maximize min_m (effective_throughput(m) / equal_share_throughput(m)).

Finish-time fairness

An alternative: finish-time fairness ensures all jobs finish at roughly the same time (proportional to their total work). This is different from throughput fairness — a job with 2x the work should take 2x as long, not get 2x the resources.

The utilization-fairness tradeoff

Here's the fundamental tension: maximizing utilization and maximizing fairness often conflict.

Example: You have 2 V100s and 2 K80s. Job A is a Transformer (V100: 80 iter/s, K80: 10). Job B is a ResNet (V100: 70, K80: 50). Maximum throughput assigns both V100s to A and both K80s to B. Total: 160+100 = 260. But Job B gets only K80s while Job A gets the premium hardware — that's unfair.

Fair assignment: give each job one V100 and one K80. Job A: 80+10 = 90. Job B: 70+50 = 120. Total: 210. We sacrificed 19% throughput for fairness.

Fairness vs Throughput Explorer

The slider moves between max-throughput (left) and max-fairness (right). Watch how total throughput drops as fairness improves. The sweet spot depends on your organization's values.

Fairness weight0

Hierarchical scheduling

Large organizations often have a hierarchy: company → division → team. Each level may have different objectives. The company wants utilization. Each division wants fairness among its teams. Each team wants shortest JCT for its jobs. Gavel supports this by composing objective functions at each level of the hierarchy.

Check: Why does maximizing throughput conflict with fairness in a heterogeneous cluster?

Max throughput gives premium GPUs to jobs that benefit most, which means other jobs get worse hardware Fairness requires using fewer GPUs There is no conflict — both can always be maximized simultaneously

Chapter 8: Connections

Cluster scheduling is a living field. The techniques we've covered — FIFO, SJF, heterogeneity-aware allocation, placement sensitivity, fairness — are the foundation. Here's where the frontier is heading.

Advanced topics

Topic	Key Idea	Paper
Elastic scheduling	Dynamically scale GPUs per job up/down based on cluster load. Pollux (OSDI '21) does this heterogeneity-agnostic; Sia (SOSP '23) combines elasticity with heterogeneity.	Sia, SOSP 2023
Space sharing	Pack multiple jobs on one GPU. TGS intercepts CUDA kernels to manage contention.	TGS, NSDI 2023
Transparent preemption	Singularity uses a device proxy at the OS level to checkpoint/migrate jobs without user code changes.	Singularity, 2022
Gang scheduling	All GPUs for a distributed job must be allocated at once. Makes bin-packing harder but required for synchronous training.	Alibaba trace, NSDI 2022
Spot/preemptible instances	Use cheap preemptible GPUs for fault-tolerant jobs. Requires robust checkpointing.	Various cloud providers

Cheat sheet

Concept	One-liner
JCT	finish_time − arrival_time (includes queue wait)
FIFO	Simple, but short jobs wait behind long ones
SJF	Optimal avg JCT, but needs duration estimates
Throughput matrix T	T_ij = iter/s of job i on GPU type j
Allocation matrix X	X_ij = fraction of time job i on GPU type j
Effective throughput	∑_j T_ij · X_ij
Gavel	Express any policy as LP over X using T
Placement	Co-located GPUs → fast NVLink; split GPUs → slow network
Max-min fairness	Maximize the minimum normalized throughput
Gang scheduling	All GPUs allocated simultaneously or not at all

The big picture: Cluster scheduling for ML is a multi-objective optimization problem over a heterogeneous, topology-aware resource graph. Simple policies (FIFO) leave up to 3.5x performance on the table. The key insight from Gavel is that by expressing policies as optimization problems over allocation matrices, you can make any policy heterogeneity-aware with a straightforward LP. The field is now moving toward combining heterogeneity, elasticity, and topology awareness simultaneously.

Key papers

1. Narayanan et al. "Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads." OSDI 2020.
2. Weng et al. "MLaaS in the Wild: Workload Analysis and Scheduling in Large-Scale Heterogeneous GPU Clusters." NSDI 2022.
3. Subramanya et al. "Sia: Heterogeneity-aware, goodput-optimized ML-cluster scheduling." SOSP 2023.
4. Wu and Zhang et al. "Transparent GPU Sharing in Container Clouds for Deep Learning Workloads." NSDI 2023.
5. Shukla et al. "Singularity: Planet-Scale, Preemptive and Elastic Scheduling of AI Workloads." 2022.

Check: What is the main advantage of Gavel's LP-based approach to scheduling?

It runs jobs faster on each GPU It eliminates the need for GPUs It makes any scheduling policy heterogeneity-aware by optimizing over allocation and throughput matrices