System Design

How Midjourney Actually Works

From 10 engineers to 10,000 GPUs — the real architecture, the real tradeoffs, the real numbers.

Prerequisites: Basic client-server model + Curiosity. That's it.
12 Chapters · 10+ Simulations · 0 Assumed Knowledge

Chapter 0: The Architecture

Type /imagine. Ten seconds later, four images appear. This is everything in between.

Misconception: Midjourney is a small project. People see the Discord interface and assume it is a toy. It is not. Midjourney processes more GPU-hours per day than most Fortune 500 companies use in a year. It runs on approximately 10,000 GPUs[5], generates $500 million in annual revenue[10], and has been profitable since mid-2022[8] — all without a single dollar of venture capital. The Discord interface is not a limitation. It is a strategic weapon.

Computing the Compute: 5 Petaops Per Image

Let us unpack what "5 petaops per image" actually means, because this number drives every architectural decision in the system.[7]

A petaop is 10¹⁵ floating-point operations. A modern A100 GPU delivers roughly 312 TFLOPS (312×10¹² FLOPS) at FP16 precision. So 5 petaops would take a single A100 approximately:

Time = 5×10¹⁵ ops ÷ (312×10¹² FLOPS) ≈ 16 seconds

That aligns remarkably well with Midjourney's reported generation times of 10–60 seconds (depending on model version and quality settings). The V8 model, at under 10 seconds, likely achieves this through a combination of fewer diffusion steps, more efficient attention mechanisms (Flash Attention[15], SageAttention[15]), and multi-GPU parallelism that distributes the 5 petaops across 2–4 GPUs simultaneously.[16]

Now multiply by daily volume. Each of the 2.5 million daily images requires 5 petaops. That is a total of:

Daily compute = 2.5M × 5 petaops = 12,500 exaops = 1.25×10²² operations

To put this in GPU-hours: if a single A100 delivers 312 TFLOPS, then 1.25×10²² operations take 1.25×10²² ÷ (312×10¹²) ÷ 3,600 ≈ 11,100 GPU-hours per day. At 10,000 GPUs, each GPU averages about 1.1 hours of peak-rate inference per day. Real workloads sustain only a fraction of the 312 TFLOPS peak, so actual busy time per GPU is higher than that. The rest is overhead (scheduling, data transfer, health checks, model loading) and Relax mode's lower utilization. This rough math tells us Midjourney's GPUs are not at 100% utilization, which is correct for a system that must absorb demand spikes.
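The arithmetic above can be reproduced in a few lines. This is a minimal sketch that uses only the figures quoted in the text (5 petaops per image, 312 TFLOPS peak, 2.5 million images per day, 10,000 GPUs); the function names are illustrative, and the peak-throughput assumption makes every result a lower bound on real GPU time.

```python
# Figures quoted in the text; treat them as rough estimates.
OPS_PER_IMAGE = 5e15        # ~5 petaops per image
A100_PEAK_FLOPS = 312e12    # A100 FP16 peak (theoretical; sustained is lower)
IMAGES_PER_DAY = 2_500_000
FLEET_SIZE = 10_000


def seconds_per_image(ops=OPS_PER_IMAGE, flops=A100_PEAK_FLOPS):
    """Time for one GPU running at peak throughput to produce one image."""
    return ops / flops


def daily_gpu_hours(images=IMAGES_PER_DAY):
    """GPU-hours per day if every image ran at peak throughput."""
    return images * seconds_per_image() / 3600


print(f"per image: {seconds_per_image():.1f} s")
print(f"per day:   {daily_gpu_hours():,.0f} GPU-hours")
print(f"per GPU:   {daily_gpu_hours() / FLEET_SIZE:.2f} hours/day")
```

Changing `A100_PEAK_FLOPS` to a sustained figure (an assumption you would have to supply) scales all three results proportionally.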

The User Journey

Every image generation follows the same fundamental path, whether the user is on Discord or the newer web app. Understanding this path is the first step to understanding the architecture.

1. Prompt
User types a text prompt with optional parameters (aspect ratio, model version, stylize strength). This is just a string — maybe 50-200 characters. The prompt might include flags like --ar 16:9 --v 6 --stylize 750.
2. Moderation
Before anything touches a GPU, a text classifier scans the prompt for banned content.[20] If flagged, the request is rejected instantly. This saves compute on every blocked prompt.
3. Queue
The prompt enters a priority queue. Turbo and Fast subscribers jump ahead. Relax-mode users wait. The queue absorbs demand spikes so GPUs stay saturated, not overwhelmed.
4. Inference
A GPU cluster runs the diffusion model — a Diffusion Transformer (DiT)[14]. For V8, this takes under 10 seconds at up to 2K native resolution — a 5x improvement over V5.[17] The model generates 4 images simultaneously in one batch.
5. Post-Moderation
A post-generation image classifier scans the rendered images for visual content violations that the text filter could not predict.[20] Flagged grids are regenerated.
6. Storage
The 4-image grid is uploaded to cloud object storage and cached on a CDN. Over 55 million images sit on Discord's CDN alone — at least 148 TB.[23]
7. Delivery
The image URL is pushed back to the user as a Discord message attachment (or rendered in the web app). Total round-trip: 10-60 seconds depending on mode and queue depth.

Notice that this is a seven-step pipeline, not a simple request-response. Two of those steps are moderation (pre and post), which tells you something about the operational reality of serving generative AI at scale. The generation step itself — the actual GPU inference — is just one piece of the puzzle.
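The seven steps can be sketched as a chain of functions. Everything here is hypothetical: the function names, the placeholder banned-term list, and the example CDN hostname are stand-ins chosen for illustration, not Midjourney's implementation. The point of the sketch is the ordering, in particular that a rejected prompt never reaches the queue or the GPU.

```python
BANNED_TERMS = {"bannedword"}  # placeholder; a real filter is a trained classifier


def pre_moderate(prompt):
    """Step 2: cheap text check before any GPU work is scheduled."""
    return not any(term in prompt.lower() for term in BANNED_TERMS)


def run_inference(prompt):
    """Step 4: stand-in for the diffusion model; returns a 4-image grid."""
    return [f"image_{i}.png" for i in range(4)]


def post_moderate(images):
    """Step 5: stand-in for the image classifier; accepts everything here."""
    return True


def store(images):
    """Step 6: stand-in for object storage plus CDN; returns public URLs."""
    return [f"https://cdn.example.invalid/{name}" for name in images]


def handle_request(prompt):
    trace = ["prompt", "moderation"]
    if not pre_moderate(prompt):
        return {"status": "rejected", "urls": [], "trace": trace}
    trace.append("queue")  # a real system would wait here for a free GPU
    trace.append("inference")
    images = run_inference(prompt)
    trace.append("post-moderation")
    if not post_moderate(images):
        # The text says flagged grids are regenerated; this sketch only reports the flag.
        return {"status": "flagged", "urls": [], "trace": trace}
    trace.append("storage")
    urls = store(images)
    trace.append("delivery")
    return {"status": "ok", "urls": urls, "trace": trace}


print(handle_request("a cyberpunk cityscape --ar 16:9")["trace"])
print(handle_request("a BannedWord scene")["trace"])
```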

Pay attention to where time is actually spent along this path: the queue and inference steps dominate everything else.

The Scale

Numbers without context are meaningless. Let us put Midjourney's metrics side by side with familiar references so the scale becomes visceral. Each row below should make you pause and recalibrate your mental model of what a "small AI company" looks like.

Metric | Midjourney | Context
Daily images | 2.5 million[12] | Instagram gets ~100M photo uploads/day, but those are captured, not generated from scratch by a neural network
Registered users | 20 million Discord accounts[11] | Roughly the population of Romania, or 3× the population of Hong Kong
Daily active users | 1.2–2.5 million[11] | More than most AAA multiplayer games at peak. Comparable to Fortnite's average concurrent players.
Concurrent users | ~1 million[11] | A sold-out NFL stadium holds 70,000. Midjourney has ~14 stadiums online simultaneously.
GPU fleet | ~10,000 GPUs[5] | A top university HPC cluster has ~500–1,000 GPUs. Midjourney has the equivalent of 10–20 university clusters.
Revenue | $500M/year (2025)[10] | Revenue per employee: ~$2.6M. Google: ~$1.5M. Apple: ~$2.4M.
Revenue growth | $50M → $200M → $300M → $500M[10] | 10x growth in 3 years (2022–2025). No sales team. No marketing budget.
Employees | ~192 (April 2026)[9] | Grew from 10 in 2021 to 40 in 2023 to 192 in 2026. Still tiny for a $500M company.
Compute per image | ~5 petaops[7] | A single iPhone 15 does ~17 teraops/sec. One Midjourney image would take an iPhone ~5 minutes of maxed-out compute.
Funding | $0 VC[8] | Stability AI raised $101M. Jasper raised $125M. Both lost market share to Midjourney. Bootstrapping won.
Revenue per employee matters. At $500M revenue and ~192 employees, Midjourney generates roughly $2.6 million per employee[25]. This metric reveals operational efficiency. Midjourney achieves top-tier revenue efficiency with no sales team, no marketing department, no enterprise sales motion, and no customer success organization. The product sells itself through the images users share on social media and in Discord channels. This is not an accident — it is a direct consequence of the Discord-first architecture.

Why Discord Was the Perfect Launchpad

In 2022, David Holz had a choice: build a web app from scratch or piggyback on an existing platform. He chose Discord, and that decision was arguably the most important architectural choice Midjourney ever made.[13] Here is why.

Zero acquisition cost. Discord already had 150 million monthly active users. Midjourney did not need to build authentication, user accounts, payment processing for the initial launch, social features, or a content sharing mechanism. All of that came for free. Users invited friends to the Midjourney Discord server the same way they invite friends to any Discord server — organically, socially, virally.

Built-in virality. When you generate an image in a Discord channel, everyone in that channel sees it. Your beautiful cyberpunk cityscape is not hidden behind a login wall — it is right there in the chat. Other users see it, type their own prompts, and the loop continues. This is the most efficient growth engine in consumer tech: the product advertises itself during use.

Natural rate limiting. Discord imposes its own rate limits on bot interactions. This gave Midjourney a built-in mechanism to throttle demand without building custom rate-limiting infrastructure. When demand exceeded capacity, users simply experienced slightly longer waits in the chat — a familiar Discord experience, not a frustrating error page.

Community as moat. The Midjourney Discord server became a community of millions of artists, designers, and hobbyists sharing techniques, prompts, and results. This community creates switching costs that no competitor can replicate by building a better model alone. You do not just use Midjourney; you belong to Midjourney.

Zero infrastructure for a 10-person team. In 2021, when Midjourney was founded, the entire company was approximately 10 people.[9] Building a production web application with authentication, real-time messaging, payment processing, mobile apps, CDN, abuse prevention, and social features would have consumed the entire team for 6–12 months. By choosing Discord, they could focus 100% of engineering effort on the model and inference infrastructure, the parts that actually differentiate the product.

Web app came later. The web application at alpha.midjourney.com launched with V6.1 in August 2024[13], more than two years after the initial Discord launch. By then, the brand, community, and revenue engine were already established. The web app was an expansion, not a pivot — both interfaces hit the same backend infrastructure. Discord remains the primary interface.

Discord as Growth Engine: The Numbers

Let us quantify Discord's value as a growth platform. Midjourney grew from 0 to 20 million registered users[11] with zero marketing spend. How?

The viral loop. A user generates an image in a public Discord channel. Other users in the channel see the image. Some of them try generating their own images. Their images are also visible. Each image is simultaneously product, advertisement, and social proof. The viral coefficient (K-factor) is likely greater than 1, meaning each user brings in more than one additional user on average.

Compare this to a traditional web app. A user generates an image. Nobody else sees it unless the user actively shares it on social media (which requires effort). The viral coefficient is much lower, probably 0.2–0.4. Midjourney's Discord integration turns the K-factor from below 1 to above 1. With K below 1, each wave of referrals is smaller than the last and growth stalls without paid acquisition; with K above 1, each wave is larger than the last and growth compounds.
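The difference between a K-factor below and above 1 is easy to see in a toy model. The sketch below assumes a simple generational model in which every new user recruits `k_factor` further users in the next cycle; the seed size, cycle count, and the specific values 0.3 and 1.2 are illustrative assumptions, not measured figures.

```python
def simulate_users(seed_users, k_factor, cycles):
    """Cumulative users after each referral cycle in a simple K-factor model."""
    total = float(seed_users)
    new_users = float(seed_users)
    history = [total]
    for _ in range(cycles):
        new_users *= k_factor      # each new user brings in k_factor more
        total += new_users
        history.append(total)
    return history


for k in (0.3, 1.2):
    final = simulate_users(seed_users=1_000, k_factor=k, cycles=10)[-1]
    print(f"K = {k}: {final:,.0f} users after 10 cycles")
```

With K = 0.3 the total converges toward seed ÷ (1 − K), so 1,000 seed users never become more than about 1,429. With K = 1.2 the same seed keeps growing every cycle.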

Community retention. Users do not just generate images — they participate in prompt-sharing channels, technique discussions, and community events. This social engagement creates daily habit patterns that increase retention. A user who visits a Discord server daily is far less likely to churn than a user who visits a web app only when they need an image.

The financial value of Discord to Midjourney is staggering. If a traditional customer acquisition cost (CAC) for a $30/month subscription is $20 (a low estimate for consumer software), then acquiring 20 million users would cost $400 million. Midjourney spent $0. That $400 million of saved marketing cost is the implicit value of the Discord strategy.

Why This Product is Hard to Build

Many startups have tried to build "Midjourney competitors." Most fail not because their model is bad, but because they underestimate the system design challenges. Here are the five hardest problems, each of which we will solve in subsequent chapters.

1. GPU fleet management. Running 10,000 GPUs[5] is not "10,000 single-GPU jobs." It is multi-GPU inference coordination, model version management, thermal throttling, failure recovery, and capacity planning across time zones. This requires deep systems engineering.

2. Priority scheduling that is fair AND profitable. Turbo users must get sub-10-second results. Fast users must get 10-30 second results. Relax users must eventually get results. And the system must never violate these SLAs even under peak load. Building a scheduler that satisfies all three tiers simultaneously is a non-trivial constraint optimization problem.

3. Content moderation at 2.5M images/day. You cannot hire enough human moderators to review 2.5 million images daily. The two-stage automated pipeline[20] (text filter + image classifier) must be accurate enough to avoid false negatives (harmful content gets through) while fast enough to not add latency and cheap enough to not consume significant GPU resources.

4. Storage and delivery at petabyte scale. Every image ever generated must be stored permanently (users expect to access their history). The CDN must serve viral images globally with low latency. Storage costs grow monotonically. You need lifecycle policies, tiered storage, and efficient image encoding.

5. Revenue-cost alignment. Every GPU-hour must generate more revenue than it costs. This sounds simple but requires precise metering (tracking GPU-seconds per user), fair billing (the $4/hour rate[18]), and cost optimization (quantization, Flash Attention[15]) to maintain margins as users expect higher quality and higher resolution.

Each of these five challenges maps to one or more chapters in this lesson. GPU fleet management is Chapter 6. Priority scheduling is Chapter 4. Content moderation is revisited in Chapter 7. Storage and delivery are Chapter 8. Revenue-cost alignment threads through every chapter, because every architectural choice has a dollar sign attached to it.

A Note on David Holz

Understanding the founder helps explain the architectural philosophy. David Holz is a physicist and engineer who previously co-founded Leap Motion (hand-tracking hardware). He is not a web developer, not a social media executive, and not a venture-capital-funded growth hacker. He is a systems thinker who optimizes for efficiency, not headcount.

This explains several signature Midjourney decisions: the tiny team (~192 employees for a $500M company[9][10]), the refusal of VC money[8], the Discord-first approach (leverage existing infrastructure rather than build your own), and the aggressive inference optimization (V8's 5x speedup[17]). The architecture reflects the founder's values: do more with less, spend engineering effort only where it creates asymmetric value.

This is a systems problem, not a model problem. The model (DiT[14]) is important but replicable — competitors can and do train comparable diffusion transformers. The system (queue + scheduler + GPU fleet + moderation + CDN + billing) is the hard part. It is the system that makes 2.5 million images per day possible with 192 people. This lesson teaches the system.

The Subscription Tiers

Midjourney's pricing is remarkably simple for a product this complex. Four tiers, each differentiated primarily by GPU time allocation and concurrency.

Plan | Monthly price | Fast GPU time | Concurrency limit | Relax mode
Basic | $10 | 3.3 hours/month | 3 concurrent jobs[19] | Not included
Standard | $30 | 15 hours/month | 3 concurrent jobs | Unlimited
Pro | $60 | 30 hours/month | 12 concurrent jobs | Unlimited
Mega | $120 | 60 hours/month | 15 concurrent jobs[19] | Unlimited

At $4/GPU-hour[18], a Basic subscriber paying $10/month gets 3.3 hours of Fast GPU time, which is $13.20 of GPU time for $10. Valued at that rate, Midjourney loses money on any Basic subscriber who uses the full allocation. But Basic subscribers become Standard subscribers, and Standard subscribers are profitable. The Basic tier is a customer acquisition tool, not a revenue driver.

The Standard tier at $30/month is likely where most revenue comes from. 15 hours of Fast time = $60 of GPU time at $4/hr. But the real cost to Midjourney is not $4/hr (cloud retail price) but their blended cost per GPU-hour, which is much lower with long-term contracts and high utilization. So the Standard tier is probably profitable at ~50%+ gross margin.
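The tier economics can be checked with a short calculation. The plan prices and Fast-hour allocations come from the table above, and $4 is the quoted rate[18]. The $1.00 blended cost per GPU-hour is purely an assumed value, chosen to show what a roughly 50% margin on Standard would imply; the real figure is not public.

```python
LIST_RATE = 4.00            # quoted price per Fast GPU-hour [18]
ASSUMED_BLENDED_COST = 1.00  # hypothetical internal cost per GPU-hour

# plan -> (monthly price in dollars, Fast GPU-hours included)
PLANS = {
    "Basic": (10, 3.3),
    "Standard": (30, 15),
    "Pro": (60, 30),
    "Mega": (120, 60),
}


def gpu_time_value(fast_hours, rate=LIST_RATE):
    """Value of the included Fast time at the list rate."""
    return fast_hours * rate


def gross_margin(price, fast_hours, cost_per_gpu_hour=ASSUMED_BLENDED_COST):
    """Margin if the subscriber uses every included Fast hour."""
    return (price - fast_hours * cost_per_gpu_hour) / price


for name, (price, hours) in PLANS.items():
    print(f"{name}: ${gpu_time_value(hours):.2f} of GPU time for ${price}, "
          f"margin {gross_margin(price, hours):.0%} at the assumed cost")
```

The margin ignores Relax-mode usage, which is unlimited on the higher tiers and would lower it.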

The Model: A Diffusion Transformer

The model at the heart of Midjourney is a Diffusion Transformer (DiT)[14], a neural network that starts with pure noise and iteratively denoises it into a coherent image, guided by the text prompt. We will dive deep into the architecture in Chapter 5, but here are the key facts for now.

The diffusion process works in two phases. During training, the model learns to reverse a noise-adding process: given a noisy image, predict the original clean image (or equivalently, predict the noise that was added). During inference, the model starts from pure random noise and applies this learned denoising repeatedly — typically 20–50 steps — to sculpt the noise into a coherent image that matches the text prompt.
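The shape of the inference loop can be shown with a toy example. This is not a diffusion model: `toy_denoiser` is a hypothetical stand-in that already knows the target and simply reports how far the sample is from it, whereas a trained network has to predict the noise from the noisy input and the text embedding. The step schedule is also an assumption made for simplicity. What the sketch does show is the structure: start from random noise, then repeatedly subtract a share of the predicted noise.

```python
import random


def toy_denoiser(x, target):
    """Stand-in for the trained network: returns the noise still present in x."""
    return [xi - ti for xi, ti in zip(x, target)]


def sample(target, steps=30, seed=0):
    """Iteratively denoise a random vector toward `target` in `steps` steps."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]   # start from pure noise
    for step in range(steps):
        remaining = steps - step
        predicted_noise = toy_denoiser(x, target)
        # Remove an equal share of the remaining noise at every step.
        x = [xi - ni / remaining for xi, ni in zip(x, predicted_noise)]
    return x


target = [0.2, -0.5, 0.9, 0.0]   # plays the role of "the image the prompt asks for"
result = sample(target)
print([round(v, 3) for v in result])
```

Each of the steps corresponds to one forward pass of the network, which is why the step count is a direct multiplier on GPU time per image.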

The Transformer part means the denoising network uses the same attention mechanism as GPT and other language models, rather than the U-Net architecture used by earlier diffusion models like Stable Diffusion 1.x. Transformers scale better with compute, which is why Midjourney invested in the DiT architecture for later versions.

The training infrastructure has undergone a dramatic shift. V4 and V5 were trained on Google TPU v4 pods using JAX[2], leveraging Google Cloud's TPU infrastructure.[1] But V8, released in March 2026, was a complete rewrite from JAX/TPU to PyTorch/GPU.[4] Inference has always run on GPUs — "huge clusters of GPUs" via Google Cloud NVIDIA GPU VMs.[3]

This dual-stack history matters for system design. It tells us that Midjourney treats training and inference as separate infrastructure problems with potentially different hardware, frameworks, and optimization strategies. This is confirmed by the 90/10 cost split: 90% of compute cost goes to inference, only 10% to training.[6] The inference stack is the one that needs to be optimized relentlessly, because it runs 24/7 under real-time latency constraints.

Training data. Midjourney's training data includes LAION-5B and what Holz described as a "big scrape of the internet."[22] V6 alone took 9 months to train from scratch[21] — a reminder that even at 10% of total cost, training is a massive investment in calendar time and engineering effort.

The 4-Image Grid Trick

One design decision that is easy to overlook: Midjourney generates four images per request, not one. This is not just a product feature — it is an architectural optimization with three distinct benefits.

Amortized overhead. Queue processing, prompt encoding, text embedding computation, job scheduling, and result delivery all happen once per request. By producing 4 images per request, the overhead per image drops to 25% of what a single-image request would incur. The text encoder (likely a CLIP-family model[14]) runs once regardless of how many images you generate from the same prompt.

Batch efficiency. Modern GPUs achieve peak throughput with larger batch sizes. The diffusion transformer processes 4 latent representations in parallel, which saturates the GPU's tensor cores more efficiently than processing 1. The wall-clock time for 4 images is roughly 1.5–2× the time for 1 image, not 4×.

Reduced re-rolls. From the user's perspective, four options are psychologically superior to one. Users are more likely to find at least one image they like, which reduces the re-roll rate (regenerating because the result was unsatisfactory). Every avoided re-roll saves a full job's worth of GPU compute.

From the user's perspective, four options feel generous. From the system's perspective, it is batching for throughput. The beauty is that both perspectives are correct, and they reinforce each other.

Let us quantify the batch efficiency. A single V8 image at 1024×1024 takes approximately 7 seconds of GPU time. A batch of 4 images from the same prompt takes approximately 10 seconds (not 28). The overhead savings are:

Batch speedup = (4 × 7s) ÷ 10s = 2.8x

This means the 4-image grid delivers 2.8× as many images per GPU-second as generating 4 separate single-image requests. At scale the savings are large: at 625,000 jobs per day and roughly 20 GPU-seconds per batched job (about 3,472 GPU-hours per day), serving the same images without batching would take 2.8× as long, about 9,722 GPU-hours per day. That is a saving of roughly 6,250 GPU-hours per day. At $3/GPU-hour, that is about $18,750/day or roughly $6.8M/year in savings from a single product design decision.
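The batch speedup formula generalizes to any batch size and timing. The sketch below uses the 7-second and 10-second figures from the text as defaults; both are approximate, and the function names are illustrative.

```python
def batch_speedup(single_image_s=7.0, batch_s=10.0, batch_size=4):
    """How many times more images per GPU-second a batch delivers."""
    return (batch_size * single_image_s) / batch_s


def gpu_seconds_saved_per_job(single_image_s=7.0, batch_s=10.0, batch_size=4):
    """GPU time saved by one batched job versus separate single-image runs."""
    return batch_size * single_image_s - batch_s


def seconds_per_image(batch_s=10.0, batch_size=4):
    """Effective GPU time per image inside a batch."""
    return batch_s / batch_size


print(f"speedup:           {batch_speedup():.1f}x")
print(f"saved per job:     {gpu_seconds_saved_per_job():.0f} GPU-seconds")
print(f"effective per img: {seconds_per_image():.1f} s (vs 7.0 s unbatched)")
```

If a job spans several GPUs, the per-job saving multiplies by the number of GPUs involved.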

Revenue Model

Midjourney's pricing revolves around GPU time. At approximately $4/hour of GPU time[18], users buy monthly plans that include a fixed allocation of "Fast" GPU minutes. When those run out, they can switch to "Relax" mode (lower priority, free but slower) or buy more Fast time.

This is a brilliant monetization structure because it directly ties revenue to the most expensive resource: GPU compute. Users who generate more images pay more. Users who are patient pay less. The system naturally load-balances itself — price-sensitive users shift to off-peak hours for faster Relax processing, smoothing the demand curve.

Mode | Priority | Typical Wait | Cost
Turbo | Highest | <10 seconds (V8) | 2× Fast rate
Fast | High | 10–30 seconds | ~$4/GPU-hour[18]
Relax | Low | 30 seconds – 10 minutes | Included (unlimited)
Concurrency limits as fairness. Midjourney limits how many simultaneous jobs a user can run: 3 for Basic subscribers, up to 15 for Mega subscribers, with a queue holding up to 10 pending jobs per user.[19] This is not just throttling; it is a fairness mechanism. Without concurrency limits, a single subscriber running automated workflows could occupy an unbounded number of GPU slots. Even at the Mega cap, 15 continuously busy slots are equivalent to hundreds of casual users. The limits ensure that GPU access is distributed across the user base.
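Tiered priority and per-user concurrency caps combine naturally in a priority queue. The following is a minimal sketch using Python's standard `heapq`, not a description of Midjourney's scheduler: the `Scheduler` class, its method names, and the strict tier ordering are all assumptions. A production scheduler would also need aging so Relax jobs cannot starve.

```python
import heapq
import itertools
from collections import defaultdict

PRIORITY = {"turbo": 0, "fast": 1, "relax": 2}   # lower number is served first


class Scheduler:
    def __init__(self, max_concurrent):
        self.max_concurrent = max_concurrent   # user -> cap on running jobs
        self.running = defaultdict(int)        # user -> jobs currently on GPUs
        self.heap = []
        self.counter = itertools.count()       # FIFO tie-breaker within a tier

    def submit(self, user, mode, prompt):
        entry = (PRIORITY[mode], next(self.counter), user, prompt)
        heapq.heappush(self.heap, entry)

    def next_job(self):
        """Pop the highest-priority job whose owner is under their cap."""
        skipped, chosen = [], None
        while self.heap:
            entry = heapq.heappop(self.heap)
            user = entry[2]
            if self.running[user] < self.max_concurrent[user]:
                self.running[user] += 1
                chosen = entry
                break
            skipped.append(entry)              # owner is at their limit
        for entry in skipped:
            heapq.heappush(self.heap, entry)   # keep capped jobs queued
        return None if chosen is None else (chosen[2], chosen[3])

    def finish(self, user):
        self.running[user] -= 1


sched = Scheduler({"alice": 1, "bob": 3})
sched.submit("alice", "relax", "a quiet lake")
sched.submit("bob", "fast", "a red fox")
sched.submit("alice", "turbo", "a neon city")
print(sched.next_job(), sched.next_job(), sched.next_job())
```

In the usage example, Alice's Turbo job runs first, Bob's Fast job second, and Alice's Relax job stays queued until `finish("alice")` frees her single slot.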

Content Moderation: A System Component, Not an Afterthought

With 2.5 million images generated daily, content moderation is not optional — it is a core system component that directly affects GPU economics. Midjourney uses a two-stage moderation pipeline.[20]

Stage 1: Pre-generation text filter. Before the prompt enters the job queue, a text classifier scans it for banned content categories (violence, NSFW, specific public figures, copyrighted characters). This runs in single-digit milliseconds. Every banned prompt caught here saves 5 petaops of wasted GPU compute.

Stage 2: Post-generation image classifier. After the GPU renders the images, an image classification model scans the output for visual content violations. This catches cases where the prompt seemed innocuous but the result is not — a surprisingly common occurrence with generative models, because the training data contains associations that are difficult to predict from text alone.

Let us quantify the pre-generation filter's value. Suppose 5% of prompts are banned (a conservative estimate given the diversity of user intent). At 625,000 jobs/day, that is 31,250 jobs caught before inference. At 20 GPU-seconds per job, that is 625,000 GPU-seconds = 174 GPU-hours per day of saved compute. At $3/GPU-hour all-in cost, that is $522/day or $190,000/year saved by a text filter that costs pennies to run. The ROI on pre-generation moderation is extraordinary.
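The same estimate as a parameterized calculation, so the 5% ban rate and $3/GPU-hour cost (both assumptions from the paragraph above) can be varied. The function name is illustrative.

```python
def moderation_savings(jobs_per_day=625_000, banned_fraction=0.05,
                       gpu_seconds_per_job=20, cost_per_gpu_hour=3.0):
    """Return (blocked jobs, GPU-hours saved, dollars saved) per day."""
    blocked_jobs = jobs_per_day * banned_fraction
    gpu_hours = blocked_jobs * gpu_seconds_per_job / 3600
    dollars = gpu_hours * cost_per_gpu_hour
    return blocked_jobs, gpu_hours, dollars


blocked, hours, dollars = moderation_savings()
print(f"blocked jobs/day: {blocked:,.0f}")
print(f"GPU-hours saved:  {hours:,.0f}")
print(f"saved per day:    ${dollars:,.0f}")
print(f"saved per year:   ${dollars * 365:,.0f}")
```

The unrounded result is slightly below the rounded figures in the text because the text rounds 173.6 GPU-hours up to 174 before multiplying.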

The Growth Story

Midjourney's growth trajectory tells us something important about the system design constraints at each stage.

Period | Team | Revenue | Infrastructure | Key constraint
2021 | ~10[9] | | Small GPU cluster, Discord bot only | Model quality: making images good enough to share
2022 | ~10-20 | $50M[10] | Google Cloud, TPU v4 for training[1] | Scaling inference to match viral demand
2023 | ~40[9] | $200M[10] | ~10K GPUs[5], JAX/TPU training | Queue management, priority scheduling, profitability
2024 | ~40-100 | $300M[10] | Web app added[13] | Multi-frontend support, storage growth
2025-26 | ~192[9] | $500M[10] | V8 rewrite: PyTorch/GPU[4] | Inference speed (5x improvement), 2K native resolution

At every stage, the binding constraint shifted. First it was model quality (nobody will pay for bad images). Then it was inference throughput (viral demand exceeded GPU capacity). Then it was scheduling fairness (paying users must not wait behind free users). Then it was multi-frontend support (not everyone wants to use Discord). And now it is inference speed and resolution (V8 needs to be 5x faster at 2x the resolution).

This progression mirrors most successful systems: the bottleneck migrates downstream as each layer is solved. A system architect must always know where the current bottleneck is and where it will move next.

The bottleneck migration pattern. In any system design interview, identifying the current bottleneck is necessary but not sufficient. Staff-level thinking requires anticipating where the bottleneck moves after you solve the current one. For Midjourney: once they solved GPU throughput (2022-2023), the bottleneck moved to queue scheduling fairness. Once they solved that, it moved to inference latency. The architecture must evolve with the bottleneck.

The V8 Inflection: JAX/TPU to PyTorch/GPU

One of the most dramatic events in Midjourney's technical history happened in March 2026: V8 was a complete rewrite from JAX/TPU to PyTorch/GPU.[4] This was not a minor migration. It was rewriting the entire training stack — the model code, the training loop, the data pipeline, the checkpoint format, the distributed training strategy — from scratch in a different framework on different hardware.

Why would a profitable company with a working system undertake such a massive rewrite? Several likely reasons:

Unified stack. Before V8, Midjourney had a split stack: training on JAX/TPU, inference on PyTorch/GPU. This meant every model needed to be converted from JAX format to PyTorch format before deployment. Every custom layer, every attention variant, every optimization had to be implemented twice. A unified PyTorch stack eliminates this translation layer and halves the code surface.

Ecosystem advantages. PyTorch has a larger ecosystem of inference optimizations: Flash Attention, SageAttention, TensorRT integration, xDiT for multi-GPU parallelism, quantization tools. These libraries are developed GPU-first. Midjourney's GitHub forks[15] confirm heavy investment in PyTorch-ecosystem attention optimizations.

Hardware flexibility. TPUs are only available from Google Cloud. GPUs are available from Google Cloud, AWS, Azure, Oracle, CoreWeave, Lambda, and dozens of other providers. Moving to GPU training gives Midjourney more negotiating power and redundancy options for its infrastructure.

V8's results speak. The rewrite produced a model that generates images 5x faster at up to 2K native resolution, in under 10 seconds.[17] Whatever the cost of the rewrite, the payoff in inference efficiency is transformative — each GPU can now serve 5x more images per hour.

Competitive Landscape

Midjourney does not exist in a vacuum. Understanding the competitive landscape reveals why certain architectural decisions were existential, not merely optimal.

Competitor | Model | Pricing | Key difference
DALL-E 3 (OpenAI) | Proprietary (likely DiT) | $0.04-0.08/image via API | API-first, integrated with ChatGPT, lower image quality perception
Stable Diffusion (Stability AI) | Open-source (U-Net → DiT) | Free (local) / API pricing | Open-source, self-hostable, huge community, lower quality ceiling
Firefly (Adobe) | Proprietary | Included in Creative Cloud | Commercially safe training data, integrated with Photoshop
Ideogram | Proprietary | Freemium | Better text rendering in images, strong on typography
Flux (Black Forest Labs) | Semi-open | API pricing | Created by ex-Stability researchers, strong technical foundation

Midjourney's competitive moat is not the model alone (competitors catch up) or the price (comparable). It is the community + speed + consistency trifecta. The Discord community creates network effects. The inference speed (V8: sub-10 seconds) creates a responsive creative experience. The aesthetic consistency (Midjourney images have a recognizable "look") creates brand identity. Architecture enables all three.

The Upscale and Variation Loop

The user journey does not end with the 4-image grid. After receiving their grid, users interact with it through buttons that trigger additional GPU jobs:

Upscale (U1-U4). The user selects one of the four images and requests a high-resolution version. This runs the image through an upscaling model (likely a separate, lighter network) to produce a 2048×2048 or 4096×4096 output. Each upscale is an additional GPU job with its own queue priority and billing.

Variation (V1-V4). The user selects one image and requests "more like this." The system uses the selected image's seed and embedding as conditioning, with slight noise perturbation, to generate 4 new variations. This is a full inference job — same GPU cost as the original generation.

Remix. The user modifies the prompt text while keeping the same composition/structure. This re-runs inference with a new prompt but conditioned on the original image's latent, producing a hybrid result.

Reroll. The user re-runs the exact same prompt with a new random seed, generating a completely fresh 4-image grid.

Each of these interactions is an additional GPU job. The average user probably does 2–3 follow-up actions per initial generation (one upscale, one variation, maybe one reroll). This means the effective number of GPU jobs per user session is 3–4x the base generation count. The 2.5 million images/day figure[12] likely includes these follow-up operations, meaning the base "original prompt" volume is perhaps 600K–800K images per day.

What We Will Build

Over the next 11 chapters, we will design every layer of this system. Not in the abstract — with actual numbers, actual data flows, actual technology choices, and actual tradeoffs. By the end, you will be able to draw Midjourney's architecture on a whiteboard from memory, explain every design decision, and defend those decisions against an interviewer's probing questions.

The journey looks like this:

Ch 1: The Numbers
Back-of-envelope estimation — derive QPS, fleet size, storage, bandwidth, and revenue per GPU-hour from first principles.
Ch 2: The Architecture
The six-layer component topology — the whiteboard diagram you draw in the first 5 minutes of a design interview.
Ch 3: Discord Bot & Gateway
How the frontend connects to the backend — Discord as an unconventional but strategic platform choice.
Ch 4-11: Deep Dives
Queue system, GPU scheduler, diffusion model, storage, CDN, monitoring, failure handling, and scaling evolution.

Chapter 1: The Numbers

Before we draw a single architecture box, we need to do something that separates staff engineers from everyone else in a system design interview: back-of-envelope estimation. Every architectural decision in a system this large is driven by numbers. How many requests per second? How many GPUs? How much storage? How much bandwidth? If you cannot estimate these from first principles, you cannot reason about tradeoffs.

In this chapter, we will derive every critical number from a single starting point: 2.5 million images per day.[12] That one fact, combined with basic arithmetic, will tell us the fleet size, the storage growth, the bandwidth bill, and even the revenue per GPU-hour. This is exactly the kind of estimation an interviewer expects in a system design round — and it is exactly the skill that most candidates lack.

The method is simple: start with what you know (daily images), derive what you need (QPS, GPU-hours, storage, bandwidth), and then check your answers against reality. If the derived numbers do not match the reported facts, you have found an insight.

Misconception: "Just spin up more GPUs." People assume scaling is simply a matter of adding hardware. It is not. Every GPU added increases coordination overhead (scheduling, health checks, failure recovery), network bandwidth requirements (model weights must be distributed, results must be collected), failure probability (at 10,000 GPUs, you lose ~1 GPU per hour statistically), and operational complexity (monitoring, alerting, capacity planning). The numbers we derive here expose the constraints that make naive scaling impossible. A 10,000-GPU fleet is not 10,000 independent machines — it is a distributed system with all the pain that implies.

Step 1: Queries Per Second (QPS)

Start with the daily volume and convert to a rate. This is always your first step in any estimation.

QPS_avg = 2,500,000 images ÷ 86,400 seconds ≈ 29 QPS

But no system runs at average load. Real traffic follows a diurnal pattern — peaks during US and European evening hours (when people are off work and generating art for fun), valleys during the early morning hours. A standard rule of thumb for consumer products is that peak QPS is 3× average QPS. For global products with users across time zones, the ratio is lower (maybe 2x) because peaks flatten. For US-dominant products, it can be 4-5x.

QPS_peak = 29 × 3 ≈ 87 QPS

87 QPS for image generation. That sounds low compared to a web server handling 10,000 QPS — but each Midjourney request requires 10–60 seconds of GPU compute, not 50 milliseconds of CPU time. The throughput is low because the work per request is enormous.

At first glance, 87 QPS looks far above the reported processing capacity of 20–40 jobs/second.[24] The apparent discrepancy resolves when you distinguish images from jobs: each job produces a 4-image grid.

Jobs vs. images — a crucial distinction. One user request = one job = one 4-image grid. So 2.5M images/day ÷ 4 = 625,000 jobs/day, which is ~7.2 jobs/second average, ~22 jobs/second peak. At 20–40 jobs/second capacity[24], the system has headroom — but not much. This is a system operating at high utilization by design. In an interview, always clarify: are we counting user requests, jobs, or output artifacts? The answer changes every downstream number.
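The conversions above can be sketched in a few lines. This is a minimal illustration, assuming the 3× peak factor and 4 images per job used in the text; the function name `request_rates` and its keys are hypothetical.

```python
SECONDS_PER_DAY = 86_400

def request_rates(images_per_day, images_per_job=4, peak_factor=3.0):
    """Convert a daily image count into average and peak rates."""
    image_qps_avg = images_per_day / SECONDS_PER_DAY
    jobs_per_day = images_per_day / images_per_job
    job_qps_avg = jobs_per_day / SECONDS_PER_DAY
    return {
        "image_qps_avg": image_qps_avg,
        "image_qps_peak": image_qps_avg * peak_factor,
        "jobs_per_day": jobs_per_day,
        "job_qps_avg": job_qps_avg,
        "job_qps_peak": job_qps_avg * peak_factor,
    }

rates = request_rates(2_500_000)
for name, value in rates.items():
    print(f"{name}: {value:,.1f}")
```

Keeping image rates and job rates as separate keys makes it harder to mix the two units downstream.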

Step 2: GPU Fleet Sizing

This is the most important estimation, because GPU compute is the dominant cost. Let us work through it carefully.

For Fast mode on V8: generation takes under 10 seconds.[17] Let us use 10 seconds as the upper bound. For a 4-image grid, the total GPU time per job depends on whether the 4 images are generated in parallel (on multiple GPUs) or sequentially. With multi-GPU inference parallelism via sequence parallelism and PipeFusion (confirmed by the xDiT fork on Midjourney's GitHub)[16], a single job likely uses 2–4 GPUs simultaneously for about 10 seconds.

Let us estimate conservatively: each job uses 2 GPUs for 10 seconds = 20 GPU-seconds per job.

GPU-seconds/day = 625,000 jobs × 20 GPU-sec/job = 12,500,000 GPU-sec
GPU-hours/day = 12,500,000 ÷ 3,600 = 3,472 GPU-hours/day

But that is just the raw inference time. Real systems have overhead that reduces effective utilization:

| Overhead source | Impact |
|---|---|
| Scheduling latency | Time between a GPU finishing one job and starting the next (queue polling, job setup) |
| Data transfer | Loading prompt embeddings, uploading results to storage |
| Model warmup | If a GPU switches model versions (V5 → V8), it must load new weights |
| Health checks | Periodic self-tests, memory checks, thermal monitoring |
| GPU failures | At 10K GPUs, expect ~1 failure per hour. Recovery takes minutes. |
| Utilization gaps | Between peak and trough, some GPUs idle even with Relax mode backfill |

A realistic GPU utilization of 60–70% means we need more raw GPU-hours than the inference math alone suggests.

Effective GPU-hours needed = 3,472 ÷ 0.65 ≈ 5,341 GPU-hours/day

At 24 hours per GPU per day: 5,341 ÷ 24 ≈ 223 GPUs for inference alone.
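The same chain of divisions, written as a small function. It is a sketch under the assumptions stated above (2 GPUs × 10 seconds per job, 65% utilization); `inference_gpus` is a hypothetical helper name.

```python
import math

def inference_gpus(jobs_per_day, gpu_seconds_per_job, utilization):
    """Return (raw GPU-hours/day, effective GPU-hours/day, GPUs needed)."""
    raw_gpu_hours = jobs_per_day * gpu_seconds_per_job / 3600
    # Overhead means only a fraction of each GPU-hour does useful inference.
    effective_gpu_hours = raw_gpu_hours / utilization
    # Each GPU contributes 24 GPU-hours per day; round up to whole GPUs.
    gpus = math.ceil(effective_gpu_hours / 24)
    return raw_gpu_hours, effective_gpu_hours, gpus

raw, effective, gpus = inference_gpus(625_000, 20, 0.65)
print(f"raw: {raw:,.0f} GPU-h/day, effective: {effective:,.0f} GPU-h/day, GPUs: {gpus}")
```

With these inputs the function lands on the same 223-GPU figure as the hand calculation.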

Wait — that is far fewer than the reported ~10,000 GPUs.[5] This is the most interesting part of the estimation: the gap between our naive calculation and reality. What accounts for the 45x difference?

| Factor | Multiplier | Explanation |
|---|---|---|
| Relax mode volume | ~2x | Relax jobs still consume GPU time; they are just lower priority. If 50% of total volume is Relax (reasonable for unlimited plans), double the GPU-hours. |
| Upscale and variation jobs | ~1.5x | Users upscale (U1-U4), vary (V1-V4), and remix images. These are additional GPU jobs not counted in the 2.5M base image figure. |
| Pre-V8 slower models | ~2x | Older model versions (V5, V5.2, V6, V6.1) take 30–60 seconds per job. Many users still use them. If the average job takes 40 GPU-seconds instead of 20, the GPU-hours double. |
| Multi-GPU per job | ~2-4x | If V8 actually uses 4–8 GPUs per job via PipeFusion/sequence parallelism[16], rather than the 2 assumed above, the GPU count multiplies accordingly. |
| Training allocation | +10% | 10% of compute goes to training[6], which is ~1,000 GPUs dedicated to research and model development. |
| Redundancy and peak headroom | ~1.5x | Production systems need headroom for traffic spikes. Running at 50–70% average capacity means having 1.5x the minimum fleet. |

Multiplying these factors (using 2x for older models and 2x for multi-GPU): 223 × 2 × 1.5 × 2 × 2 × 1.1 × 1.5 ≈ 4,400 GPUs, with the training share already included through the 1.1 factor. That is still well under 10,000, but the multi-GPU factor alone could close most of the gap: if V8 uses 8 GPUs per job rather than the 4 assumed in this product, the total roughly doubles to ~8,800.

The reconciliation: with multi-GPU inference, older model versions, upscales, Relax mode, training, and operational headroom, 10,000 GPUs is not only reasonable — it is probably tight.
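To see how sensitive the reconciliation is to each guess, the multipliers can be kept in a dictionary and multiplied out. The values below are the ones used in the product above; the dictionary keys are hypothetical labels, and every factor is an estimate rather than a reported figure.

```python
import math

BASELINE_GPUS = 223  # naive inference-only estimate from Step 2

factors = {
    "relax_volume": 2.0,
    "upscales_and_variations": 1.5,
    "older_model_versions": 2.0,
    "multi_gpu_per_job": 2.0,   # 4 GPUs per job instead of the baseline 2
    "training_share": 1.1,
    "peak_headroom": 1.5,
}

fleet = BASELINE_GPUS * math.prod(factors.values())
print(f"estimated fleet: {fleet:,.0f} GPUs")

# Sensitivity: 8 GPUs per job instead of 4 doubles the multi-GPU factor.
what_if = dict(factors, multi_gpu_per_job=4.0)
fleet_8gpu = BASELINE_GPUS * math.prod(what_if.values())
print(f"with 8 GPUs per job: {fleet_8gpu:,.0f} GPUs")
```

Because the factors multiply, a 2x error in any single guess moves the answer by 2x, which is why the estimate should be read as a range.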

Step 2b: The Diurnal Pattern

Real traffic is not uniform. Midjourney's users are heavily concentrated in North America and Europe, which means demand follows a strong diurnal (24-hour) cycle. Understanding this pattern is critical for capacity planning.

A typical day looks roughly like this:

| Time (US Pacific) | Load | What happens |
|---|---|---|
| 2:00 AM – 8:00 AM | ~0.5x average | US sleeping. European morning starts. Relax mode jobs drain. |
| 8:00 AM – 12:00 PM | ~1.0x average | US waking up. Europe at peak work hours (not generating art). Steady state. |
| 12:00 PM – 6:00 PM | ~1.2x average | US working but sneaking in generations. Europe evening. Building toward peak. |
| 6:00 PM – 11:00 PM | ~2.5–3.0x average | Peak hours. US evening leisure time. This is when most art gets made. |
| 11:00 PM – 2:00 AM | ~1.5x average | Late-night creators. US trailing off, Asia waking up. |

The 3x peak-to-average ratio means the GPU fleet must be sized for peak, but pays for itself at average. During the 6-hour off-peak window (2-8 AM Pacific), ~4,500 GPU-hours are "spare." This is where Relax mode is brilliant: it fills those spare GPU-hours with unlimited free-tier generation, keeping utilization high even when paying demand drops.

Without Relax mode, those 4,500 GPU-hours would be wasted daily — at $3/GPU-hr, that is $13,500/day or $4.9 million per year of idle GPU cost. Relax mode turns waste into user goodwill and engagement.

Capacity planning insight. The diurnal pattern means you need enough GPUs for 3x average load, but you only earn revenue proportional to average load (Fast/Turbo users). Relax mode bridges this gap by monetizing off-peak capacity indirectly — Relax users become paying subscribers faster because they experience the product for free. This is a classic loss-leader strategy applied to GPU infrastructure.

Step 2c: Model Version Impact on Fleet Sizing

Not all Midjourney versions are equal in GPU requirements. Users can choose their model version, and many stick with older versions they are comfortable with. This creates a heterogeneous workload that complicates fleet management.

| Version | Approx. generation time | GPU-seconds per job (est.) | Architecture |
|---|---|---|---|
| V5 / V5.2 | 30–60 seconds | 60–120 | Likely U-Net based diffusion |
| V6 / V6.1 | 20–40 seconds | 40–80 | DiT with standard attention |
| V8 | <10 seconds[17] | 10–20 | Optimized DiT, Flash Attention, PyTorch[4] |

If 30% of users still use V5/V6 (a reasonable estimate for the transition period), the average GPU-seconds per job is not 20 (V8 optimal) but closer to 40. This doubles the GPU-hours calculation and explains part of the gap between our naive estimate (223 GPUs) and reality (~10,000 GPUs).
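The blended figure is a weighted average. A minimal sketch, assuming 70% of jobs cost 20 GPU-seconds (V8) and 30% cost about 80 GPU-seconds (an illustrative midpoint for the older versions, not a reported number):

```python
def blended_gpu_seconds(mix):
    """mix: list of (share_of_jobs, gpu_seconds_per_job); shares must sum to 1."""
    total_share = sum(share for share, _ in mix)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("job shares must sum to 1")
    return sum(share * cost for share, cost in mix)

# Assumed workload mix: (share of jobs, GPU-seconds per job).
mix = [(0.70, 20), (0.30, 80)]
blended = blended_gpu_seconds(mix)
print(f"blended cost: {blended:.0f} GPU-seconds per job")
```

Under those assumptions the average comes out a little under 40 GPU-seconds, in line with the "closer to 40" estimate.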

Fleet management must also handle model affinity: a GPU that has V8 weights loaded in memory should preferably serve V8 jobs, because switching to V6 weights requires unloading and reloading several gigabytes of parameters (several seconds of downtime). This is why the scheduler is not just a simple FIFO dequeue — it is a match-making system that pairs jobs with compatible GPU workers.
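A toy version of that match-making logic is sketched below. It is not Midjourney's scheduler: the `Worker` and `Job` classes, the `assign` function, and the 5-second reload penalty are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

RELOAD_PENALTY_S = 5.0  # assumed cost of swapping model weights on a GPU

@dataclass
class Worker:
    worker_id: str
    loaded_model: str
    busy: bool = False

@dataclass
class Job:
    job_id: str
    model: str

def pick_worker(job: Job, workers: List[Worker]) -> Optional[Worker]:
    """Prefer an idle worker that already holds the job's model weights."""
    idle = [w for w in workers if not w.busy]
    for worker in idle:
        if worker.loaded_model == job.model:
            return worker
    # No warm worker available: fall back to any idle one (or None).
    return idle[0] if idle else None

def assign(job: Job, workers: List[Worker]) -> Optional[Tuple[str, float]]:
    """Assign the job and return (worker_id, reload cost in seconds)."""
    worker = pick_worker(job, workers)
    if worker is None:
        return None
    reload_cost = 0.0 if worker.loaded_model == job.model else RELOAD_PENALTY_S
    worker.loaded_model = job.model
    worker.busy = True
    return worker.worker_id, reload_cost
```

A real scheduler would also weigh queue age, priority tier, and multi-GPU co-location; this sketch isolates only the affinity preference.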

Step 3: Storage

Each generated image is approximately 1–3 MB depending on resolution. V8 generates at up to 2K native resolution[17], so higher-resolution images are becoming the norm. Let us use 2 MB as the average for a single image. A 4-image grid might be stored as a single composite image (smaller than 4× due to JPEG compression of the grid) — estimate 4 MB per grid.

Storage/day = 625,000 grids × 4 MB = 2,500,000 MB ≈ 2.5 TB/day
Storage/month = 2.5 TB × 30 = 75 TB/month
Storage/year = 75 TB × 12 = 900 TB/year

And this is just new images. The total archive includes every image ever generated since launch. The 148+ TB figure on Discord's CDN alone[23] represents just the subset accessible through Discord links — the full corpus in cloud storage is likely several petabytes by now.

At Google Cloud Storage pricing of ~$0.02/GB/month for standard storage, 1 PB costs about $20,000/month. Not cheap, but dwarfed by GPU costs. Storage is not the bottleneck.
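The storage arithmetic, as a short sketch. It uses decimal units (1 TB = 1,000,000 MB) and 30-day months, matching the estimates above; `storage_growth` is a hypothetical helper.

```python
def storage_growth(grids_per_day, mb_per_grid, price_per_gb_month):
    """Return (TB/day, TB/year, monthly cost in dollars of storing 1 PB)."""
    tb_per_day = grids_per_day * mb_per_grid / 1_000_000
    tb_per_year = tb_per_day * 30 * 12  # 30-day months, as in the text
    cost_per_pb_month = 1_000_000 * price_per_gb_month  # 1 PB = 1,000,000 GB
    return tb_per_day, tb_per_year, cost_per_pb_month

tb_day, tb_year, pb_month_cost = storage_growth(625_000, 4, 0.02)
print(f"{tb_day} TB/day, {tb_year:,.0f} TB/year, ${pb_month_cost:,.0f} per PB-month")
```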

Step 4: Bandwidth

Every generated image must be delivered to the user. But it does not stop there — images are shared on social media, embedded in blogs, viewed by other users in Discord channels, and re-fetched when someone scrolls through a gallery. A CDN amplification factor of 3–5x is typical for viral visual content. Midjourney's content is especially viral because users share their best generations.

Bandwidth/day = 2.5 TB × 3 = 7.5 TB/day egress (conservative)

At $0.08/GB for cloud egress (Google Cloud standard pricing), that is:

Bandwidth cost/day = 7,500 GB × $0.08 = $600/day = $219,000/year

$219,000/year for bandwidth. The GPU fleet costs roughly $100M+/year. Bandwidth is a rounding error — about 0.2% of the GPU bill. This is typical for GPU-intensive workloads and explains why Midjourney does not need to obsess over CDN optimization. The compute dominates everything else by two orders of magnitude.

However, there is a hidden bandwidth cost: inter-GPU communication for multi-GPU inference. When a single job is split across 2–4 GPUs via sequence parallelism[16], those GPUs must exchange intermediate activations at every attention layer. For a DiT with, say, 32 attention layers and activations of ~100 MB per layer, that is 3.2 GB of inter-GPU transfer per forward pass. Counting only one pass per job, 625,000 jobs/day already gives 2 PB/day of internal network traffic, and a multi-step denoising loop repeats that exchange at every step. Either way, it is orders of magnitude more than the external CDN bandwidth. This is why GPU clusters need high-bandwidth interconnects (NVLink, InfiniBand) and why network topology matters for multi-GPU inference.
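Both bandwidth figures follow from simple products. The sketch below reproduces them; the layer count and per-layer activation size are the illustrative values from the text, and `passes_per_job` is a hypothetical knob for how many forward passes you choose to count.

```python
def egress_cost_per_year(tb_per_day, amplification, price_per_gb):
    """Annual CDN egress cost in dollars (decimal units: 1 TB = 1000 GB)."""
    gb_per_day = tb_per_day * amplification * 1000
    return gb_per_day * price_per_gb * 365

def inter_gpu_pb_per_day(jobs_per_day, layers, mb_per_layer, passes_per_job=1):
    """Internal GPU-to-GPU traffic in PB/day (1 PB = 1,000,000 GB)."""
    gb_per_job = layers * mb_per_layer * passes_per_job / 1000
    return jobs_per_day * gb_per_job / 1_000_000

egress = egress_cost_per_year(2.5, 3, 0.08)
internal = inter_gpu_pb_per_day(625_000, layers=32, mb_per_layer=100)
print(f"egress: ${egress:,.0f}/year, internal traffic: {internal:.1f} PB/day")
```

Raising `passes_per_job` to the number of denoising steps scales the internal traffic linearly, which only strengthens the case for fast interconnects.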

Step 4b: Latency Budget

In a system design interview, after estimating throughput and storage, you should present a latency budget: a breakdown of where time is spent in the end-to-end request path. This shows you understand the system at the operation level, not just the capacity level.

For a V8 Fast-mode request, the latency budget looks like this:

| Step | Time | What limits it |
|---|---|---|
| Discord webhook delivery | ~100ms | Discord's API latency |
| Gateway processing | ~50ms | Auth lookup, rate check, parse, moderate |
| Queue enqueue | ~10ms | Kafka/Redis write latency |
| Queue wait | 0–5,000ms | Queue depth, GPU availability |
| Job setup on GPU | ~200ms | Model weight verification, memory allocation |
| Text encoding (CLIP) | ~100ms | CLIP forward pass, once per job |
| Diffusion loop (25 steps) | ~7,000ms | DiT forward pass × 25, the bottleneck |
| VAE decode (latent → pixels) | ~200ms | Single forward pass through decoder |
| Post-moderation | ~100ms | Image classifier forward pass |
| Image encode + upload | ~300ms | JPEG encode, GCS upload |
| Discord message edit | ~200ms | Discord API call with attachment |
| Total | ~8–13 seconds | |

The diffusion loop (~7 seconds) is roughly 53–85% of the total time, depending on queue wait. Every other fixed step combined is about 1.3 seconds. This confirms what the cost analysis told us: inference dominates. Optimizing the diffusion loop (fewer steps, faster attention, multi-GPU parallelism) pays off far more than optimizing any other component.

Notice that the queue wait time is variable (0–5 seconds for Fast mode). During off-peak, a Fast job might start inference within 100ms of enqueue. During peak, the wait stretches to seconds. This variability is why the user-facing UX shows "Generating..." immediately — it masks the queue wait by conflating it with the inference time from the user's perspective.
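Summing the budget programmatically makes the diffusion loop's share explicit. The step names below are hypothetical labels; the millisecond values are the estimates from the table above.

```python
FIXED_STEPS_MS = {
    "discord_webhook": 100,
    "gateway": 50,
    "enqueue": 10,
    "job_setup": 200,
    "text_encoding": 100,
    "diffusion_loop": 7000,
    "vae_decode": 200,
    "post_moderation": 100,
    "encode_and_upload": 300,
    "discord_edit": 200,
}
QUEUE_WAIT_MS = (0, 5000)  # variable component: (off-peak, peak)

fixed_total = sum(FIXED_STEPS_MS.values())
best_case = fixed_total + QUEUE_WAIT_MS[0]
worst_case = fixed_total + QUEUE_WAIT_MS[1]
diffusion = FIXED_STEPS_MS["diffusion_loop"]
share_best = diffusion / best_case
share_worst = diffusion / worst_case

print(f"end-to-end: {best_case / 1000:.1f}-{worst_case / 1000:.1f} s")
print(f"diffusion share: {share_worst:.0%}-{share_best:.0%}")
print(f"all other fixed steps: {fixed_total - diffusion} ms")
```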

Step 5: Revenue Per GPU-Hour

This is the number that tells us whether the business model works. It is the single most important metric for any GPU-intensive consumer product.

GPU-hours/year = 10,000 GPUs × 8,760 hours = 87,600,000 GPU-hours
Revenue per GPU-hour = $500M ÷ 87.6M GPU-hours ≈ $5.70/GPU-hour

Now the cost side. An H100 in the cloud costs roughly $2–3/hour depending on commitment level. Add networking, storage, monitoring, and operations overhead. Estimate $4/GPU-hour all-in.

Gross margin per GPU-hour = $5.70 − $4.00 = $1.70/GPU-hour
Annual gross profit = $1.70 × 87.6M = ~$149M

$149M in gross profit is plenty to fund a 192-person team (even at $300K average compensation per person, that is $57.6M in payroll) plus office space, legal, and other overhead. The math works.

The math confirms profitability. This back-of-envelope confirms Midjourney's profitability claim[8]. The revenue per GPU-hour ($5.70) exceeds the all-in cost per GPU-hour (~$4.00) by ~40%. The business is fundamentally healthy because GPU utilization is high (images are generated 24/7 globally), the price per image ($0.01–0.10 depending on plan) is low enough that users generate freely, and the Relax mode backfill ensures no GPU-hour goes to waste.
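The unit economics reduce to three lines of arithmetic. A minimal sketch using the figures above; the $4/GPU-hour all-in cost is the text's estimate, and `unit_economics` is a hypothetical helper. The unrounded results differ slightly from the rounded $5.70 and $149M quoted in the text.

```python
HOURS_PER_YEAR = 8760

def unit_economics(annual_revenue, gpus, cost_per_gpu_hour):
    """Return (GPU-hours/year, revenue per GPU-hour, margin per GPU-hour, gross profit)."""
    gpu_hours = gpus * HOURS_PER_YEAR
    revenue_per_hour = annual_revenue / gpu_hours
    margin_per_hour = revenue_per_hour - cost_per_gpu_hour
    return gpu_hours, revenue_per_hour, margin_per_hour, margin_per_hour * gpu_hours

gpu_hours, rev_hr, margin_hr, gross_profit = unit_economics(500_000_000, 10_000, 4.00)
print(f"revenue/GPU-hour: ${rev_hr:.2f}, margin/GPU-hour: ${margin_hr:.2f}")
print(f"gross profit: ${gross_profit / 1e6:.1f}M")
```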

Step 6: The Cost Split — 90/10

One of the most revealing numbers: 90% of Midjourney's compute cost is inference, only 10% is training.[6] This is the inverse of what most people assume about AI companies.

Think about it. Training a model happens once per version (well, with checkpointing and iteration, but broadly once). V6 took 9 months to train from scratch[21], but that training produced a model that served hundreds of millions of images over the following year. The training cost is amortized across billions of images; the inference cost is paid per image, every image, forever.

Let us quantify. If 10% of compute cost goes to training on a $100M+ annual compute budget, that is ~$10M/year in training compute. That is 1,000 GPUs running for a year, or equivalently, a larger cluster of 4,000 GPUs running for 3 months of focused training per model version. The remaining $90M goes to running 9,000 GPUs 24/7 for inference.
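Converting a training budget into GPU-years, and GPU-years into a cluster size for a fixed training window, is a useful sanity check. The sketch assumes the $100M budget is spread evenly over 10,000 GPUs; the helper names are hypothetical.

```python
HOURS_PER_YEAR = 8760

def gpu_years(budget, cost_per_gpu_hour):
    """How many GPUs a budget can run continuously for one year."""
    return budget / cost_per_gpu_hour / HOURS_PER_YEAR

def cluster_size(total_gpu_years, months):
    """GPUs needed to spend the same GPU-years within a shorter window."""
    return total_gpu_years * 12 / months

# Blended rate implied by a $100M/year budget across 10,000 GPUs.
blended_rate = 100_000_000 / (10_000 * HOURS_PER_YEAR)
training_gpu_years = gpu_years(10_000_000, blended_rate)
three_month_cluster = cluster_size(training_gpu_years, months=3)

print(f"blended rate: ${blended_rate:.2f}/GPU-hour")
print(f"training: {training_gpu_years:,.0f} GPU-years")
print(f"3-month cluster: {three_month_cluster:,.0f} GPUs")
```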

This has profound architectural implications that show up in every design decision:

| Implication | Why it matters |
|---|---|
| Inference optimization is 9× more valuable than training optimization | A 10% speedup in inference saves $9M/year. A 10% speedup in training saves $1M/year. |
| GPU fleet is sized for inference, not training | ~9,000 of the 10,000 GPUs serve inference; ~1,000 are for training/research. |
| Model architecture is constrained by inference cost | A model that is 2× better but 3× slower at inference is a net loss financially. |
| Quantization and attention optimization are existential | Flash Attention[15], SageAttention (8-bit quantized)[15], and multi-GPU parallelism[16] directly impact the bottom line. Every 10% speedup is $9M/year. |
| V8's 5x speedup is transformative | If V8 is 5x faster at inference[17], the same GPU fleet can serve 5x more images, or serve the same volume with 80% fewer GPUs. |

Misconception: training is the expensive part. For research labs that train but do not serve (like some academic groups), training dominates the budget. For products at scale (Midjourney, ChatGPT, Stable Diffusion hosting), inference dominates by an order of magnitude. This is the 90/10 rule.[6] Every system design decision at Midjourney is ultimately an inference optimization decision. The interviewers know this. Candidates who focus on training optimization are solving the wrong problem.

The interactive calculator below lets you experiment with these numbers. Adjust the sliders to see how changes in daily volume, GPU time per image, fleet size, and revenue cascade through the entire estimation. The orange reference values show Midjourney's actual numbers for comparison.

Back-of-Envelope Calculator (interactive). Default slider values: daily images 2.5M, GPU-sec per job 20, fleet size 10,000 GPUs, revenue $500M/yr.

The Estimation Mindset

In a system design interview, the interviewer does not care if you get the exact number. They care about three things:

1. Can you identify the right starting facts? For Midjourney, the starting facts are: 2.5M images/day, ~10K GPUs, 10-60 seconds per generation, $500M revenue, 90/10 inference/training split. These are the "given" numbers that anchor everything.

2. Can you derive downstream numbers correctly? QPS from daily volume. GPU-hours from QPS and per-job time. Storage from image count and size. Bandwidth from storage with CDN amplification. Revenue per GPU-hour from total revenue and fleet size. Each derivation is a simple division or multiplication, but the chain of reasoning must be sound.

3. Can you identify the bottleneck? For Midjourney, the bottleneck is GPU compute. Not storage ($240K/year for a petabyte). Not bandwidth ($219K/year). Not network (87 QPS is trivial for a load balancer). GPU-hours are the scarce resource, and every architectural decision we study in the remaining chapters is ultimately an answer to one question: how do we maximize the value extracted from each GPU-hour?

The Complete Cost Stack

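Let us assemble the full annual cost picture to see where money actually goes. This is the kind of analysis a finance-aware staff engineer would present to leadership. Note that this stack prices compute from the ~$100M annual budget (about $1.14/GPU-hour blended), not the more conservative $4/GPU-hour all-in figure used in Step 5, so the margin here comes out higher.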

| Cost category | Annual estimate | % of total |
|---|---|---|
| GPU compute (inference) | ~$90M (9,000 GPUs × 8,760 hrs × $1.14/GPU-hr blended) | ~51% |
| GPU compute (training) | ~$10M (1,000 GPUs × 8,760 hrs × $1.14/GPU-hr) | ~6% |
| Storage (GCS) | ~$500K (growing with archive, ~2 PB total) | <1% |
| Bandwidth (egress) | ~$220K (7.5 TB/day × 365 × $0.08/GB) | <1% |
| Networking (inter-GPU) | ~$5M (high-bandwidth interconnects for multi-GPU inference) | ~3% |
| Personnel (~192 employees) | ~$60M (at ~$310K avg total comp) | ~34% |
| Other (office, legal, etc.) | ~$10M | ~6% |
| Total estimated costs | ~$175M | |
| Revenue | $500M[10] | |
| Estimated operating income | ~$325M (~65% margin) | |

A 65% operating margin with zero debt, zero VC obligations, and zero public-market pressure. This is an extraordinarily healthy business. The unit economics work because (1) GPU utilization is high thanks to Relax backfill, (2) the price per image covers the per-image GPU cost with margin, and (3) the team is tiny relative to revenue.
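Recomputing each line item's share directly from the dollar estimates is a quick way to check the stack. The dictionary keys are hypothetical labels; the values (in millions of dollars) are the annual estimates listed above.

```python
REVENUE_M = 500.0

COSTS_M = {
    "gpu_inference": 90.0,
    "gpu_training": 10.0,
    "storage": 0.5,
    "egress": 0.22,
    "networking": 5.0,
    "personnel": 60.0,
    "other": 10.0,
}

total_cost = sum(COSTS_M.values())
shares = {name: cost / total_cost for name, cost in COSTS_M.items()}
operating_income = REVENUE_M - total_cost
operating_margin = operating_income / REVENUE_M

for name, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{name:15s} {share:6.1%}")
print(f"total cost: ${total_cost:.1f}M, operating margin: {operating_margin:.0%}")
```

On these estimates, inference compute is about half of all spending and personnel about a third.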

Why the numbers matter in interviews. When you present this cost stack on a whiteboard, you demonstrate three things: (1) you understand where money goes in an AI system (GPU compute dominates infrastructure spending), (2) you can reason about business viability (revenue exceeds costs), and (3) you know what to optimize (inference cost, because it is the largest single line item at roughly half of the total). This is staff-level financial reasoning applied to system design.

Sensitivity Analysis: What If?

The back-of-envelope calculator above lets you explore scenarios, but here are the most interesting "what if" questions an interviewer might ask.

What if daily images doubled to 5M? GPU fleet would need to roughly double (from ~10K to ~20K). Revenue would likely increase proportionally (more paying users). The architecture does not fundamentally change — it is the same system, just scaled horizontally. This is a sign of good architecture: linear scaling with load.

What if inference became 5x faster (as V8 achieved)? Each GPU serves 5x more images per hour. Either (a) the same fleet serves 5x more users (revenue 5x), or (b) the fleet shrinks by 80% (cost 80% lower). Midjourney chose a mix: better quality (2K resolution costs more compute, partially offsetting the speedup) and faster user experience (lower latency drives higher engagement and thus more revenue).

What if GPU prices dropped 50% (next-gen hardware)? Compute cost drops from ~$100M to ~$50M, improving margins. But competitors benefit equally. The real advantage is in software optimization (Flash Attention, quantization, multi-GPU parallelism) that compounds on top of hardware improvements.

What if a free competitor emerged with equivalent quality? This is the existential threat. If open-source models (like Flux or future Stable Diffusion versions) reach Midjourney quality, the community moat and UX polish become the primary differentiators. The infrastructure would still need to serve the community, but pricing power would erode.

Summary: The Numbers That Matter

Here is the complete reference card of derived numbers. Memorize these for system design interviews — not for rote recitation, but because each number anchors a design decision.

| Metric | Value | Derivation |
|---|---|---|
| QPS (avg) | ~29 | 2.5M images ÷ 86,400s |
| QPS (peak) | ~87 | 29 × 3 (peak factor) |
| Jobs/day | ~625K | 2.5M images ÷ 4 per grid |
| GPU-hours/day (raw) | ~3,472 | 625K × 20 GPU-sec ÷ 3600 |
| Min GPUs (inference) | ~223 | 3,472 ÷ 0.65 util ÷ 24h |
| Actual GPUs | ~10,000[5] | Includes training, Relax, old models, multi-GPU, headroom |
| Storage/day | ~2.5 TB | 625K grids × 4 MB |
| Bandwidth/day | ~7.5 TB | 2.5 TB × 3 CDN amplification |
| Rev/GPU-hour | ~$5.70 | $500M ÷ (10K × 8,760h) |
| Cost/GPU-hour | ~$4.00 | Cloud cost + ops overhead |
| Gross margin/GPU-hr | ~$1.70 | $5.70 - $4.00 |
| Training : Inference cost | 10 : 90[6] | Training is amortized; inference runs 24/7 |

If Midjourney generates 2.5M images per day and each job produces 4 images requiring 20 GPU-seconds, approximately how many GPUs are needed for inference alone (assuming 65% utilization, 24/7 operation)?

Chapter 2: The Architecture

You have the numbers. 29 QPS average, 87 QPS peak, 10,000 GPUs, 2.5 TB of new images per day, 90% of cost in inference, and a 10–60 second generation time per job. Now we need a system that ties all of this together.

This chapter builds the map — the high-level component topology that every subsequent chapter will zoom into. If an interviewer asks you "design Midjourney," this is the diagram you draw on the whiteboard in the first five minutes. Get this right and the rest of the interview is filling in details. Get it wrong and no amount of detail saves you.

Misconception: this is a request-response system. The most common mistake in designing Midjourney is treating it like a REST API. It is not. A user sends a prompt and gets a response 10–60 seconds later. That is an asynchronous job processing system, not a synchronous API. This distinction fundamentally shapes every architectural choice. Holding an HTTP connection open for 60 seconds at the peak of ~22 jobs/second means roughly 1,300 long-lived concurrent connections just for the generation step, plus all the queueing and delivery overhead. The count itself is survivable; the timeouts are not. HTTP proxies commonly enforce 30–60 second timeouts, and load balancers struggle with connections that live for minutes. Instead, you accept the job immediately, process it asynchronously, and push the result back when ready.

The Six Layers

Every scalable system can be decomposed into layers of responsibility. Midjourney's architecture has six distinct layers, each solving a different problem at a different scale.

| Layer | Components | Responsibility | Scale Challenge |
|---|---|---|---|
| 1. Client | Discord Bot, Web App | Accept user input, display results, handle user interactions (upscale, vary, remix) | 1M concurrent users[11], two different frontend protocols |
| 2. Gateway | API Gateway, Load Balancer | Authentication, rate limiting, prompt parsing, content moderation, routing | Handle bursty traffic, protect expensive downstream resources, 3-second Discord deadline |
| 3. Queue | Job Queue with priority lanes | Buffer demand spikes, enforce priority contracts (Turbo/Fast/Relax), track job state | Fair scheduling across tiers[18], handle 625K jobs/day, maintain ordering guarantees |
| 4. Compute | GPU Inference Fleet, Job Scheduler | Run the diffusion model, manage multi-GPU jobs, report progress | 10,000 GPUs[5], multi-GPU parallelism[16], multiple model versions, thermal management |
| 5. Storage | Image Store (GCS), Metadata DB | Persist generated images, store job metadata, user history, prompt logs | 2.5 TB/day new data, petabytes total archive[23], fast write path from GPUs |
| 6. Delivery | CDN, Discord CDN | Serve images to users with low latency, handle viral sharing traffic | 7.5+ TB/day egress, global edge distribution, CDN cache invalidation |

Each layer has a different scaling profile. The Client layer scales with users (horizontally, trivially). The Gateway scales with requests (also horizontally). The Queue scales with job volume. The Compute layer scales with GPU count (the expensive dimension). Storage scales with total data volume (monotonically increasing, never shrinks). Delivery scales with read traffic (CDN handles this naturally).

The cost distribution is wildly uneven: the Compute layer alone accounts for ~90% of infrastructure cost.[6] The other five layers combined are a rounding error. This means the architecture is effectively a machine for feeding jobs to GPUs as efficiently as possible. Every other layer exists to serve the Compute layer.

Layer Dependencies and Failure Isolation

The six layers have carefully managed dependencies. Understanding these dependencies tells you what can fail independently and what cascades.

| Layer | Depends on | Independent of |
|---|---|---|
| Client | Discord API (for Discord bot), Gateway (for web app) | Queue, Compute, Storage, Delivery |
| Gateway | Auth DB (user lookup), Queue (to enqueue) | Compute, Storage, Delivery |
| Queue | Persistent storage (durability) | Client, Gateway, Delivery |
| Compute | Queue (to dequeue), Storage (to upload) | Client, Gateway, Delivery |
| Storage | GCS (availability) | Client, Gateway, Queue |
| Delivery | Storage (to fetch images), Client (to push results) | Gateway, Queue, Compute |

The key insight is that the Queue decouples the acceptance path from the processing path. If the Compute layer goes down entirely (a catastrophic scenario), the Gateway and Queue keep accepting jobs. Users see "Your image is being generated" and wait longer, but they do not get errors. When Compute recovers, the queued jobs drain automatically. This decoupling is the entire reason the async architecture exists.

Similarly, if the Client layer (Discord) has an outage, the Compute layer keeps processing jobs. Results are stored but cannot be delivered until Discord recovers. When it does, the Delivery layer retries pushing the stored results. No work is lost.

This failure isolation is not accidental. It emerges from two design principles that apply to all distributed systems:

1. Never couple fast paths to slow paths. The gateway's response time (milliseconds) must never depend on the GPU's processing time (seconds). The queue is the buffer that decouples them. If the GPU is slow, the queue grows, but the gateway stays fast.

2. Store the work, not the connection. The job's state lives in the queue (durable storage), not in an HTTP connection (ephemeral). If any component crashes, the job persists and can be recovered. This is the fundamental difference between a message queue architecture and a connection-based architecture.

The Architecture Diagram

The interactive diagram below shows all six layers and the data flow between them. This is the diagram you would draw on a whiteboard. Click any layer to see its detailed description — the technology choices, the scale constraints, and the key design decisions.

Architecture Topology (interactive diagram of the six layers, with a Trace Request animation that follows a single request through all of them).

Data Flow: The Full Path

Let us trace a single request through every layer, from the moment a user types /imagine to the moment they see their image. This nine-step path takes 10–60 seconds end-to-end, but most of that time is spent in a single step (GPU inference). Understanding where time is spent tells you where to optimize.

1. User → Discord/Web
User types /imagine cyberpunk cityscape --ar 16:9 --v 8. The Discord client sends this as an Interaction to Discord's API, which forwards it to Midjourney's registered bot endpoint via webhook. On the web app, it goes directly to Midjourney's API gateway via HTTPS.
2. Bot/Gateway → Validation
The request is authenticated (Discord user ID or web session token), rate-limited (check concurrency: max 3–15 active jobs[19], max 10 queued), and the prompt is parsed for parameters. This takes <50ms.
3. Moderation → Text Filter
Pre-generation text filter scans the prompt for banned content[20]. If flagged, the request is rejected before consuming any GPU time. This saves ~174 GPU-hours per day (about 5% of the 3,472 raw GPU-hours/day estimated in Chapter 1). Takes <10ms.
4. Gateway → Job Queue
The validated prompt is wrapped into a job object (prompt text, parsed parameters, user ID, model version, subscription tier, priority lane, callback target) and enqueued. The user receives an immediate acknowledgment: "Your image is being generated..." Takes <10ms.
5. Queue → GPU Scheduler
The scheduler dequeues jobs by priority (Turbo > Fast > Relax) and assigns them to available GPU worker(s). It must find the right worker(s) — one with the correct model version loaded, sufficient memory, and healthy status. If multi-GPU inference is needed[16], it must co-locate the workers. Wait time: 0–600 seconds depending on priority and queue depth.
6. GPU Worker → Inference
The worker encodes the prompt via the text encoder (CLIP-family), generates 4 initial noise tensors (one per grid image), and runs the diffusion transformer's denoising loop (20–50 steps). For V8, this takes <10 seconds for a 4-image grid at up to 2K resolution[17], using Flash Attention and SageAttention[15].
7. Post-Processing → Image Moderation
The generated images pass through a post-generation image classifier[20]. This catches visually problematic content that the text filter could not predict. Flagged images are replaced or the entire grid is regenerated. Takes ~100ms.
8. Storage → CDN
The 4-image grid is JPEG-encoded and uploaded to cloud object storage (likely GCS, given the Google Cloud partnership[1]). A CDN edge caches the image for fast global delivery. The CDN URL is generated. Takes ~200ms.
9. Delivery → User
The image URL is pushed back to the user. For Discord: the bot edits its acknowledgment message to embed the image as an attachment, with interactive buttons (U1-U4 for upscale, V1-V4 for variations, refresh for reroll). For web app: WebSocket push updates the gallery view. Takes ~100ms.
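Step 5's priority dequeue can be sketched with a binary heap. This is a minimal illustration of strict Turbo > Fast > Relax ordering with first-in-first-out behavior inside each tier; the class and method names are hypothetical, and nothing here is confirmed as Midjourney's implementation.

```python
import heapq
import itertools

PRIORITY = {"turbo": 0, "fast": 1, "relax": 2}  # lower number is served first

class PriorityJobQueue:
    def __init__(self):
        self._heap = []
        # Monotonic counter breaks ties so equal-priority jobs stay FIFO.
        self._counter = itertools.count()

    def enqueue(self, job_id, tier):
        heapq.heappush(self._heap, (PRIORITY[tier], next(self._counter), job_id))

    def dequeue(self):
        """Pop the highest-priority, oldest job, or None if the queue is empty."""
        if not self._heap:
            return None
        _, _, job_id = heapq.heappop(self._heap)
        return job_id

queue = PriorityJobQueue()
for job_id, tier in [("r1", "relax"), ("f1", "fast"), ("t1", "turbo"), ("f2", "fast")]:
    queue.enqueue(job_id, tier)
order = [queue.dequeue() for _ in range(4)]
print(order)
```

Strict priority can starve Relax jobs under sustained load, so a production scheduler would likely add aging or per-tier capacity shares.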

Time Budget Breakdown

Let us add up where time is actually spent for a Fast mode V8 request:

| Step | Time | % of total |
|---|---|---|
| 1-4. Ingestion (auth, parse, moderate, enqueue) | <100ms | ~1% |
| 5. Queue wait (Fast mode) | 0–5 seconds | 0–37% |
| 6. GPU inference | ~8 seconds | ~60–95% |
| 7. Post-moderation | ~100ms | ~1% |
| 8. Storage upload | ~200ms | ~2% |
| 9. Delivery | ~100ms | ~1% |
| Total | ~8.5–13.5 seconds | |

GPU inference dominates. Apart from queue wait, everything else combined takes under a second. This confirms our Chapter 1 finding: GPU compute is the bottleneck. All architectural optimization should focus on either reducing inference time (model optimization, quantization, Flash Attention) or maximizing GPU utilization (queue scheduling, Relax backfill, multi-GPU parallelism).

The Job State Machine

Every job in the system moves through a well-defined state machine. Understanding this state machine is essential for designing the queue, the scheduler, and the delivery system.

RECEIVED
Gateway has accepted the request. Prompt is parsed, user is authenticated, moderation passed. Job object created but not yet queued.
QUEUED
Job is in the priority queue, waiting for an available GPU worker. User sees "Your image is being generated..." Priority determines position: Turbo > Fast > Relax.
PROCESSING
A GPU worker has claimed the job and is running inference. A heartbeat signal confirms the worker is alive. If the heartbeat stops (GPU failure), the job returns to QUEUED. Progress updates are sent periodically.
UPLOADING
Inference complete. The image is being encoded and uploaded to storage. Post-moderation classifier is running. If moderation fails, job returns to QUEUED for regeneration (new seed).
COMPLETED
Image is stored, CDN URL is generated, result has been delivered to the user. Job metadata is archived. User's active job count decrements.

Two transitions deserve special attention:

PROCESSING → QUEUED (failure recovery). If the GPU worker dies mid-inference (hardware failure, OOM, thermal shutdown), the heartbeat times out and the job automatically re-enters the queue. The seed ensures the re-run produces identical images (deterministic generation). The user experiences a delay but not a failure. This is the key resilience mechanism.

UPLOADING → QUEUED (moderation rejection). If the post-generation image classifier flags the output, the job is re-queued with a new random seed. The system tries to generate a compliant image from the same prompt. If it fails multiple times (configurable, perhaps 3 attempts), the user receives a rejection message.

State machine thinking. In system design interviews, drawing the state machine for a job or request is one of the highest-signal things you can do. It shows you understand not just the happy path but also the failure paths, the retry logic, and the edge cases. Every async system has a state machine — make it explicit.

Why Async?

Let us do the math on why a synchronous request-response architecture would fail.

At peak, the system handles roughly 22 jobs/second (87 QPS ÷ 4 images per job), each with an average processing time of 30 seconds (blending Fast and Relax modes). If each request held a connection open for the full duration:

Concurrent connections = 22 jobs/sec × 30 sec = 660 connections

That is actually manageable for a modern load balancer. But the real problems are deeper.

Timeout chain. HTTP clients, proxies, CDNs, and load balancers each enforce timeout limits. A typical chain: browser (120s) → Cloudflare (100s) → nginx (60s) → application (30s). A Relax mode job that takes 5 minutes exceeds every timeout in that chain and would be cut off at the first hop to give up. You would need to configure every component in the path with 10-minute timeouts, which creates resource exhaustion risks.

Connection resource waste. Holding an HTTP connection open for 30 seconds ties up a file descriptor, a thread (or coroutine), and memory on both the client and server. At 660 concurrent connections this is fine, but if you add long-tail Relax jobs (5+ minutes), the numbers balloon.

Retry complexity. If a connection drops mid-generation (network blip, client timeout, server restart), you lose the job. With async, the job exists independently of the connection — a dropped connection just means the delivery step retries.

The async model is simpler and more robust: accept the job in <100ms, return immediately with a job ID, process asynchronously, and push the result when ready. This decouples the frontend's response time from the backend's processing time. The frontend always responds in milliseconds, regardless of how long the GPU takes.
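The accept, process, and push phases can be sketched with nothing more than an in-process queue and a worker thread. This is an illustrative sketch under simplifying assumptions: `submit`, `worker`, `generate`, and `notify` are hypothetical names, and a production system would use a durable queue rather than `queue.Queue`.

```python
import queue
import threading
import uuid

jobs = queue.Queue()


def submit(prompt):
    """Accept phase: enqueue and return a job ID right away."""
    job_id = uuid.uuid4().hex
    jobs.put((job_id, prompt))
    return job_id


def worker(generate, notify):
    """Process phase: runs independently of any client connection."""
    while True:
        item = jobs.get()
        if item is None:          # sentinel: shut the worker down
            break
        job_id, prompt = item
        image = generate(prompt)  # the slow step (seconds to minutes)
        notify(job_id, image)     # push phase: message edit, WebSocket, webhook
        jobs.task_done()
```

The caller of `submit` never waits on `generate`; a dropped client connection after `submit` returns has no effect on the job itself.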

The async job pattern. In any system design interview involving long-running computation (image generation, video rendering, ML training, data pipelines, PDF generation, report building), the answer is almost always async job processing. Three rules: (1) the acceptance must be fast (<1 second), (2) the processing can take as long as it needs, (3) the delivery should be push-based (webhook, WebSocket, SSE) rather than poll-based. Midjourney is a textbook implementation of this pattern.

Priority Lanes

The queue is not a simple FIFO. It has three priority lanes that map directly to the subscription tiers and create a remarkably elegant resource allocation scheme.

| Lane | Priority | SLA | Business Model |
|---|---|---|---|
| Turbo | 1 (highest) | <10 seconds | 2× GPU-hour rate: a premium for speed. Users who value time over money. |
| Fast | 2 | 10–30 seconds | Standard GPU-hour rate (~$4/hr)[18]. The default paid experience. |
| Relax | 3 (lowest) | Best-effort (30s – 10min) | Unlimited, included in subscription. Fills GPU idle time for zero marginal cost. |

Relax mode is a masterstroke of resource economics. It implements a concept from operations research called yield management (the same principle airlines use with standby passengers).

When Fast and Turbo demand is high (peak hours), Relax jobs wait. The GPUs serve the priority lanes first. When demand drops (off-peak), Relax jobs fill the idle GPU time. The result: capacity that would otherwise sit idle off-peak does useful work, so far fewer GPU-hours are wasted. Fast and Turbo users get priority. Spare capacity goes to patient subscribers at almost no marginal cost, building goodwill and engagement.

From a queuing theory perspective, Relax mode converts Midjourney's GPU fleet from a loss system (where excess demand is dropped) into a delay system (where excess demand is buffered). This increases effective throughput without adding hardware.
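A strict-priority queue with FIFO ordering inside each lane can be sketched with a binary heap. This is a minimal sketch, assuming lane names as lowercase strings; the class name `PriorityJobQueue` is hypothetical, and the real scheduler is not public.

```python
import heapq
import itertools

LANE_PRIORITY = {"turbo": 1, "fast": 2, "relax": 3}  # lower number is served first


class PriorityJobQueue:
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps FIFO order within a lane

    def push(self, lane, job_id):
        heapq.heappush(self._heap, (LANE_PRIORITY[lane], next(self._counter), job_id))

    def pop(self):
        """Return the next job for a free GPU worker, or None if the queue is empty."""
        if not self._heap:
            return None
        _, _, job_id = heapq.heappop(self._heap)
        return job_id

    def __len__(self):
        return len(self._heap)
```

Strict priority like this can starve Relax jobs during a long peak. A production scheduler would likely add aging or a minimum Relax share, which this sketch leaves out.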

The 4-Image Grid: An Architectural Choice

We mentioned this in Chapter 0, but it is worth emphasizing the full architectural dimension. Generating 4 images per job has four system-level benefits.

1. Amortized overhead. Queue scheduling, prompt encoding, job setup, result delivery, and moderation happen once per job. By producing 4 images per job, the overhead-per-image drops to 25%. The text encoder (CLIP-family) runs once; only the initial noise differs between the 4 images.

2. Batch efficiency. Modern GPUs achieve peak TFLOPS with larger batch sizes. A batch of 4 images through the DiT backbone saturates the GPU's streaming multiprocessors more efficiently than a batch of 1. Empirically, generating 4 images takes roughly 1.5–2× the wall-clock time of 1 image, not 4×.

3. Reduced re-rolls. Four options means users are more likely to find an acceptable result. This reduces re-roll requests by perhaps 2–3×, directly reducing GPU load per "satisfying result."

4. Engagement driver. The upscale (U1-U4) and variation (V1-V4) buttons on the 4-image grid encourage further interaction. Each upscale or variation is another job, driving more engagement and GPU time consumption — and for paid users, more revenue.

Latent batch vs. pixel batch. In a DiT architecture[14], the 4 images are generated as 4 independent noise samples in a single batch through the transformer backbone. The text encoder runs once (same prompt for all 4). The denoising loop processes all 4 latents simultaneously. Only the initial random noise (controlled by the seed) differs between the 4 images. This makes the 4-image grid nearly as efficient as a single image in terms of text encoder cost, and only ~2x the cost in terms of denoising compute.
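The amortization and batching arguments combine into simple arithmetic. The sketch below works through it; the 10 seconds of GPU time for a single image and the 1 second of fixed per-job overhead are assumed round numbers for illustration, not measured values.

```python
def per_image_cost(batch_size, batch_time_ratio, single_image_time, overhead):
    """Seconds of work attributed to each image in a batch.

    batch_time_ratio: wall-clock time of the whole batch relative to one image
    overhead: fixed per-job cost (queueing, text encoding, delivery)
    """
    return (overhead + batch_time_ratio * single_image_time) / batch_size


# Assumed figures: 10 s GPU time for one image, 1 s fixed overhead per job.
single = per_image_cost(1, 1.0, 10.0, 1.0)      # one image per job
grid_best = per_image_cost(4, 1.5, 10.0, 1.0)   # grid at 1.5x single-image time
grid_worst = per_image_cost(4, 2.0, 10.0, 1.0)  # grid at 2x single-image time
```

Under these assumptions each grid image costs between roughly a third and a half of a standalone image, before counting the reduction in re-rolls.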

Technology Choices

Midjourney keeps most infrastructure details private, but we can infer the likely technology stack from confirmed facts, GitHub activity, and standard practices for GPU-intensive workloads at this scale.

| Component | Likely Technology | Evidence / Reasoning |
|---|---|---|
| Cloud provider | Google Cloud Platform | Confirmed partnership[1] |
| Training (V4-V6) | JAX on TPU v4 | Confirmed by Holz in GCP PR[2] |
| Training (V8+) | PyTorch on NVIDIA GPUs | Confirmed rewrite[4] |
| Inference framework | PyTorch + custom CUDA kernels | GPU inference confirmed[3]; Flash Attention + SageAttention forks[15] |
| Model architecture | Diffusion Transformer (DiT) | xDiT fork on Midjourney GitHub[14] |
| Multi-GPU inference | Sequence parallelism / PipeFusion | xDiT implements these[16] |
| Attention optimization | Flash Attention 2 + SageAttention (INT8) | Both forked on Midjourney GitHub[15] |
| Object storage | Google Cloud Storage (GCS) | Inferred from GCP partnership |
| Container orchestration | Kubernetes (GKE) | Standard for GPU fleet management on GCP |
| Job queue | Kafka or Redis Streams | Standard for high-throughput priority job queues |
| Inference GPUs | A100 / H100 | Standard for DiT inference at this scale |

Component Interaction Patterns

The six layers interact through well-defined patterns that repeat in every large-scale distributed system. Recognizing these patterns is what separates a staff architect from a senior engineer — you see them once here, and you apply them everywhere.

Fan-out at the queue. A single job from the queue may fan out to multiple GPU workers (for multi-GPU inference via sequence parallelism[16]). The scheduler must coordinate this fan-out, track partial completion, handle worker failures, and collect results from all participating GPUs before declaring the job complete.

Write-behind to storage. The GPU worker writes the generated image to object storage asynchronously. The CDN URL is returned to the user before the image is fully replicated across all storage regions. This is acceptable because the CDN will fetch-on-miss from the primary region, and the image only needs to be available within seconds, not milliseconds.

Event-driven delivery. The result is pushed back to the user via an event (Discord bot editing its message, or WebSocket push to the web app). There is no polling loop — the system notifies the user proactively. This eliminates the thundering-herd problem of millions of clients polling for completion.

Backpressure via queue depth. When the GPU fleet is overwhelmed, the queue grows. The queue depth becomes the backpressure signal: the gateway can reject new Relax jobs when the Relax queue exceeds a threshold, and it can increase estimated wait times shown to users. The queue is the pressure valve that prevents GPU overload.
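Queue-depth backpressure at the gateway can be sketched as a single admission function. The threshold of 50,000 queued Relax jobs is a hypothetical value, and `drain_rate` reuses the 22 jobs/second peak figure from earlier in this chapter as a stand-in for fleet throughput; both are assumptions.

```python
LANES = ["turbo", "fast", "relax"]  # highest priority first


def admit(lane, queue_depth, drain_rate=22.0, relax_limit=50_000):
    """Return (accepted, estimated_wait_seconds) for a new job.

    queue_depth: dict mapping lane name to the number of queued jobs
    drain_rate: jobs per second the fleet completes (assumed value)
    relax_limit: Relax queue depth at which new Relax jobs are refused (assumed)
    """
    if lane == "relax" and queue_depth["relax"] >= relax_limit:
        return False, None  # shed load at the gateway before it reaches the GPUs
    # Jobs ahead of this one: its own lane plus every higher-priority lane.
    ahead = sum(queue_depth[name] for name in LANES[: LANES.index(lane) + 1])
    return True, ahead / drain_rate
```

Only the lowest-priority lane is ever refused here, which matches the idea that Relax is the pressure valve while Turbo and Fast keep their SLA.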

Failure Modes and Resilience

At 10,000 GPUs, failures are not exceptional events — they are a constant reality. Understanding how the architecture handles failures tells you whether the system was designed by someone who has actually operated infrastructure at scale.

| Failure | Frequency | Impact | Recovery |
|---|---|---|---|
| Single GPU failure | ~1 per hour (at 10K GPUs) | One in-flight job fails | Job is re-queued automatically. User sees a slight delay, not an error. |
| Worker node failure | Several per day | Multiple GPUs and jobs lost | Kubernetes reschedules pods. Jobs on those GPUs re-enter the queue. Scheduler avoids the failed node. |
| Queue system failure | Rare (assuming a replicated queue such as Kafka) | No new jobs accepted | Gateway returns "temporarily unavailable." In-flight GPU jobs complete normally. Queue replays from last checkpoint. |
| Storage write failure | Occasional | Generated image cannot be saved | Retry with exponential backoff. If persistent, alert ops. Image is buffered in GPU memory (briefly). |
| Discord API outage | Several per year | Cannot deliver results to Discord users | Results are stored. Delivery is retried when Discord recovers. Web app users unaffected. |
| Full GPU fleet saturation | Peak hours | Queue grows, Relax wait increases | Not a failure; this is by design. Relax absorbs the pressure. Turbo and Fast still meet SLA. |

The most important design principle here is idempotent job processing. Because a GPU failure can kill a job at any point during inference, every job must be safely re-runnable. The queue tracks job state (queued → processing → completed/failed), and any job that stays in "processing" too long (heartbeat timeout) is automatically re-queued. The seed ensures that a re-run produces the same images (deterministic generation), so the user does not notice the failure.
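The heartbeat-timeout rule can be sketched as a periodic sweep over jobs in the processing state. This is an illustrative sketch: the 15-second timeout and the in-memory `processing` dict and `queue` list are assumptions standing in for whatever store the real scheduler uses.

```python
import time

HEARTBEAT_TIMEOUT = 15.0  # seconds; hypothetical value


def requeue_stalled(processing, queue, now=None, timeout=HEARTBEAT_TIMEOUT):
    """Move jobs whose worker stopped sending heartbeats back to the queue.

    processing: dict of job_id -> timestamp of the last heartbeat
    queue: list of job_ids waiting for a worker
    Returns the list of job_ids that were re-queued.
    """
    now = time.monotonic() if now is None else now
    stalled = [job_id for job_id, last_beat in processing.items()
               if now - last_beat > timeout]
    for job_id in stalled:
        del processing[job_id]
        queue.append(job_id)  # same job spec and seed, so the re-run is a safe repeat
    return stalled
```

Because the job spec (including the seed) is unchanged, running the sweep twice or re-running a job that had nearly finished is harmless, which is the practical meaning of idempotent processing here.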

Design for failure, not against it. The difference between a system that "handles failures" and a system that "expects failures" is fundamental. Midjourney's architecture expects GPU failures constantly. Jobs are stateless (all state is in the queue). Workers are cattle, not pets (any GPU can run any job). Delivery is decoupled from processing (a GPU failure does not break the user's connection). This is the Netflix/Google philosophy of reliability: make components disposable, make the system resilient.

Comparing to Similar Systems

Midjourney's architecture pattern — async job queue with GPU compute backend — is not unique. Recognizing the pattern family helps you transfer knowledge from other systems you may have studied.

| System | Pattern similarity | Key difference |
|---|---|---|
| YouTube transcoding | Upload → queue → GPU transcode → storage → CDN | Longer jobs (minutes), fewer QPS, larger output files |
| Render farms (Pixar) | Scene → queue → GPU render → storage → review | Hours-long jobs, internal only, no real-time latency requirement |
| ChatGPT inference | Prompt → queue → GPU inference → streaming response | Streaming output (token-by-token), not batch result. Much higher QPS but shorter jobs. |
| Spotify Wrapped | Request → queue → compute → render → deliver | Once-per-year burst, pre-computed, much simpler model |
| CI/CD systems (GitHub Actions) | Trigger → queue → worker pool → artifacts → notify | CPU-bound, minutes-long, highly variable compute requirements |

The async job processing pattern is one of the most common in distributed systems. If you learn it deeply through Midjourney, you can apply it to any of these systems. The only variables are: job duration, compute type (GPU vs CPU), output size, latency expectations, and priority semantics.

Whiteboard Summary

If you had to draw Midjourney's architecture in 2 minutes on a whiteboard, here is the minimum viable diagram. Six boxes, five arrows, five annotations:

Discord / Web App
User-facing thin clients
↓ (auth, rate limit, moderate, parse)
API Gateway
The bouncer — reject bad requests fast
↓ (enqueue with priority)
Priority Job Queue
Turbo > Fast > Relax — yield management
↓ (schedule to available GPU workers)
GPU Fleet (10K GPUs)
DiT inference, multi-GPU parallelism, Flash Attention
↓ (upload image, push result)
Storage + CDN → User
GCS for persistence, CDN for delivery, event-driven push

Then annotate: "Async everywhere. Job acceptance is <100ms. Processing is 10-60 seconds. Delivery is push-based (message edit or WebSocket). GPU compute is 90% of cost. Relax mode fills idle capacity."

That is six boxes, five arrows, and five annotations. It takes 90 seconds to draw. It communicates the entire system at the right level of abstraction. And it gives the interviewer six entry points to drill into: "Tell me more about the Queue," "How does the GPU Scheduler work," "What happens when a GPU fails?" Each entry point leads to a 5-minute deep dive that you are now prepared for.

What Makes This Architecture Good

Before moving on, let us explicitly name the design qualities that make this architecture work at Midjourney's scale. These are the qualities an interviewer is evaluating when they ask "why did you design it this way?"

1. Clear separation of concerns. Each layer has one job. The Gateway does not generate images. The GPU fleet does not authenticate users. This makes each layer independently scalable, testable, and deployable.

2. Failure isolation. A GPU failure does not crash the Gateway. A Discord outage does not stop inference. The Queue is the buffer that decouples fast components from slow ones.

3. Elastic capacity. The system handles variable load through queueing (Relax absorbs excess demand) and priority scheduling (paying users get priority when capacity is scarce). Only the GPU fleet has to be sized for peak priority demand, and Relax fills the off-peak gap.

4. Cost-proportional pricing. Revenue scales with the most expensive resource (GPU compute). The $4/GPU-hour pricing[18] directly ties user spending to infrastructure cost. The business model and the architecture reinforce each other.

5. Observable. Every job has a state (QUEUED, PROCESSING, COMPLETED). Every GPU has a heartbeat. Every queue has a depth metric. The system's health is observable at every layer, enabling proactive capacity management and fast incident response.

These five qualities are not unique to Midjourney. They are the hallmarks of any well-designed distributed system. YouTube's transcoding pipeline, Uber's ride-matching system, and Stripe's payment processing all exhibit the same qualities. Learning them here means recognizing them — and applying them — everywhere.

What the Next Chapters Cover

We now have the complete topology. The remaining chapters zoom into each component with implementation-level detail:

| Chapter | Component | Key question it answers |
|---|---|---|
| Ch 3 | Discord Bot & API Gateway | How does the frontend connect to the backend? Why was Discord strategic? |
| Ch 4 | Job Queue & Priority Scheduler | How do you fairly schedule Turbo/Fast/Relax with GPU constraints? |
| Ch 5 | Diffusion Transformer Model | What is a DiT? How does text become image? What makes V8 5x faster? |
| Ch 6 | GPU Fleet & Inference Engine | How do you manage 10K GPUs? Multi-GPU parallelism? Model versioning? |
| Ch 7 | Content Moderation Pipeline | How do you moderate 2.5M images/day? Pre-gen + post-gen defense in depth. |
| Ch 8 | Storage & CDN | How do you store petabytes of images? Serve viral content globally? |

Chapter 3: Discord Bot & API Gateway

In most system design problems, the frontend is an afterthought — "assume we have a web app." Midjourney is different. The Discord bot is not just a frontend; it is the most important architectural decision the company ever made. It eliminated entire categories of infrastructure that would have taken months to build, and it created a viral growth engine that no marketing budget could replicate.[13]

In this chapter, we will trace the exact path a message takes from the moment a user types /imagine to the moment the job enters the queue. Every step along this path is a design decision worth understanding, because each one has implications for latency, reliability, scalability, and cost.

Discord's Interaction API

Discord bots receive user commands through the Interactions API. This is a webhook-based system where Discord POSTs payloads to a registered endpoint when users invoke slash commands. Understanding the protocol is essential because it imposes hard constraints on Midjourney's architecture.

When a user types a slash command like /imagine, here is what happens at the protocol level:

1. User types /imagine
Discord's client shows autocomplete for registered slash commands. The user selects /imagine and types their prompt in the parameter field. This happens entirely within Discord's client — Midjourney's servers are not involved yet.
2. Discord API receives Interaction
Discord's server packages the command into an Interaction object — a JSON payload containing user ID, guild (server) ID, channel ID, command name, and parameters. Discord POSTs this payload to Midjourney's registered webhook URL.
3. Midjourney acknowledges (3 seconds)
Discord requires a response within 3 seconds or the interaction is marked as failed and the user sees "This interaction failed." Midjourney must acknowledge immediately with a DEFERRED_CHANNEL_MESSAGE_WITH_SOURCE response (type 5), then follow up later with the actual result.
4. Bot sends "Queued" message
Using the interaction's follow-up endpoint, the bot creates a message in the channel: "Your image is being generated..." with the prompt text echoed back. This gives the user immediate feedback.
5. Progress updates (message edits)
As the GPU generates the image, the bot edits its own message via Discord's PATCH /channels/{id}/messages/{id} endpoint. Each edit updates a percentage counter and may include a low-resolution preview (a blurry version that sharpens over time).
6. Final image (message edit + buttons)
When generation completes, the bot edits its message one final time to include the 4-image grid as a Discord attachment, plus interactive buttons: U1-U4 (upscale each quadrant), V1-V4 (create variations), and a reroll button. Each button press triggers a new Interaction, starting the cycle again.
The 3-second constraint is architectural destiny. Discord's requirement that bots respond within 3 seconds is one of the most underappreciated constraints in Midjourney's design. It makes synchronous processing mathematically impossible — you cannot run a 10-second diffusion model in 3 seconds. You must respond immediately with a deferred acknowledgment and then push the result later. Discord's API does not merely suggest good architecture; it enforces it. If Midjourney had built a web app first, they might have tried (and failed with) a synchronous approach before discovering the async pattern. Discord forced them to get it right from day one.
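The acknowledge-then-defer step can be sketched as a pure function that maps an incoming interaction payload to the HTTP response body. The numeric type values are Discord's documented ones; the function name `handle_interaction`, the `enqueue` callback, and the shape of the queued job are assumptions. Signature verification of the request (which Discord requires for webhook endpoints) is omitted for brevity.

```python
# Discord interaction types (incoming)
PING = 1
APPLICATION_COMMAND = 2

# Discord interaction callback types (outgoing)
PONG = 1
CHANNEL_MESSAGE_WITH_SOURCE = 4
DEFERRED_CHANNEL_MESSAGE_WITH_SOURCE = 5


def handle_interaction(payload, enqueue):
    """Build the response body for one interaction webhook call.

    enqueue: hypothetical callback that hands the job to the queue.
    """
    if payload["type"] == PING:
        return {"type": PONG}
    if payload["type"] == APPLICATION_COMMAND and payload["data"]["name"] == "imagine":
        prompt = payload["data"]["options"][0]["value"]
        user_id = payload["member"]["user"]["id"]
        # Only cheap work happens inside the 3-second window; the GPU job runs later.
        enqueue({"user_id": user_id, "prompt": prompt, "token": payload.get("token")})
        return {"type": DEFERRED_CHANNEL_MESSAGE_WITH_SOURCE}
    return {"type": CHANNEL_MESSAGE_WITH_SOURCE, "data": {"content": "Unknown command."}}
```

The interaction token is kept with the job because the later follow-up and edit calls are addressed with it.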

The Interaction Object

Let us look at what Discord actually sends to Midjourney's webhook. Understanding the data model helps you see what information is available for routing, authentication, and billing decisions.

json
{
  "id": "1234567890",
  "type": 2,                        // APPLICATION_COMMAND
  "application_id": "936929...",  // Midjourney's bot ID
  "guild_id": "662267...",        // Which server
  "channel_id": "995432...",      // Which channel
  "member": {
    "user": {
      "id": "448596...",          // Unique user ID → billing key
      "username": "artist42"
    }
  },
  "data": {
    "name": "imagine",              // Command name
    "options": [{
      "name": "prompt",
      "value": "a cyberpunk city --ar 16:9 --v 6 --stylize 750"
    }]
  }
}

The user ID is the billing key. It maps to a Midjourney subscription record that determines: the subscription tier (Basic/Standard/Pro/Mega) and therefore which priority lanes the user may use, how much Fast GPU time remains this month, the concurrency limit (3/5/12/15), and whether the user is in good standing (not banned, payment current).

What Discord Gives You for Free

To appreciate why Discord was a strategic choice, not a lazy one, consider the complete inventory of infrastructure Midjourney did NOT have to build.

| Component | Discord provides | Cost to build from scratch |
|---|---|---|
| Authentication | Discord OAuth2, user identity, email verification, 2FA | 2–4 weeks of engineering + ongoing security maintenance + compliance |
| User accounts | Profile, avatar, display name, account settings, email | 1–2 weeks + database schema + GDPR/CCPA compliance |
| Real-time messaging | WebSocket infrastructure handling millions of concurrent connections, message delivery, offline queue, typing indicators | Months to build at scale. This alone is a company-scale problem. |
| Rate limiting | Per-user and per-channel rate limits, bot rate limiting, global rate limiting | Custom token bucket implementation + Redis + monitoring |
| Social graph | Friends, servers, channels, roles, permissions, DMs, group DMs | Complex graph database + API surface + privacy controls |
| Content sharing | Users share images by posting in channels; anyone in the channel sees them | Sharing links, embeds, OpenGraph metadata, permissions |
| Payments (initially) | Discord's subscription infrastructure | Stripe integration, invoicing, tax compliance, refunds, fraud detection |
| Mobile apps | iOS and Android apps, desktop apps (Windows, macOS, Linux), all free | 2–3 native app teams, 6+ months each, plus ongoing maintenance |
| CDN for images | Discord hosts message attachments on its own CDN (cdn.discordapp.com) | GCS/S3 + Cloudflare setup + egress cost management |
| Moderation tools | Server moderation, user banning, channel permissions, role-based access | Admin panel, abuse detection, appeals workflow, trust & safety team |
| Notifications | Push notifications, email notifications, in-app badges | APNs + FCM integration, notification preferences, delivery tracking |

Conservative estimate: building all of this from scratch would take a team of 20+ engineers 6–12 months. Midjourney got it for free by writing a Discord bot. With a team of 10 engineers in 2021[9], there was literally no other viable path to launch.

Misconception: Discord was "lazy" or "temporary." People assume Midjourney used Discord because they were too small to build a real app. In reality, Discord was optimal at every stage of growth. Even after building the web app in August 2024[13], Discord remains the primary interface because of its community effects, viral sharing, and social proof. The decision was strategic architecture, not a compromise. In a system design interview, if you dismiss the Discord choice as "they should have built a web app," you are revealing that you do not understand platform leverage.

Discord as Rate Limiter

This is subtle but architecturally important. Discord imposes its own rate limits on bot interactions: a bot can send a limited number of messages per second per channel, and users can invoke slash commands at a limited rate. These limits act as a natural backpressure mechanism between users and Midjourney's backend.

Specifically, Discord's rate limits provide three layers of protection that Midjourney gets for free:

1. Per-user rate limiting. A user cannot spam slash commands faster than Discord allows. This prevents any single user from flooding Midjourney's gateway, even if they write a script to automate commands.

2. Per-channel rate limiting. In busy public channels, Discord throttles bot responses to prevent message spam. This naturally limits how fast Midjourney generates images in shared contexts.

3. Global bot rate limiting. Discord imposes overall rate limits on bot API calls (message sends, edits, etc.). This prevents Midjourney's progress updates and delivery messages from overwhelming Discord's infrastructure.

Without Discord, Midjourney would need to build its own rate limiting layer with per-user token buckets, global rate limiting, concurrency tracking (max 3–15 active jobs per user[19]), and abuse detection. Discord provides the first two for free. Midjourney only needs to implement concurrency tracking (application-level check: "how many active jobs does this user have?") and content moderation (the pre-generation text filter[20]).
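The application-level concurrency check ("how many active jobs does this user have?") can be sketched as a small counter keyed by user ID. This is a single-process illustration; the class name is hypothetical, the limits passed in are whatever the subscription record specifies, and a real gateway would presumably keep the counts in a shared store so every gateway instance sees the same numbers.

```python
from collections import defaultdict


class ConcurrencyTracker:
    def __init__(self, limits):
        self._limits = limits            # tier name -> max active jobs
        self._active = defaultdict(int)  # user_id -> active job count

    def try_start(self, user_id, tier):
        """Reserve a slot for a new job; return False if the user is at the limit."""
        if self._active[user_id] >= self._limits[tier]:
            return False
        self._active[user_id] += 1
        return True

    def finish(self, user_id):
        # Called when a job reaches COMPLETED or a terminal failure.
        if self._active[user_id] > 0:
            self._active[user_id] -= 1
```

The `finish` call corresponds to the "active job count decrements" step in the COMPLETED state of the job state machine.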

Prompt Parsing

When a user types /imagine a cyberpunk city --ar 16:9 --v 6 --stylize 750 --no cars, the bot must parse this into a structured job object. The prompt text and flags need to be separated, validated, and converted into parameters that the inference pipeline understands.

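python
import random
from dataclasses import dataclass

@dataclass
class JobSpec:
    prompt: str
    aspect_ratio: tuple
    model_ver: str
    stylize: int
    neg_prompt: str
    quality: float
    seed: int
    chaos: int

def clamp(value, low, high):
    return max(low, min(high, value))

def parse_ar(text):
    width, height = text.split(":")    # "16:9" → (16, 9)
    return int(width), int(height)

def parse_prompt(raw_text):
    # Split prompt text from parameter flags. Splitting on " --" rather than
    # "--" leaves double hyphens inside the prompt text alone.
    parts = (" " + raw_text).split(" --")
    prompt_text = parts[0].strip()    # "a cyberpunk city"

    # Parse each flag into key-value pairs; a bare flag becomes "true"
    params = {}
    for part in parts[1:]:
        tokens = part.strip().split(" ", 1)
        key = tokens[0]
        value = tokens[1].strip() if len(tokens) > 1 else "true"
        params[key] = value

    # Result for the example input:
    # prompt_text = "a cyberpunk city"
    # params = {
    #   "ar": "16:9",       → aspect ratio
    #   "v": "6",           → model version
    #   "stylize": "750",   → creativity level (0-1000)
    #   "no": "cars"        → negative prompt
    # }

    # Validate and construct job spec (int()/float() raise ValueError on bad input)
    return JobSpec(
        prompt=prompt_text,
        aspect_ratio=parse_ar(params.get("ar", "1:1")),
        model_ver=params.get("v", "6"),    # kept as text: "5.2" and "6.1" are not integers
        stylize=clamp(int(params.get("stylize", 100)), 0, 1000),
        neg_prompt=params.get("no", ""),
        quality=float(params.get("quality", params.get("q", 1.0))),
        seed=int(params["seed"]) if "seed" in params else random.randrange(2**32),
        chaos=clamp(int(params.get("chaos", 0)), 0, 100),
    )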

Each parsed parameter affects inference behavior and system resource usage in specific ways:

| Parameter | What it controls | System impact |
|---|---|---|
| --ar | Output aspect ratio (1:1, 16:9, 9:16, 4:3, etc.) | Changes the latent tensor dimensions. VRAM use tracks the total latent area, so ratios that raise the pixel count cost more memory. Extreme ratios (3:1) may exceed VRAM limits. |
| --v | Model version (5, 5.2, 6, 6.1, 8) | Different model weights, different architectures, different GPU requirements. The scheduler must route to a worker with the correct model loaded. Model switching takes seconds (weight loading). |
| --stylize | How much the model deviates from literal prompt interpretation | May affect classifier-free guidance scale or the number of diffusion steps. If it raises the step count, higher stylize means more GPU time. |
| --quality | Rendering quality (0.25, 0.5, 1, 2) | Directly controls inference time. Quality 2 uses 2x the diffusion steps of quality 1. Quality 0.25 is 4x faster but lower detail. |
| --no | Negative prompt (things to exclude from the image) | Adds to the text encoder input as negative conditioning. Minor compute overhead for the additional encoding pass. |
| --seed | Random seed for reproducible generation | Determines the initial noise tensor. Same seed + same prompt = same image. Essential for the "vary" feature. |
| --chaos | Variation between the 4 grid images (0-100) | Controls how different the 4 noise seeds are from each other. Higher chaos = more diverse grid = same GPU cost. |

The Moderation Pipeline

Content moderation happens at two stages, and the pre-generation stage is the most cost-effective system component in the entire architecture. Let us understand why.[20]

Stage 1: Pre-generation text filter. Before the prompt enters the job queue, a text classifier scans it for banned content categories: explicit violence, NSFW content, specific public figures, copyrighted characters, and other policy violations. This classifier is fast — likely a small fine-tuned language model or even a sophisticated keyword/regex system running in single-digit milliseconds.

If the prompt is flagged, the user gets an immediate rejection message ("Your prompt was blocked by our content filter") and no GPU time is consumed. This is the key insight: every banned prompt caught at this stage saves 5 petaops of compute that would have been wasted generating an image that would be blocked anyway.

Let us quantify the savings rigorously. Suppose 5% of prompts violate content policy (a conservative estimate given the diversity of 2.5M images per day from 20M accounts).

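Jobs/day = 2.5M images ÷ 4 images per job = 625,000 jobs
Blocked jobs/day = 625,000 × 0.05 = 31,250 jobs
GPU-seconds saved = 31,250 jobs × 20 GPU-sec per job (assumed average) = 625,000 GPU-sec ≈ 174 GPU-hours/day
Annual savings ≈ 174 × 365 × $3/GPU-hr (assumed cost) = $190,530/year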

The text filter itself probably costs less than $1,000/year to run (a small model on a single CPU instance). The ROI is roughly 190x. This is why the moderation pipeline is a system component, not a policy feature.
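The savings estimate can be reproduced as a short calculation, which also makes it easy to try other assumptions. All inputs are the chapter's figures: the 5% violation rate, 20 GPU-seconds per job, and $3 per GPU-hour are estimates, not measured values.

```python
def moderation_savings(images_per_day=2_500_000, images_per_job=4,
                       violation_rate=0.05, gpu_seconds_per_job=20,
                       gpu_hour_cost=3.0):
    """Return (blocked_jobs_per_day, gpu_hours_saved_per_day, annual_savings_usd)."""
    jobs_per_day = images_per_day / images_per_job
    blocked_jobs = jobs_per_day * violation_rate
    gpu_hours_per_day = blocked_jobs * gpu_seconds_per_job / 3600
    annual_savings = gpu_hours_per_day * 365 * gpu_hour_cost
    return blocked_jobs, gpu_hours_per_day, annual_savings


blocked, gpu_hours, annual = moderation_savings()
```

Without rounding the intermediate GPU-hours to 174, the annual figure comes out near $190,100 rather than $190,530; the conclusion is unchanged. Halving the violation rate to 2.5% still leaves the filter paying for itself many times over.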

Stage 2: Post-generation image classifier. After the GPU generates the images, an image classification model scans the output for visually problematic content. This catches cases where the text prompt seemed innocuous but the generated image is not — a surprisingly common occurrence. The model's latent space contains associations from training data that can produce unexpected visual content from apparently benign prompts.

The post-generation classifier is more expensive (it runs on GPU, processing actual image pixels) but catches violations that no text filter can predict. Together, the two stages form a defense-in-depth pattern: cheap filter first, expensive filter second.
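The ordering of the two stages can be sketched as a short pipeline in which the cheap check runs before any GPU work. This is an illustrative sketch: the single regex rule is a placeholder, and `generate` and `image_is_safe` are hypothetical callables standing in for the inference pipeline and the image classifier.

```python
import re

BANNED_PATTERNS = [re.compile(r"\bgore\b", re.IGNORECASE)]  # placeholder rule only


def text_filter(prompt):
    """Stage 1: cheap check that runs before any GPU time is spent."""
    return not any(pattern.search(prompt) for pattern in BANNED_PATTERNS)


def moderate_and_generate(prompt, generate, image_is_safe):
    """Run stage 1, then generation, then stage 2. Returns (status, image)."""
    if not text_filter(prompt):
        return "blocked_pre_generation", None
    image = generate(prompt)          # expensive GPU step
    if not image_is_safe(image):      # stage 2: image classifier
        return "blocked_post_generation", None
    return "ok", image
```

The key property is that a prompt rejected in stage 1 never reaches `generate`, which is where the GPU savings come from.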

Defense in depth. The two-stage moderation pipeline is a classic security pattern applied to content moderation. The text filter is the cheap, fast first line of defense and plausibly catches the large majority of violations (say ~90%, an illustrative figure). The image classifier is the expensive, thorough second line that catches the remainder. This same pattern appears in email (SPF check → spam classifier), web security (WAF → application-level input validation), and fraud detection (rule engine → ML model). In an interview, mentioning the two-stage pattern and quantifying the cost savings shows staff-level thinking.

The Progress Update Trick

One of the most user-visible pieces of engineering is the progress update. As the diffusion model runs, the Discord bot edits its message to show a blurry preview that progressively sharpens, along with a percentage counter. This is not just cosmetic — it is a critical UX decision with real engineering implications.

The diffusion model generates images iteratively, starting from pure noise and denoising over N steps (typically 20–50 steps). At each intermediate step, the current state of the latent can be decoded back into pixel space. The result is a blurry, noisy preview early on that becomes sharper and more detailed with each step. Midjourney periodically (every few steps) sends this intermediate rendering back to the Discord bot, which edits its message to show the preview.

pseudo
function generateWithProgress(job, discord_msg_id):
    # Step 1: Encode prompt text via CLIP-family model
    text_emb = encodeText(job.prompt, job.neg_prompt)

    # Step 2: Sample initial noise (4 images, one per grid slot)
    noise = sampleNoise(
        batch=4,
        seed=job.seed,
        height=job.latent_h,  # from aspect ratio
        width=job.latent_w
    )

    # Step 3: Iterative denoising loop
    num_steps = max(1, int(job.quality * 25))  # quality=1 → 25 steps; int() because range() needs a whole number
    for step in range(num_steps):
        noise = denoise_step(noise, text_emb, step, num_steps)

        # Send progress preview every 5 steps
        if step % 5 == 0 and step > 0:
            preview = decodeLatent(noise)         # VAE decode
            preview_small = resize(preview, 256)  # Low-res for speed
            preview_url = uploadTemp(preview_small)
            pct = int((step / num_steps) * 100)
            editDiscordMsg(discord_msg_id,
                text=f"**{job.prompt}** — {pct}%",
                image=preview_url)

    # Step 4: Final decode at full resolution
    final_images = decodeLatent(noise)   # Full-res VAE decode
    grid = makeGrid(final_images, 2, 2) # 2×2 grid layout

    # Step 5: Post-moderation check
    if moderateImage(grid):
        return regenerate(job)  # Flagged → try again

    # Step 6: Upload and deliver
    final_url = uploadToGCS(grid)
    editDiscordMsg(discord_msg_id,
        text="",
        image=final_url,
        buttons=["U1","U2","U3","U4","V1","V2","V3","V4","🔄"])

The engineering cost of progress updates is not trivial. Each message edit is a Discord API call (subject to Discord's bot rate limits). Each preview requires a partial decode of the latent space through the VAE decoder (additional GPU time, though much less than a full denoising step). The preview images must be uploaded to temporary storage and then to Discord's CDN. For a 25-step generation sending previews every 5 steps, that is 4 additional API calls, 4 VAE decodes, and 4 image uploads per job.

But the UX benefit is enormous. Users see progress instead of staring at a loading spinner, which reduces perceived wait time. Research on waiting consistently finds that visible progress, even approximate, makes a wait feel shorter; reductions of 30-40% are often quoted, though the exact figure varies by study. For a 10-second generation, that is the difference between "this is fast" and "why is this taking so long?"

The Web App: Same Backend, Different Frontend

When Midjourney launched alpha.midjourney.com with V6.1 in August 2024[13], the web app needed to connect to the same backend infrastructure as the Discord bot. This means the API gateway must route requests from two different frontends to the same job queue.

The architectural pattern is straightforward: both the Discord bot handler and the web app API are thin clients that validate input, check rate limits, and enqueue jobs. The entire inference, storage, and delivery pipeline is shared. The only difference is the delivery mechanism:

| Aspect | Discord | Web App |
| --- | --- | --- |
| Authentication | Discord user ID from Interaction webhook | Session token / JWT from Midjourney auth |
| Job submission | Slash command via Discord Interaction API | REST API call (POST /jobs) |
| Progress delivery | Bot edits its own Discord message | WebSocket or Server-Sent Events push |
| Final delivery | Discord message with image attachment + buttons | Gallery UI update with image and action buttons |
| Rate limiting | Discord rate limits + Midjourney concurrency check | Midjourney rate limiter only (no Discord safety net) |

The web app path requires Midjourney to build more infrastructure: their own authentication system, their own rate limiting, their own WebSocket service for real-time updates. This is precisely the infrastructure that Discord provided for free in the early days. Building the web app was only feasible once the team had grown and the revenue could fund the additional engineering.

Discord Message Flow

Watch a /imagine command flow through Discord's API into Midjourney's backend. Notice the 3-second acknowledgment deadline (red line) and the progress update cycle during GPU inference.

The API Question

Midjourney has no official public API. This is a deliberate product decision, not a technical limitation. Third-party tools that integrate with Midjourney do so by reverse-engineering Discord's Interaction API — they send bot commands through Discord as if they were a user, then listen for the bot's response messages.

This works, but it is fragile. Discord can change their API. Midjourney can detect and ban automated accounts. The interaction rate is limited by Discord's rate limits. And it violates Midjourney's Terms of Service.

Why no official API? Because an API would fundamentally change the economics. An API enables:

By forcing users through Discord or the web app, Midjourney maintains control over the user experience, rate limiting, and revenue per GPU-hour. The concurrency limits (3–15 per user[19]) and queue limits (10 pending jobs) are enforceable because there is no API bypass.

Both frontends, one backend. Whether a request comes from Discord or the web app, it hits the same API gateway, enters the same job queue, runs on the same GPU fleet, and is stored in the same object storage. The frontend is just a thin interface layer. All the complexity — and all the cost — lives in the queue, the scheduler, and the inference pipeline. These are the components we will explore in the next chapters.

Gateway Architecture

The API gateway sits between both frontends and the job queue. Its responsibilities form a clear, sequential pipeline where each step can reject the request early, saving cost on downstream processing.

1. Authentication
Verify the user identity (Discord user ID from the Interaction payload, or web session token). Look up their subscription tier (Basic/Standard/Pro/Mega), remaining GPU-hour quota, and account standing.
2. Rate Limiting
Check per-user concurrency: how many active jobs does this user have? If at their tier limit (3 for Basic, up to 15 for Mega[19]), reject with a friendly message. Check pending queue: if 10 jobs already queued for this user, reject.
3. Prompt Parsing
Extract prompt text and parameters. Validate parameter values: aspect ratio within allowed range (max ~3:1), model version exists (5, 5.2, 6, 6.1, 7, 8), stylize 0–1000, quality in {0.25, 0.5, 1, 2}. Reject malformed input with a descriptive error message.
4. Content Moderation
Run the pre-generation text filter[20]. Reject banned prompts before they consume GPU time. Log flagged prompts (with user ID but without PII) for policy refinement and model retraining.
5. Job Construction
Build a complete job object: prompt text, parsed parameters, user ID, priority lane (Turbo/Fast/Relax based on user's mode selection and subscription tier), creation timestamp, and callback target (Discord message ID for Discord, WebSocket session ID for web app).
6. Enqueue
Push the job object into the appropriate priority lane of the job queue. Return immediately with a job ID and "queued" status. Total gateway processing time: <100ms.

Each of these steps takes single-digit milliseconds (except moderation, which might take 10-50ms depending on the text classifier's complexity). The entire pipeline runs in under 100 milliseconds — well within Discord's 3-second deadline.

The ordering of steps is deliberate: cheap checks first. Authentication is a simple database lookup. Rate limiting is a counter check. Prompt parsing is string manipulation. Only after all of these pass do we run the more expensive content moderation classifier. This is the bouncer pattern in distributed systems: the cheapest component in the pipeline does the most filtering, so that expensive components (GPUs at $3/hour) only process validated, prioritized, and moderation-cleared work.
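
The cheap-checks-first ordering can be sketched as a list of check functions that run in sequence and stop at the first failure. This is a minimal illustration, not Midjourney's gateway: the `Request` fields, the `KNOWN_USERS` table, and the word-list "classifier" are all hypothetical stand-ins.

```python
from dataclasses import dataclass

@dataclass
class Request:
    user_id: str
    prompt: str
    active_jobs: int = 0

# Hypothetical data standing in for real services.
KNOWN_USERS = {"alice": 3}      # user_id -> concurrency limit (assumed values)
BANNED_WORDS = {"forbidden"}    # placeholder for a real text classifier

# Each check returns an error string, or None when the request passes.
def authenticate(req):
    return None if req.user_id in KNOWN_USERS else "unknown user"

def rate_limit(req):
    limit = KNOWN_USERS[req.user_id]
    return None if req.active_jobs < limit else "too many active jobs"

def parse_prompt(req):
    return None if req.prompt.strip() else "empty prompt"

def moderate(req):
    words = set(req.prompt.lower().split())
    return "prompt rejected by moderation" if words & BANNED_WORDS else None

# Cheapest checks first, most expensive (moderation) last.
PIPELINE = [authenticate, rate_limit, parse_prompt, moderate]

def handle(req):
    """Run checks in order; the first failure short-circuits everything after it."""
    for check in PIPELINE:
        error = check(req)
        if error is not None:
            return {"status": "rejected", "stage": check.__name__, "reason": error}
    return {"status": "queued"}
```

Because the loop returns on the first error, a request from an unknown user never reaches the moderation stage, which is the whole point of the ordering.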

Scaling the Gateway

At 22 jobs/second peak (87 QPS / 4 images per job), the gateway itself is not a bottleneck. Any modern API gateway (nginx, Envoy, Kong, or a custom Go/Rust service) can handle tens of thousands of requests per second. The gateway's scaling challenge is not throughput — it is consistency.

The concurrency check ("how many active jobs does this user have?") requires a shared counter that is consistent across all gateway instances. If two gateway instances both check simultaneously and both see "2 active jobs" when the limit is 3, they might both accept a new job, putting the user at 4 — exceeding the limit. This requires either a centralized counter (Redis INCR) or an eventually-consistent approach with occasional over-admission.

At Midjourney's scale (22 jobs/second, each involving one counter check), a single Redis instance handles this trivially. This is another area where Midjourney's relatively modest QPS (compared to, say, Google Search at 100K+ QPS) makes the engineering simpler than it might appear.
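
The race described above goes away when the counter is incremented first and checked second, so the increment itself reserves the slot. The sketch below uses an in-memory class guarded by a lock as a stand-in for a shared store with an atomic increment; `AtomicCounterStore`, `try_admit`, and `simulate` are illustrative names, and nothing here is taken from Midjourney's code.

```python
import threading

class AtomicCounterStore:
    """In-memory stand-in for a shared store with atomic increment/decrement."""

    def __init__(self):
        self._counts = {}
        self._lock = threading.Lock()

    def incr(self, key):
        with self._lock:
            self._counts[key] = self._counts.get(key, 0) + 1
            return self._counts[key]

    def decr(self, key):
        with self._lock:
            self._counts[key] = self._counts.get(key, 0) - 1
            return self._counts[key]

    def get(self, key):
        with self._lock:
            return self._counts.get(key, 0)

def try_admit(store, user_id, limit):
    """Increment first, then check; give the slot back if over the limit."""
    if store.incr(user_id) > limit:
        store.decr(user_id)
        return False
    return True

def simulate(attempts=50, limit=3):
    """Fire many concurrent admission attempts for one user."""
    store = AtomicCounterStore()
    results = []

    def worker():
        results.append(try_admit(store, "user-1", limit))

    threads = [threading.Thread(target=worker) for _ in range(attempts)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(results), store.get("user-1")
```

In a real deployment the same increment-then-check shape would map onto an atomic increment in a shared store such as Redis `INCR`, with a matching decrement when a job finishes.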

Why was Discord a strategic choice for Midjourney, not just a convenient one?

Chapter 4: The Queue & Job Orchestration

You just typed /imagine a cyberpunk fox reading a newspaper and hit Enter. Your prompt vanishes into a bot that serves twenty million Discord accounts[11]. Somewhere, a GPU needs to spend ten to sixty seconds generating your image. But right now, a million other people are online too[11]. How does the system decide who goes first, who waits, and who gets told "Queue full, try again later"?

This is a job orchestration problem, and it is the single most important piece of infrastructure between your prompt and your pixels. Get it wrong, and paying users wait behind free-tier floods. Get it right, and you can serve 2.5 million images per day[12] with a team of forty engineers[9].

Why FIFO Doesn't Work

The naive approach is a First-In, First-Out queue. Jobs enter at the back, leave at the front. Simple, fair, and completely wrong for a business that charges $10-120/month for faster access.

Consider: a Relax-mode user submits 50 images. Each takes 30 seconds of GPU time. That is 25 minutes of GPU capacity consumed. If a Turbo-mode user (paying 2x the GPU cost[27]) joins the back of the line, they wait behind all 50 Relax jobs. The person paying a premium gets worse service than the person paying nothing extra for GPU time. Revenue collapses. This is why every serious generation service uses priority queues.

Common misconception: "Just add more GPUs." You cannot outrun a queue design problem with hardware. Even with 10,000 GPUs[5], if your queue doesn't prioritize correctly, a burst of Relax jobs will delay Turbo users. Priority is a software problem, not a hardware problem.

Three Priority Lanes

Midjourney solves this with three priority lanes, each with different latency guarantees and cost structures:

| Mode | Speed | GPU Cost | Wait Time | Use Case |
| --- | --- | --- | --- | --- |
| Turbo | 4x faster generation | 2x per image | Near-zero queue | Urgent iteration |
| Fast | Normal speed | 1x per image | ~5-60 seconds | Default workflow |
| Relax | Normal speed | 0x (included) | 0-30 minutes | Bulk exploration |

Turbo mode is not just "jump the queue." The generation itself runs faster too — likely using more GPUs per image or more aggressive parallelism[27]. Think of it as both priority AND resource allocation. Fast mode is the default — about one minute of GPU time per image[26]. Relax mode is the "I have time" lane — you get unlimited images, but you wait until GPUs are available.

Per-User Concurrency Limits

Priority lanes alone are not enough. Without limits, a single Pro user could flood the Turbo lane with hundreds of concurrent jobs. So Midjourney enforces per-user concurrency caps:

| Plan | Concurrent Jobs | Queue Limit | Monthly Cost |
| --- | --- | --- | --- |
| Basic | 3 | 10 | $10 |
| Standard | 3 | 10 | $30 |
| Pro | 12 | 10 | $60 |
| Mega | 15 | 10 | $120 |

The queue limit is separate from concurrency[19]. You can have 3 jobs actively processing AND 10 more waiting in your personal queue. Submit an 11th and you get "Queue full." This prevents any user from monopolizing dispatch bandwidth.
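
The two limits combine into a three-way decision for each new submission: run it, queue it, or reject it. The sketch below uses the plan limits from the table above; the function names and the "run" / "queue" / "reject" labels are illustrative.

```python
# Concurrency limits per plan and the shared pending-queue limit, from the table.
PLAN_LIMITS = {"Basic": 3, "Standard": 3, "Pro": 12, "Mega": 15}
QUEUE_LIMIT = 10

def admit(plan, active, pending):
    """Decide what happens to one new job given the user's current counts."""
    if active < PLAN_LIMITS[plan]:
        return "run"       # a concurrency slot is free
    if pending < QUEUE_LIMIT:
        return "queue"     # wait in the user's personal queue
    return "reject"        # "Queue full"

def submit_burst(plan, n):
    """Submit n jobs back to back, assuming none finish in the meantime."""
    active = pending = 0
    outcomes = []
    for _ in range(n):
        outcome = admit(plan, active, pending)
        if outcome == "run":
            active += 1
        elif outcome == "queue":
            pending += 1
        outcomes.append(outcome)
    return outcomes
```

For example, a Basic user who submits 14 jobs at once ends up with 3 running, 10 queued, and 1 rejected.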

Job Lifecycle

Every generation request passes through a well-defined state machine. Understanding these states is critical for designing retry logic, dead letter queues, and monitoring dashboards:

SUBMITTED
Bot receives /imagine command. Prompt is parsed, parameters extracted, moderation check initiated.
QUEUED
Job enters the priority lane (Turbo/Fast/Relax). Position depends on plan tier, lane load, and submission timestamp.
DISPATCHED
Orchestrator assigns job to a GPU worker with available memory. Worker acknowledges receipt. Clock starts for timeout.
PROCESSING
GPU runs 20-75 denoising steps. Progress updates sent back to Discord (0%, 25%, 50%, 75%, 100%). Typically 10-50 seconds.
COMPLETED / FAILED
Success: image uploaded, Discord message updated with result. Failure: GPU OOM, timeout, or moderation rejection → dead letter queue for analysis.
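
The lifecycle above can be written down as an explicit transition table, which makes illegal jumps (for example COMPLETED back to QUEUED) fail loudly. This is a sketch under the assumption that any non-terminal state may fail; the `ALLOWED` table and `transition` helper are illustrative, not a documented Midjourney interface.

```python
from enum import Enum

class JobState(Enum):
    SUBMITTED = "submitted"
    QUEUED = "queued"
    DISPATCHED = "dispatched"
    PROCESSING = "processing"
    COMPLETED = "completed"
    FAILED = "failed"

# Which states each state may move to. Terminal states map to an empty set.
ALLOWED = {
    JobState.SUBMITTED: {JobState.QUEUED, JobState.FAILED},
    JobState.QUEUED: {JobState.DISPATCHED, JobState.FAILED},
    JobState.DISPATCHED: {JobState.PROCESSING, JobState.FAILED},
    JobState.PROCESSING: {JobState.COMPLETED, JobState.FAILED},
    JobState.COMPLETED: set(),
    JobState.FAILED: set(),
}

def transition(current, new):
    """Return the new state, or raise if the move is not in the table."""
    if new not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {new.name}")
    return new
```

Keeping the table in one place also gives monitoring a single source of truth for which states are terminal.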

Dead Letter Queues

What happens when a job fails? GPU runs out of memory (a 2K image at high upscale can exceed VRAM). The diffusion process produces a degenerate output. The worker crashes mid-inference. You cannot just drop these jobs — the user is staring at a progress bar.

A dead letter queue (DLQ) captures failed jobs with their full context: the prompt, parameters, error code, GPU ID, and timestamp. The orchestrator can then retry on a different worker (maybe one with more VRAM), notify the user if retries are exhausted, and aggregate failure patterns for debugging (e.g., "all failures are on GPU node 47 — it has bad memory").
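
A retry-then-park loop along these lines could look like the sketch below. It assumes workers are plain callables that raise on failure and that three attempts is the retry budget; both are assumptions made for illustration, as is the shape of the dead-letter record.

```python
MAX_ATTEMPTS = 3  # assumed retry budget

def run_with_retries(job, workers, dead_letters):
    """Try the job on successive workers; park it in the DLQ if every attempt fails.

    workers: list of callables taking the job and returning a result or raising.
    dead_letters: list that collects failed jobs together with their error history.
    """
    errors = []
    for attempt, worker in enumerate(workers[:MAX_ATTEMPTS], start=1):
        try:
            return worker(job)
        except Exception as exc:
            # Keep the context needed for later debugging and aggregation.
            errors.append({
                "attempt": attempt,
                "worker": worker.__name__,
                "error": repr(exc),
            })
    dead_letters.append({"job": job, "errors": errors})
    return None
```

Recording which worker produced each error is what makes patterns like "every failure came from the same node" visible afterwards.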

Batch Optimization

Here is a subtle but powerful optimization. GPU utilization is highest when you fill all available compute. A single 512×512 image might only use 40% of a GPU's capacity. But Midjourney generates 4 variations per request[12]. That is four images batched into one GPU dispatch, amortizing the model loading and memory allocation overhead.

The orchestrator can go further: group multiple users' small jobs onto the same GPU if they fit. This is bin packing — the same problem that Kubernetes solves for CPU containers, but applied to GPU memory and compute slots.
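
One classic heuristic for this kind of bin packing is first-fit decreasing: sort jobs by size, then place each in the first GPU that still has room. The sketch below packs by memory only and uses made-up job sizes; how Midjourney's orchestrator actually packs jobs is not public.

```python
def first_fit_decreasing(jobs, gpu_capacity_gb):
    """Pack jobs onto as few GPUs as the heuristic manages.

    jobs: dict mapping job name -> memory needed in GB.
    Returns a list of GPUs, each a list of the job names placed on it.
    """
    gpus = []  # each entry is [free_gb, [job names]]
    # Largest jobs first: they are the hardest to place.
    for name, need in sorted(jobs.items(), key=lambda kv: kv[1], reverse=True):
        if need > gpu_capacity_gb:
            raise ValueError(f"{name} needs {need} GB, more than one GPU holds")
        for gpu in gpus:
            if gpu[0] >= need:
                gpu[0] -= need
                gpu[1].append(name)
                break
        else:
            # No existing GPU had room, so open a new one.
            gpus.append([gpu_capacity_gb - need, [name]])
    return [names for _, names in gpus]
```

First-fit decreasing is not optimal in general, but it is simple and known to stay close to the optimum, which is usually enough for a scheduler that has to decide in milliseconds.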

Worked Example: Capacity Planning

Back-of-envelope capacity. Let us verify that the numbers add up. Midjourney processes 20-40 jobs per second[24]. Each job takes roughly 10-50 seconds of GPU time. At the midpoint (30 jobs/sec, 30 sec each): 30 × 30 = 900 concurrent GPU jobs. With ~10,000 GPUs[5], that is only 9% average utilization — leaving massive headroom for burst traffic and Relax queue draining. The math checks out.

Let us break this down further. Daily throughput: 2.5 million images[12] divided by 86,400 seconds per day = 29 images per second. That sits inside the 20-40 per second range above if that range is read per image; counted as four-image jobs it is closer to 7 jobs per second on average, with peaks about three times higher. Each image consumes roughly 5 petaops[7]. At 29 images/sec, total throughput is 145 petaops/sec. An A100 delivers ~312 TFLOPS (FP16), so you need 145,000 / 312 = ~465 GPUs continuously saturated. With 10,000 GPUs, that is 4.6% average utilization; the rest handles bursts, Turbo mode parallelism, and the long tail of Relax jobs that accumulate during peak hours. This figure is lower than the 9% estimate above because it assumes every GPU runs at peak FLOPS, which real inference never achieves.
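
The same back-of-envelope numbers can be reproduced in a few lines. The constants come from the figures quoted in this chapter; the variable names and the `concurrent_jobs` helper (Little's law: jobs in flight = arrival rate × time per job) are illustrative.

```python
IMAGES_PER_DAY = 2.5e6     # daily image volume
OPS_PER_IMAGE = 5e15       # 5 petaops per image
A100_FLOPS = 312e12        # peak FP16 throughput of one A100
FLEET_SIZE = 10_000        # approximate GPU count
SECONDS_PER_DAY = 86_400

images_per_sec = IMAGES_PER_DAY / SECONDS_PER_DAY
ops_per_sec = images_per_sec * OPS_PER_IMAGE
gpus_needed = ops_per_sec / A100_FLOPS          # GPUs saturated around the clock
utilization = gpus_needed / FLEET_SIZE

def concurrent_jobs(arrival_rate_per_sec, seconds_per_job):
    """Little's law: average number of jobs in flight."""
    return arrival_rate_per_sec * seconds_per_job

print(f"{images_per_sec:.1f} images/sec")
print(f"{gpus_needed:.0f} saturated GPUs, {utilization:.1%} of the fleet")
print(f"{concurrent_jobs(30, 30):.0f} jobs in flight at 30 jobs/sec x 30 sec")
```

The unrounded result is slightly under the 465 GPUs quoted above, because the prose rounds 28.9 images per second up to 29 before multiplying.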

The Technology Stack

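Midjourney has not publicly disclosed their queue technology, but the workload pattern is a natural fit for a log-based queue such as Apache Kafka or Redis Streams. Both support consumer groups, acknowledgment-based at-least-once delivery (offset commits in Kafka, XACK in Redis Streams), and the ability to replay failed jobs; with one topic or stream per priority lane, a dispatcher can drain the lanes in priority order. Kafka additionally offers exactly-once semantics for Kafka-to-Kafka processing through its transactional API, but a GPU job with external side effects still needs idempotent handling. Kafka is the more common choice at this scale: it handles millions of messages per second, provides durable storage, and integrates naturally with Kubernetes-based GPU orchestration.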

The dispatcher likely runs as a separate service that polls all three lanes, preferring Turbo over Fast over Relax, and matches jobs to available GPU workers based on memory requirements, current load, and geographic proximity (to minimize data transfer latency).
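
Strict lane preference with first-come ordering inside each lane can be sketched with a single heap keyed on (lane priority, arrival sequence). The `Dispatcher` class below is illustrative; it ignores GPU matching and geography and only shows the ordering rule.

```python
import heapq
import itertools

# Lower number = dispatched first.
LANE_PRIORITY = {"turbo": 0, "fast": 1, "relax": 2}

class Dispatcher:
    """Strict-priority dispatcher: Turbo before Fast before Relax, FIFO within a lane."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker that preserves arrival order

    def enqueue(self, lane, job_id):
        heapq.heappush(self._heap, (LANE_PRIORITY[lane], next(self._seq), job_id))

    def dispatch(self):
        """Return the next job id, or None when nothing is waiting."""
        if not self._heap:
            return None
        _, _, job_id = heapq.heappop(self._heap)
        return job_id
```

Strict priority can starve the Relax lane under sustained load, so a production scheduler would typically add aging or a reserved share for lower lanes; that refinement is left out here.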

Priority Queue Visualizer

Watch jobs flow through three priority lanes. Click "Add Job" to inject jobs into different lanes and see how priority affects dispatch order. Turbo jobs (orange) always dispatch first, then Fast (teal), then Relax (purple).

A Midjourney Pro user ($60/month, 12 concurrent jobs) submits 15 images in Fast mode. What happens to the last 3?

Chapter 5: The GPU Inference Pipeline

The queue dispatched your job to a GPU worker. Now what? This is where the real compute happens, and where 90% of Midjourney's cost goes[6]. A single image generation burns through 5 petaops ($5 \times 10^{15}$ floating-point operations)[7]. To put that in perspective, multiplying two 1000×1000 matrices takes 2 billion operations. Generating one Midjourney image is equivalent to doing that 2.5 million times.

Where do all those operations go? Into the iterative diffusion process that transforms random noise into a coherent image, one small step at a time.

Diffusion from Zero

Imagine you have a photograph. Now add a tiny bit of random noise — like TV static mixed into every pixel. Do it again. And again. After a thousand steps of adding noise, the original image is completely destroyed. All you have is pure random static. Diffusion models learn to reverse this process.

The model is trained by showing it millions of images at various noise levels and teaching it to predict what the "less noisy" version looks like. At inference time, you start with pure noise and ask the model: "If this noise came from a real image, what would one step of denoising look like?" Then you feed that slightly-less-noisy result back in and ask again. After 20-75 steps, structure emerges from chaos.

The key insight: Each denoising step requires a full forward pass through the entire neural network — billions of parameters, every layer, every attention head. That is why diffusion is so compute-intensive. It is not one inference call. It is 20-75 inference calls, sequentially, with each one's output feeding the next.

Why 5 Petaops?

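Let us check this number with a rough estimate. A modern DiT (Diffusion Transformer) model has roughly 2-10 billion parameters[30]. A forward pass through a transformer costs approximately 2 operations per parameter per token (one multiply, one add). For a 5B-parameter model that is $10^{10}$ operations per token, and a 1024×1024 image tokenized into 4096 patches needs $2 \times 5 \times 10^{9} \times 4096 \approx 4.1 \times 10^{13}$ operations per forward pass. Self-attention adds a quadratic term of roughly $2 \times 4096^2 \times d_{model}$ per layer; with 32 layers and $d_{model} = 1280$ that is about $1.4 \times 10^{12}$ per forward pass.

$$\text{FLOPs}_{total} = \text{steps} \times \left(2 \times N_{params} \times \text{seq} + \text{layers} \times 2 \times \text{seq}^2 \times d_{model}\right)$$

Plugging in: $50 \times (2 \times 5 \times 10^{9} \times 4096 + 32 \times 2 \times 4096^2 \times 1280) \approx 50 \times 4.2 \times 10^{13} \approx 2 \times 10^{15}$. That is the same order of magnitude as the cited 5 petaops. The remaining factor of about 2.5 is plausibly covered by running two forward passes per step (as classifier-free guidance does), a model nearer the top of the 2-10B range, or more than 50 steps.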
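
An estimate like this is easy to sanity-check numerically. The sketch below counts the weight term once per token (2 × parameters × sequence length), which is the usual transformer rule of thumb and an assumption of this sketch rather than a published Midjourney figure; `diffusion_flops` is an illustrative name.

```python
def diffusion_flops(steps, n_params, layers, seq_len, d_model):
    """Rough FLOP count for one image: dense weights plus attention scores."""
    dense = 2 * n_params * seq_len                    # multiply-add per weight, per token
    attention = layers * 2 * seq_len ** 2 * d_model   # quadratic in sequence length
    return steps * (dense + attention)

total = diffusion_flops(steps=50, n_params=5e9, layers=32, seq_len=4096, d_model=1280)
print(f"{total:.2e} operations per image")
```

With these inputs the result is on the order of a few times $10^{15}$, within a small factor of the cited 5 petaops, and the dense term dominates the attention term at this resolution.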

The DiT Architecture

Midjourney uses a Diffusion Transformer (DiT) — confirmed by their public fork of the xDiT parallelism library on GitHub[14]. This is a critical architectural choice that distinguishes modern image generators from earlier ones.

Older diffusion models (Stable Diffusion v1, DALL-E 2) used U-Net architectures — a convolutional neural network with skip connections. U-Nets work well at moderate resolutions, but their convolutional layers have fixed receptive fields. Every pixel only "sees" its local neighborhood. To capture global relationships (like "the fox's newspaper should have text that matches the lighting"), you need many layers and careful upsampling.

DiT replaces convolutions with transformer blocks — the same self-attention mechanism that powers GPT. Every image patch attends to every other patch at every layer. Global coherence is built-in, not bolted-on. The trade-off: attention scales quadratically with the number of patches. A 1024×1024 image with 16×16 patches has 4096 tokens. Self-attention on 4096 tokens is manageable. At 2048×2048 (16,384 tokens), it becomes brutal without optimization.

| Property | U-Net (older) | DiT (Midjourney V5+) |
| --- | --- | --- |
| Core operation | Convolution (local) | Self-attention (global) |
| Global coherence | Requires many layers | Built-in at every layer |
| Scaling | $O(n)$ in resolution | $O(n^2)$ in tokens |
| Parallelism | Limited | Sequence/tensor/pipeline parallel |
| Training efficiency | Plateaus at large scale | Follows compute scaling laws |

Text Conditioning: From Prompt to Cross-Attention

How does the model know you asked for a "cyberpunk fox"? Through text conditioning. Your prompt is first encoded by a text encoder — likely a CLIP-family model — into a sequence of embedding vectors. Each word (or subword token) becomes a vector of ~768-1024 dimensions.

These text embeddings enter the DiT through cross-attention. At every transformer layer, the image patches (queries) attend to the text embeddings (keys and values). This is how the model steers: patches that should depict "fox" place high attention weight on the "fox" text token and pull in its value vector. The --stylize parameter controls how strongly the model follows these text signals versus its own aesthetic training; high stylize means "make it beautiful even if it drifts from the prompt."
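
The mechanics of cross-attention fit in a few lines of plain Python: score each patch query against every text key, softmax the scores, and mix the text values by those weights. The two-dimensional vectors below are toy values chosen for illustration; real models use learned projections with hundreds of dimensions and many heads.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(patch_queries, text_keys, text_values):
    """Each image patch (query) becomes a weighted mix of the text values."""
    d = len(text_keys[0])
    outputs, weights = [], []
    for q in patch_queries:
        # Scaled dot-product similarity between this patch and every text token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in text_keys]
        w = softmax(scores)
        out = [sum(wj * v[i] for wj, v in zip(w, text_values))
               for i in range(len(text_values[0]))]
        weights.append(w)
        outputs.append(out)
    return outputs, weights

# Toy example: token 0 stands for "fox", token 1 for "newspaper".
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
fox_like_patch = [[4.0, 0.0]]
outputs, weights = cross_attention(fox_like_patch, keys, values)
```

The patch whose query aligns with the "fox" key puts most of its weight on that token, so its output is dominated by the "fox" value vector.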

Multi-GPU Parallelism: xDiT

A single high-resolution image can exceed what one GPU can handle. Midjourney's xDiT fork[16] supports three parallelism strategies, each splitting the workload differently:

Three flavors of parallelism. Sequence parallelism: split the 4096 image patches across GPUs. With 4 GPUs, each handles 1024 patches. Attention still requires communication between GPUs (every patch must attend to every other), but per-GPU activation memory drops by about 4x.

Tensor parallelism: split each transformer layer's weight matrices across GPUs. Every GPU processes all patches but computes only its share of the attention heads and a slice of each MLP. This reduces per-GPU weight memory and requires an all-reduce after each attention and MLP block.

PipeFusion (pipeline parallelism): different transformer layers run on different GPUs, with activations streaming from one to the next. Requires careful scheduling to avoid pipeline bubbles (idle GPUs waiting for input).

Flash Attention + SageAttention

Even with multi-GPU parallelism, attention is the bottleneck. Standard attention computes a full 4096 × 4096 attention matrix, which is 32 MB in FP16 (4096² × 2 bytes) for every head, in every layer, at every step. Flash Attention avoids materializing this matrix by computing attention in tiles that fit in GPU SRAM (shared memory), reducing memory usage from $O(n^2)$ to $O(n)$.
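
The trick that makes tiling possible is an online softmax: keep a running maximum and running sums, and rescale them whenever a new tile raises the maximum. The pure-Python sketch below shows this for a single query and is only meant to demonstrate that the tiled result matches the full computation; it is not the FlashAttention kernel, and the function names are illustrative.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_full(q, keys, values):
    """Reference: materialize every score, then softmax and mix the values."""
    scores = [dot(q, k) for k in keys]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    return [sum(wi * v[i] for wi, v in zip(w, values)) / z
            for i in range(len(values[0]))]

def attention_tiled(q, keys, values, tile=2):
    """Same result, but only one tile of scores exists at any time."""
    running_max = -math.inf
    denom = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), tile):
        scores = [dot(q, k) for k in keys[start:start + tile]]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old maximum.
        scale = math.exp(running_max - new_max)
        denom *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, values[start:start + tile]):
            w = math.exp(s - new_max)
            denom += w
            acc = [a + w * vi for a, vi in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc]
```

Memory for the scores is bounded by the tile size rather than the sequence length, which is where the drop from quadratic to linear memory comes from.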

Midjourney goes further with SageAttention[15], which quantizes the attention computation to 8-bit integers. Its authors report roughly a 2.1x speedup over FlashAttention2 with minimal quality loss. The key insight: attention scores are relative (they go through softmax), so absolute precision matters less than rank ordering. INT8 preserves the ranking while halving the memory bandwidth.

Common misconception: "Quantization always hurts quality." For attention specifically, 8-bit quantization is nearly lossless because softmax is monotonic: it preserves the ordering of scores and is dominated by the largest ones. If score A is clearly greater than score B in FP16, the same holds in INT8; only near-ties can flip. The model's output is dominated by which patches attend to which, not by the last bits of the attention values.

The V8 Migration: JAX/TPU → PyTorch/GPU

For years, Midjourney trained on Google TPUs using JAX[2]. In March 2026, V8 launched as a complete rewrite in PyTorch on GPUs[4]. The result: 5x faster generation, native 2K resolution, under 10 seconds per image[17].

Why would they rewrite the entire stack? Three reasons. First, the GPU ecosystem (CUDA, cuDNN, TensorRT, Flash Attention, xDiT) is far more mature than the TPU ecosystem for inference optimization. Second, PyTorch's eager execution makes debugging and iteration faster than JAX's functional compilation model. Third, inference runs on GPUs regardless[3] — training on TPUs while serving on GPUs meant maintaining two codebases. Unifying on PyTorch/GPU eliminated that burden.

The 4-Image Grid: Amortizing Overhead

When you run /imagine, Midjourney generates a 2×2 grid of four variations[12]. This is not just a UX choice — it is a compute optimization. Loading model weights from GPU HBM to SRAM takes time. Setting up the execution context (CUDA kernels, memory allocations) takes time. By batching four images, you amortize that overhead across four outputs. The marginal cost of image 2, 3, and 4 is much less than image 1.

python
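# Simplified diffusion inference loop (DDIM-style deterministic sampler)
# Real Midjourney uses DiT + Flash Attention + SageAttention

import torch

def alpha_bar(t):
    # Illustrative cosine schedule: 1 at t=0 (clean image), near 0 at t=1 (pure noise).
    # Clamped so the division in the update below stays finite at t=1.
    return (torch.cos(t * torch.pi / 2) ** 2).clamp(min=1e-3)

@torch.no_grad()
def generate_images(prompt_embedding, model, vae, steps=50, batch=4):
    # Start with pure noise (batch of 4 images)
    x = torch.randn(batch, 4, 64, 64)  # latent space

    # Denoising schedule: high noise → low noise
    timesteps = torch.linspace(1.0, 0.0, steps + 1)

    for t, t_next in zip(timesteps[:-1], timesteps[1:]):
        # Each step = FULL forward pass through DiT
        # Cross-attention with text embeddings
        noise_pred = model(x, t, prompt_embedding)

        # Estimate the clean latent, then re-noise it to the next (lower) noise level
        a, a_next = alpha_bar(t), alpha_bar(t_next)
        x0_pred = (x - (1 - a).sqrt() * noise_pred) / a.sqrt()
        x = a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * noise_pred

    # Decode latents to pixels (VAE decoder)
    images = vae.decode(x)  # 4 x 3 x 1024 x 1024
    return images  # 4-image grid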

Worked Example: GPU Memory Budget

Memory breakdown for one inference job. Model weights: ~5B parameters × 2 bytes (FP16) = 10 GB. Attention keys and values (4096 tokens × 1280 dim × 32 layers × 2 bytes × 2 for K and V): ~670 MB per image per step. Unlike an autoregressive language model, a diffusion model does not carry a KV-cache across steps; these tensors are recomputed at every denoising step, so only one step's worth is ever live. Batch of 4 latent images (4 × 4 × 64 × 64 × 2 bytes): ~130 KB (tiny). Activations during forward pass: ~4-8 GB (depends on layer count and sequence length). Total: ~15-19 GB. This fits on a single A100 (80 GB) with room to spare; larger models or higher resolutions are where multi-GPU parallelism becomes necessary[30].
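
The budget can be tallied in code to see which terms matter. The sizes below reuse the figures from the paragraph above (5B parameters, 4096 tokens, 1280 dimensions, 32 layers, FP16); the 4-8 GB activation range is taken from the text as an assumption rather than computed, and the key/value term counts both K and V.

```python
BYTES_FP16 = 2
GB = 1e9  # decimal gigabytes

weights_gb = 5e9 * BYTES_FP16 / GB
# Keys and values for every layer at one denoising step (factor 2 for K and V).
kv_gb = 4096 * 1280 * 32 * BYTES_FP16 * 2 / GB
# Four latent images of shape 4 x 64 x 64.
latents_gb = 4 * 4 * 64 * 64 * BYTES_FP16 / GB
activations_gb = (4.0, 8.0)  # assumed range from the text

total_low = weights_gb + kv_gb + latents_gb + activations_gb[0]
total_high = weights_gb + kv_gb + latents_gb + activations_gb[1]

print(f"weights {weights_gb:.1f} GB, K/V {kv_gb:.2f} GB, latents {latents_gb * 1e6:.0f} KB")
print(f"total {total_low:.1f} to {total_high:.1f} GB")
```

Weights and activations dominate; the latents themselves are a rounding error, which is why batching four images costs so little extra memory.
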
Diffusion Process Visualizer

Watch noise transform into structure. Drag the slider to see how each denoising step adds coherence. At step 0 (pure noise), the DiT sees random chaos. By step 50, global patterns emerge through cross-attention with the text prompt.

Midjourney generates an image with 50 denoising steps. Each step is a full forward pass through a 5B-parameter DiT. If you increase to 75 steps, what changes?

Chapter 6: Image Storage & Delivery

The GPU just finished denoising. Four images sit in GPU memory as raw pixel tensors — 4 × 3 × 1024 × 1024 floating-point values. Now what? Those pixels need to travel from a Google Cloud GPU in some data center to a Discord message on your phone, ideally in under two seconds. And this has to happen 2.5 million times per day[12], every day, without losing a single image.

This chapter traces the storage and delivery pipeline from GPU memory to your screen, and confronts the surprising economics of serving images at scale.

The Scale Problem

Let us start with the raw numbers. Each Midjourney image, after post-processing and compression, is roughly 1-3 MB. At 2.5 million images per day using a 2 MB average:

Daily storage ingest: 2.5M images × 2 MB = 5 TB per day. Per month: 150 TB. Per year: 1.8 PB. Simon Willison measured 55 million images on the Discord CDN alone, totaling 148+ TB[23] — and that was just a snapshot of publicly accessible images.

Five terabytes per day sounds terrifying until you realize that storage is cheap. The real cost is bandwidth. Every time a user views an image, scrolls back to an old generation, shares it in a channel, or opens the web app, those bytes travel from a CDN edge node to the user's device. That is egress, and cloud providers charge dearly for it.

The Post-Processing Pipeline

Before any image reaches storage, it passes through a post-processing pipeline on the GPU worker (or a nearby CPU worker):

1. VAE Decode
Convert from latent space (64×64 per image) to pixel space (1024×1024). This is the VAE decoder, part of the diffusion model itself.
2. Super-Resolution (optional)
If user requested upscale (U1-U4 buttons), run a separate upscaling model. V8 generates natively at 2K[17], reducing the need for post-hoc upscaling.
3. Format Conversion
Encode raw pixels to WebP or PNG. WebP at quality 85 cuts file size by roughly 80% with minimal perceptual loss. AVIF is even smaller but slower to encode.
4. Post-Gen Moderation
Image classifier checks the generated output for policy violations. This happens AFTER generation but BEFORE delivery[20]. A rejected image is never shown to the user.
5. Upload to Storage
Compressed image uploaded to object storage (Google Cloud Storage) and then served via CDN as a Discord attachment or through the web app CDN.

Discord CDN Delivery

For the first three years of Midjourney, Discord was not just the interface — it was the delivery infrastructure[13]. When the bot sends your completed image, it uploads it as a Discord attachment. Discord hosts these on its own CDN (backed by Google Cloud and Cloudflare). The URL looks like:

url
https://cdn.discordapp.com/attachments/{channel_id}/{message_id}/{filename}.png
  ?ex=6789abcd    # expiry timestamp (hex)
  &is=12345678    # issue timestamp
  &hm=abc123...   # HMAC signature

The signed expiring URLs[28] are critical for security. Without the HMAC signature, you cannot access the image. When the URL expires, you need a fresh signature. This prevents hotlinking (other sites embedding Midjourney images for free) and gives Discord control over bandwidth costs.
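
The general pattern behind signed expiring URLs can be sketched with Python's standard `hmac` module. This is not Discord's actual signing scheme, which is not public: the secret, the SHA-256 choice, and the exact string that gets signed are assumptions, and only the parameter names `ex`, `is`, and `hm` mirror the URL shown above.

```python
import hashlib
import hmac
import time

SECRET = b"demo-secret"  # hypothetical server-side key, never sent to clients

def sign_url(path, ttl_seconds, now=None):
    """Append expiry, issue time, and an HMAC over both to the path."""
    issued = int(now if now is not None else time.time())
    expires = issued + ttl_seconds
    base = f"{path}?ex={expires:x}&is={issued:x}"   # timestamps in hex
    signature = hmac.new(SECRET, base.encode(), hashlib.sha256).hexdigest()
    return f"{base}&hm={signature}"

def verify_url(url, now=None):
    """Accept only URLs with a valid signature that have not expired."""
    base, _, signature = url.rpartition("&hm=")
    expected = hmac.new(SECRET, base.encode(), hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    if not hmac.compare_digest(signature, expected):
        return False
    expires_hex = base.split("?ex=")[1].split("&")[0]
    current = int(now if now is not None else time.time())
    return current < int(expires_hex, 16)
```

Because the expiry is part of the signed string, a client cannot extend a link's lifetime by editing the `ex` parameter without invalidating the signature.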

Common misconception: "Midjourney hosts all the images." For Discord users, Discord hosts the images. Midjourney uploads them as Discord attachments, and Discord's CDN serves them. This is why Midjourney did not need to build its own CDN infrastructure for years — Discord absorbed the bandwidth cost. The web app (launched with V6.1[13]) changed this equation by requiring Midjourney to serve images outside of Discord.

Tiered Storage Strategy

Not all images deserve the same storage treatment. An image generated 30 seconds ago will be viewed many times in the next few minutes (the user is iterating). An image from six months ago might never be viewed again. A rational storage system uses tiers:

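| Tier | Age | Storage Type | Access Latency | Cost (per TB/month) |
| --- | --- | --- | --- | --- |
| Hot | < 24 hours | SSD-backed object store | ~5 ms | ~$200 |
| Warm | 1-30 days | Standard GCS | ~50 ms | ~$20 |
| Cold | > 30 days | Nearline/Coldline GCS | ~200 ms | ~$4-10 |
| Archive | > 1 year | Archive GCS | Milliseconds to first byte, but with retrieval fees and a 365-day minimum | ~$1.2 |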

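Images migrate down tiers as they age, which GCS lifecycle rules can do automatically. Promoting a revisited image back to a hotter tier is not something lifecycle rules do on their own; that takes access-based tiering (GCS Autoclass moves an object back to Standard when it is read) or an application-level cache in front of cold storage.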

Content-Addressable Storage

Here is a subtlety. Two users with the same prompt and seed will generate identical images. Storing both is wasteful. Content-addressable storage (CAS) solves this: hash the image bytes, use the hash as the storage key. If the hash already exists, return a pointer to the existing blob instead of storing a duplicate. Even a 1% deduplication rate saves 1.5 TB per month at Midjourney's scale.
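
A content-addressable store reduces to a dictionary keyed by a hash of the bytes. The in-memory class below is a minimal sketch; a real system would put the blobs in object storage, and the SHA-256 choice and the `bytes_saved` counter are assumptions made for illustration.

```python
import hashlib

class ContentAddressableStore:
    """Stores each distinct blob once, keyed by the SHA-256 of its bytes."""

    def __init__(self):
        self._blobs = {}
        self.bytes_saved = 0  # bytes not written thanks to deduplication

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        if key in self._blobs:
            # Identical content already stored: hand back the existing key.
            self.bytes_saved += len(data)
        else:
            self._blobs[key] = data
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]

    def __len__(self):
        return len(self._blobs)
```

Two uploads of the same bytes return the same key and occupy storage once, while any one-byte difference produces a different key and a separate blob.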

Worked Example: Storage vs. Bandwidth Costs

The counterintuitive economics. Storage is cheap. Let us price out 150 TB/month of new images on Google Cloud Storage Nearline: 150,000 GB × $0.01/GB/month = $1,500/month. Shockingly cheap for the amount of data.

Now bandwidth. Assume each image is viewed an average of 5 times (initial view, Discord scrollback, gallery browse, share). That is 2.5M × 5 = 12.5M views/day. At 2 MB per view: 25 TB/day of egress. Google Cloud charges ~$0.08/GB for egress: 25,000 GB/day × $0.08 = $2,000/day = $60,000/month.

Bandwidth costs 40x more than storage. This is why every large-scale image service obsesses over CDN caching, image compression, and progressive loading — not storage optimization.
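
The comparison is simple enough to reproduce directly. The prices and view counts below are the assumptions stated in this worked example, not quoted cloud bills, and the variable names are illustrative.

```python
IMAGES_PER_DAY = 2.5e6
IMAGE_MB = 2
VIEWS_PER_IMAGE = 5
DAYS_PER_MONTH = 30

STORAGE_PRICE_PER_GB_MONTH = 0.01   # Nearline figure used above
EGRESS_PRICE_PER_GB = 0.08          # egress figure used above

# One month of new images, stored for one month.
new_data_gb = IMAGES_PER_DAY * IMAGE_MB * DAYS_PER_MONTH / 1000
storage_per_month = new_data_gb * STORAGE_PRICE_PER_GB_MONTH

# Every view moves the full image once.
egress_gb_per_day = IMAGES_PER_DAY * VIEWS_PER_IMAGE * IMAGE_MB / 1000
egress_per_month = egress_gb_per_day * EGRESS_PRICE_PER_GB * DAYS_PER_MONTH

ratio = egress_per_month / storage_per_month
print(f"storage ${storage_per_month:,.0f}/month, egress ${egress_per_month:,.0f}/month")
print(f"egress is {ratio:.0f}x storage")
```

The ratio depends only on views per image, days per month, and the two unit prices, so halving the average image size lowers both bills without changing which one dominates.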

This also explains why Discord CDN was such a gift to Midjourney's early economics. Discord absorbed the egress costs. When Midjourney launched its own web app, they suddenly owned that $60K/month bandwidth bill (likely much higher with web app traffic added).

The Web App Challenge

With the web app launched in August 2024[13], Midjourney now serves images through two channels: Discord (where Discord pays for CDN) and the web app (where Midjourney pays). The web app likely uses its own CDN — Cloudflare or Google Cloud CDN — with edge caching at dozens of global PoPs (Points of Presence).

Edge caching means that when a user in Tokyo views an image, the first request fetches it from the origin server (US) and caches it at the Tokyo edge node. The next Tokyo user who views that image gets it from the local cache, saving a transpacific round trip (~150 ms) and egress cost. Cache hit rates of 60-80% are typical for image CDNs. Since only misses reach the origin, that would cut Midjourney's origin egress by 2.5-5x (1 / (1 − 0.6) = 2.5 and 1 / (1 − 0.8) = 5).

Video: The Next Storage Challenge

Midjourney launched Video V1 in June 2025: 5-second clips at 8x the GPU cost of a still image[29]. A 5-second video at 24 fps in H.264 is roughly 5-15 MB, about 5x larger than a 2 MB still. If 10% of generations shift to video at ~10 MB each, daily storage ingest rises from 5 TB to about 7 TB (2.5M × (0.9 × 2 MB + 0.1 × 10 MB)), and bandwidth costs increase proportionally. Short clips can be served as plain MP4 files, but longer video generally calls for adaptive bitrate streaming (HLS/DASH) for smooth playback across connection speeds.

Storage Pipeline Visualizer

Watch images flow from GPU output through post-processing, moderation, and tiered storage to CDN delivery. The storage meter shows cumulative growth. Notice how bandwidth costs dwarf storage costs.

Midjourney generates 2.5M images/day at ~2 MB each. If each image is viewed on average 5 times, which cost dominates?

Chapter 7: End-to-End Request Trace

You have now seen the queue, the GPU pipeline, and the storage layer individually. But real understanding comes from tracing a single request through the entire system. This chapter follows one /imagine a cyberpunk fox reading a newspaper --v 7 --q 1 command from the moment you press Enter to the moment the image appears in your Discord channel. We will measure latency at every hop and identify where time actually goes.

This is the chapter that makes the architecture real. Memorize this trace and you can whiteboard Midjourney's system design in any interview.

The Happy Path (Fast Mode, ~30 seconds on V7)

Here is every hop, with measured or estimated latency at each stage:

Step 1: User Input (0 ms)
You type /imagine and press Enter in Discord. The Discord client sends a slash command interaction to Discord's API gateway.
↓ ~50 ms (network to Discord)
Step 2: Discord API Gateway (~10 ms)
Discord validates the interaction, looks up the Midjourney bot application ID, and forwards the interaction payload via webhook or gateway to Midjourney's bot server.
↓ ~10 ms (Discord → MJ bot)
Step 3: Bot Server Receives (~5 ms)
Midjourney's bot server parses the interaction. Extracts prompt text, parameters (--v 7, --q 1, --ar 16:9, --stylize, etc.), user ID, subscription tier.
↓ ~5 ms
Step 4: Prompt Parsing & Parameter Extraction (~5 ms)
Separate the prompt text "a cyberpunk fox reading a newspaper" from the flags. Resolve defaults (aspect ratio 1:1, quality 1, no seed specified → random seed). Validate parameter ranges.
↓ ~20 ms
Step 5: Pre-Generation Moderation (~20 ms)
Text classifier checks the prompt against policy[20]. "Cyberpunk fox reading a newspaper" passes easily. Blocked prompts receive an immediate rejection message. No GPU time wasted.
↓ ~5 ms
Step 6: Job Enqueued (~5 ms)
Job enters the Fast priority lane. The bot immediately responds to Discord with an acknowledgment message: "Generating..." with a progress bar at 0%.
↓ ~5 seconds (queue wait, Fast mode)
Step 7: Queue Wait (~5 seconds)
Job waits in the Fast lane until a GPU worker with sufficient memory becomes available. During peak hours this could be 30-60 seconds. During off-peak, nearly instant.
↓ ~100 ms (dispatch)
Step 8: GPU Worker Dispatch (~100 ms)
Orchestrator selects a GPU worker, sends the job payload (prompt embeddings, parameters, seed). Worker acknowledges. A timeout clock starts (120 seconds typical).
↓ ~10-50 seconds (THE BOTTLENECK)
Step 9: Diffusion Inference (10-50 seconds)
THE BOTTLENECK. 20-50 denoising steps, each a full forward pass through the DiT. Text encoding, cross-attention, self-attention with SageAttention, 4-image batch. V7: ~20s. V8: <10s[17]. During inference, progress updates (25%, 50%, 75%) are sent back to the bot.
↓ ~2 seconds
Step 10: Post-Processing (~2 seconds)
VAE decode (latent → pixels), optional upscale, format conversion (PNG/WebP), grid assembly (4 images into 2×2).
↓ ~500 ms
Step 11: Post-Gen Moderation (~500 ms)
Image classifier checks the output[20]. Most images pass. Flagged images are blocked and the user sees a policy violation message instead.
↓ ~1 second
Step 12: Upload to Storage + CDN (~1 second)
Image uploaded to GCS + Discord CDN as an attachment. Signed URL generated. Metadata (prompt, parameters, seed, timestamp) stored in the job database.
↓ ~200 ms
Step 13: Discord Message Updated (~200 ms)
Bot edits its earlier "Generating..." message, replacing the progress bar with the finished image grid. Adds action buttons (U1-U4 for upscale, V1-V4 for variations). User sees the result.
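Step 4 is the easiest hop to make concrete. The following is a minimal sketch of splitting prompt text from `--flag value` pairs, not Midjourney's actual parser: the function name, the defaults dictionary, the 32-bit seed range, and the aspect-ratio check are all assumptions for illustration.

```python
import random
import re

# Defaults named in step 4; the dict layout itself is an assumption.
DEFAULTS = {"ar": "1:1", "q": "1"}
ASPECT_RE = re.compile(r"^\d+:\d+$")


def parse_command(raw: str) -> dict:
    """Split the text after /imagine into prompt words and --flag values."""
    tokens = raw.split()
    prompt_words = []
    params = dict(DEFAULTS)
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token.startswith("--") and len(token) > 2:
            # A flag takes the next token as its value unless that is another flag.
            has_value = i + 1 < len(tokens) and not tokens[i + 1].startswith("--")
            params[token[2:]] = tokens[i + 1] if has_value else "true"
            i += 2 if has_value else 1
        else:
            prompt_words.append(token)
            i += 1
    # No seed given: pick a random one (the 32-bit range is an assumption).
    params.setdefault("seed", str(random.randrange(2**32)))
    if not ASPECT_RE.match(params["ar"]):
        raise ValueError(f"bad aspect ratio: {params['ar']}")
    return {"prompt": " ".join(prompt_words), "params": params}


job = parse_command("a cyberpunk fox reading a newspaper --v 7 --q 1")
print(job["prompt"])
print(job["params"])
```

Doing this cheaply on the bot server, before moderation and queueing, means malformed commands are rejected without touching a GPU.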

Latency Budget Breakdown

Let us add up the total for a V7 Fast mode generation:

| Phase | Latency | % of Total |
| --- | --- | --- |
| Network + API routing | ~80 ms | 0.3% |
| Parsing + moderation (pre) | ~30 ms | 0.1% |
| Queue wait (Fast) | ~5,000 ms | 17% |
| GPU inference (diffusion) | ~20,000 ms | 69% |
| Post-processing + moderation (post) | ~2,500 ms | 9% |
| Upload + delivery | ~1,200 ms | 4% |
| Total | ~29 seconds | 100% |

For V7, the bottleneck is GPU inference. At roughly 69% of total latency, the diffusion process outweighs everything else combined. This is why V8's 5x speedup[17] was transformational: cutting inference from ~20 seconds to ~4 seconds takes the total from ~29 seconds to ~13 seconds by attacking the phase that matters most.
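Summing the per-phase estimates directly is a quick way to sanity-check a latency budget. The sketch below uses the phase latencies listed in this section. The dictionary and function names are illustrative. As listed, the phases add up to a little under 29 seconds.

```python
# Per-phase latency estimates in milliseconds, taken from the trace above.
PHASES_MS = {
    "Network + API routing": 80,
    "Parsing + moderation (pre)": 30,
    "Queue wait (Fast)": 5_000,
    "GPU inference (diffusion)": 20_000,
    "Post-processing + moderation (post)": 2_500,
    "Upload + delivery": 1_200,
}


def budget_shares(phases_ms: dict[str, int]) -> dict[str, float]:
    """Return each phase's share of total latency as a percentage."""
    total = sum(phases_ms.values())
    return {name: 100.0 * ms / total for name, ms in phases_ms.items()}


total_ms = sum(PHASES_MS.values())
print(f"total: {total_ms / 1000:.1f} s")
for name, pct in budget_shares(PHASES_MS).items():
    print(f"{name:<38}{pct:5.1f}%")
```

Recomputing shares from raw latencies, rather than typing percentages by hand, keeps the column summing to 100%.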

The Progress Update Mechanism

While the GPU is working, the user is staring at a progress bar. How does it update? The GPU worker sends progress messages back to the bot server after every N denoising steps. The bot server then edits its Discord message using the Discord API's message edit endpoint. A typical generation steps through five progress states: 0% → 25% → 50% → 75% → 100% (final image). Some users see intermediate noisy previews: at each progress checkpoint the GPU worker decodes the current latent to a low-resolution preview image and sends it along with the update.
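One way to picture the checkpoint logic is as a sampling loop that reports at evenly spaced steps. This is a sketch under assumptions: the function names are hypothetical, the denoising step is a placeholder, and the real worker protocol is not public.

```python
import math
from typing import Iterator


def progress_checkpoints(total_steps: int, updates: int = 4) -> Iterator[tuple[int, int]]:
    """Yield (step, percent) pairs at evenly spaced points in the denoising loop.

    `updates=4` gives the 25/50/75/100 pattern described above.
    """
    for k in range(1, updates + 1):
        yield math.ceil(total_steps * k / updates), round(100 * k / updates)


def run_denoising(total_steps: int) -> list[str]:
    """Stand-in for the sampling loop; records the message edits it would trigger."""
    due = dict(progress_checkpoints(total_steps))
    edits = ["0%"]  # the initial acknowledgment message
    for step in range(1, total_steps + 1):
        # ... one denoising step would run here ...
        if step in due:
            edits.append(f"{due[step]}%")  # the bot would edit its Discord message here
    return edits


print(run_denoising(30))
```

Keeping the number of updates small matters because each one is a Discord API call, and those are rate limited.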

The Relax Path (~5-30 minutes)

For a Relax-mode user, steps 1-6 are identical. The difference is step 7: instead of waiting 5 seconds, the job sits in the Relax lane for 0-30 minutes[26]. The system drains Relax jobs only when GPU workers are idle — during off-peak hours or when Turbo/Fast demand dips. The user sees "Queued (Relax)" and a position indicator. The actual GPU inference time (step 9) is the same as Fast mode — Relax does not use fewer steps or lower quality.
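The "drain Relax only when nothing else is waiting" rule can be sketched as a strict-priority scheduler. The lane names follow the text. The class, its methods, and the strict ordering are assumptions, since the real dispatch policy is not public.

```python
from collections import deque
from typing import Optional

# Lane names follow the text; the strict ordering is an assumption.
LANE_ORDER = ("turbo", "fast", "relax")


class LaneScheduler:
    """Hands out jobs lane by lane, highest priority first."""

    def __init__(self) -> None:
        self.lanes = {name: deque() for name in LANE_ORDER}

    def submit(self, lane: str, job_id: str) -> None:
        self.lanes[lane].append(job_id)

    def next_job(self) -> Optional[str]:
        """Called when a GPU worker becomes free; returns None if all lanes are empty."""
        for name in LANE_ORDER:
            if self.lanes[name]:
                return self.lanes[name].popleft()
        return None


sched = LaneScheduler()
sched.submit("relax", "r1")
sched.submit("fast", "f1")
sched.submit("relax", "r2")
print([sched.next_job() for _ in range(4)])
```

Strict priority can starve the lowest lane under sustained load, so a production scheduler would likely add aging or a maximum wait. That would fit the bounded Relax wait described above.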

The Error Paths

What happens when things go wrong? Three failure modes dominate:

| Failure | Where | What Happens | User Sees |
| --- | --- | --- | --- |
| Moderation block | Step 5 or 11 | Job terminated. At step 5 no GPU time is used; at step 11 the GPU time is already spent | "This prompt violates our policy" |
| GPU OOM | Step 9 | Worker crashes, job retried on a different worker with more VRAM | Longer wait, then result (usually) |
| Timeout | Step 9 | 120-second timeout exceeded, likely a complex scene with a high step count. Job goes to the dead letter queue | "Generation failed, please try again" |

The moderation block at step 5 is deliberately placed before the queue to avoid wasting GPU time on prompts that would be rejected anyway. The post-gen moderation at step 11 catches cases where an innocent-sounding prompt produces a policy-violating image (the model can sometimes generate unexpected content from ambiguous prompts).

V7 vs V8: The Migration Impact

The PyTorch migration[4] shrank the dominant phase by 5x. Let us compare:

V7 (JAX/TPU training, GPU inference)
GPU inference: ~20 seconds
Total end-to-end: ~29 seconds
Bottleneck: 69% GPU
Max resolution: 1K (upscale for 2K)
V8 (PyTorch/GPU, complete rewrite)
GPU inference: ~4 seconds
Total end-to-end: ~13 seconds
Bottleneck: 31% GPU, 39% queue
Native 2K resolution[17]
Common misconception: "V8 is faster because of better hardware." V8's 5x speedup comes from software — the PyTorch rewrite, SageAttention[15], better parallelism via xDiT[16], and architectural improvements. The GPUs are the same. This is why great systems engineers are worth more than faster chips.

Notice something interesting in the V8 column: the bottleneck shifted. When GPU inference drops to 4 seconds, the queue wait (5 seconds) becomes the largest component. This is Amdahl's Law at work: a 5x speedup on the dominant phase yields only a little over 2x end to end, because the phases that were not optimized now dominate. The next optimization frontier for Midjourney is queue dispatch latency, not GPU speed.
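Amdahl's Law itself is a one-line formula. The sketch below applies it to an assumed, rounded scenario in which inference is about 70% of end-to-end latency and becomes 5x faster. The function name is illustrative.

```python
def amdahl_speedup(fraction: float, factor: float) -> float:
    """Overall speedup when `fraction` of the runtime is made `factor` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)


# Assumed illustration: inference is ~70% of end-to-end latency and gets 5x faster.
print(f"5x faster inference: {amdahl_speedup(0.70, 5.0):.2f}x overall")

# Even an infinitely fast GPU cannot beat 1 / (1 - fraction).
print(f"upper bound:         {amdahl_speedup(0.70, float('inf')):.2f}x overall")
```

The upper bound is the useful part: once inference is fast, further GPU gains barely move the total, so attention shifts to the queue and post-processing.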

Worked Example: Throughput Under Load

Peak-hour math. Assume 1 million concurrent users[11] and 5% are actively generating at any moment = 50,000 active generators. Each submits one job every 2 minutes on average = 25,000 jobs/minute = 417 jobs/second. With 10,000 GPUs[5] and each job using one GPU for ~10 seconds (V8): max throughput = 10,000 / 10 = 1,000 jobs/second. So peak demand (417/s) is well under capacity (1,000/s). But remember: Turbo jobs may use 2-4 GPUs, upscale jobs use GPUs too, and video uses 8x the GPU time[29]. Real headroom is tighter than it looks.
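The same peak-hour estimate as a sketch, so the assumptions can be varied. All inputs are the figures from the worked example. The function names and the `gpus_per_job` knob are introduced here for illustration.

```python
def peak_demand_jobs_per_sec(concurrent_users: int, active_share: float,
                             seconds_between_jobs: float) -> float:
    """Jobs per second submitted by the actively generating share of users."""
    return concurrent_users * active_share / seconds_between_jobs


def fleet_capacity_jobs_per_sec(gpus: int, gpus_per_job: float,
                                seconds_per_job: float) -> float:
    """Jobs per second the fleet can complete if every GPU is busy."""
    return gpus / gpus_per_job / seconds_per_job


demand = peak_demand_jobs_per_sec(1_000_000, 0.05, 120)
one_gpu = fleet_capacity_jobs_per_sec(10_000, 1, 10)
two_gpu = fleet_capacity_jobs_per_sec(10_000, 2, 10)  # e.g. multi-GPU Turbo jobs

print(f"demand:   {demand:.0f} jobs/s")
print(f"capacity: {one_gpu:.0f} jobs/s at 1 GPU/job, {two_gpu:.0f} jobs/s at 2 GPUs/job")
```

Doubling the GPUs per job halves capacity, which shows how quickly the apparent headroom shrinks once multi-GPU and video jobs enter the mix.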
Request Flow Tracer

Watch a glowing dot trace the full path from /imagine to delivered image. Each component lights up on arrival, and the latency counter accumulates. Toggle between Happy Path (~15s), Relax Path (~5min), and Moderation Block.

In Midjourney's V8 system, a Fast mode request takes ~12 seconds end-to-end. Which single optimization would reduce this the most?

Chapter 8: The Evolution — From 10 Engineers to 10,000 GPUs

Most startups scale by adding features. Midjourney scaled by adding zeros: zero marketing spend, zero venture capital, zero public papers. What they did add was compute: from a handful of GPUs in a rented cloud account to a fleet of 10,000 GPUs spread across Google Cloud[1]. This is the story of how an architecture evolves when your user count goes from 10 to 20 million in four years.

It started in August 2021, when David Holz — fresh from selling Leap Motion — gathered roughly 10 engineers[9]. They built a working prototype in one month. One month. No product-market fit research, no advisory board, no pitch deck. Just a Discord bot that turned text into images. The prototype used a standard diffusion model running on rented NVIDIA GPUs. At that scale, the "architecture" was basically one server with a queue.

Then came the open beta in July 2022 — and everything broke.

The scaling paradox: Midjourney went viral precisely BECAUSE the architecture was simple — a Discord bot, no sign-up flow, no app download. But that simplicity meant they had zero infrastructure for scale. When millions of users showed up, they literally ran out of GPUs at every cloud vendor they tried. The thing that made them grow was the thing that couldn't handle growth.

The first real architectural decision came in November 2022, with V4. Holz moved training to Google Cloud TPU v4 pods[2], using JAX — Google's framework optimized for TPU hardware. This was a bet: JAX had a smaller community than PyTorch, fewer tutorials, fewer engineers who knew it. But TPU v4 offered raw training speed that GPUs couldn't match at the time, and JAX extracted every flop from them.

Meanwhile, inference stayed on NVIDIA GPUs[3]. This split — TPU for training, GPU for inference — defined the architecture for three years. It worked beautifully until it didn't.

V5 (March 2023) brought "significantly different neural architectures." V6 (December 2023) was trained from scratch over 9 months[21] — a massive compute investment that only a profitable, VC-free company could justify without quarterly pressure. By now the team had grown to ~40 people[9], revenue was $200M[10], and the fleet had expanded to thousands of GPUs handling 2.5 million images per day[12].

Then came V8 in March 2026 — and it changed everything.

The V8 Rewrite: Burning the Ships

V8 was not an incremental improvement. It was a complete rewrite from JAX/TPU to PyTorch/GPU[4]. The entire training and inference stack — every custom kernel, every optimization trick, every workaround accumulated over three years of rapid iteration — was thrown away and rebuilt from scratch in PyTorch.

Why would a profitable company with a working system do this? Three reasons:

Ecosystem Lock-in
JAX has a smaller community. Hiring JAX engineers is 5-10x harder than hiring PyTorch engineers. Every new library, tool, and paper targets PyTorch first.
Hardware Flexibility
TPU v4 only runs on Google Cloud. PyTorch + GPU runs anywhere — any cloud, on-prem, NVIDIA's latest chips. Hardware independence = negotiating power on price.
Technical Debt
Three years of "ship fast, fix later" had created a codebase full of workarounds. A clean rewrite let them redesign the architecture for the current scale, not the scale they had in 2022.

The result: V8 inference is 5x faster than V6, generates at native 2K resolution, and completes in under 10 seconds[17]. That speed gain — not a marginal 20% tweak but a 5x leap — is what justified the risk of rewriting everything.

The rewrite rule: Never rewrite a working system for incremental improvement. Only rewrite when the expected gain is transformational — 3x minimum. Midjourney got 5x. At that multiple, the risk of regression bugs and the cost of duplicated effort are worth it. Below 2x, iterate.

Revenue Efficiency: The Smallest Big Tech Company

As of 2026, Midjourney has roughly 192 employees[9] and generates approximately $500M in annual revenue[10]. That's about $2.6M revenue per employee, and if you count only engineers (roughly half the team), it's closer to $5M per engineer[25]. For comparison, Google generates about $1.5M per employee. Meta, $1.7M. Per employee, Midjourney brings in roughly 1.5-1.7x what two of the most profitable tech companies on Earth do.

How? No VC means no growth-at-all-costs pressure[8]. No free tier means every user pays. No marketing spend means Discord virality does the work. No papers and no blog means no team dedicated to external communications. Every person ships product.

Worked example — revenue per GPU-hour: $500M revenue / year. ~10,000 GPUs[5] running 24/7 = 87.6M GPU-hours/year. Revenue per GPU-hour = $500M / 87.6M = $5.71/GPU-hour. GPU cloud cost is ~$2-3/hour for A100s. That's roughly a 2x margin on raw compute alone — before accounting for staff, CDN, and Discord costs. Tight, but profitable because 90% of cost is inference[6], and they've optimized inference relentlessly.
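The revenue-per-GPU-hour estimate can be reproduced with a short sketch. Revenue, fleet size, and the $2-3/hour A100 price are the figures quoted in this lesson. The function name is illustrative, and the fleet is assumed to be billed around the clock.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760


def revenue_per_gpu_hour(annual_revenue: float, gpus: int) -> float:
    """Annual revenue divided by fleet GPU-hours, assuming 24/7 operation."""
    return annual_revenue / (gpus * HOURS_PER_YEAR)


rev = revenue_per_gpu_hour(500e6, 10_000)
print(f"revenue per GPU-hour: ${rev:.2f}")
for cost in (2.0, 3.0):
    print(f"at ${cost:.0f}/GPU-hour: revenue is {rev / cost:.1f}x raw compute cost")
```

The ratio lands between roughly 2x and 3x depending on the assumed hourly price, which is why the margin on raw compute reads as tight rather than lavish.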

What's Next: Beyond Images

Midjourney hired Ahmad Abbas from Apple Vision Pro to lead hardware efforts. They launched Video V1 in June 2025: 5-second clips at 8x the GPU cost of images. They're exploring 3D generation and real-time interactive creation. Each new modality multiplies compute demand. The fleet that handles 2.5M images/day may need to handle 2.5M videos/day, at 8x the cost per job. That is an 8x compute scaling challenge on its own, or 9x if image volume holds steady alongside it.

Below, you can trace this evolution interactively. Drag the slider to move through time and watch the architecture transform at each breakpoint.

Scale Timeline — Architecture Evolution

Drag the slider to see how Midjourney's infrastructure evolved from 10 users to 20 million. Components appear, split, and merge as scale demands.

Why did Midjourney rewrite their entire stack from JAX/TPU to PyTorch/GPU in V8, rather than incrementally improving the existing codebase?

Chapter 9: When Things Break — Failure Modes & Reliability

Running 10,000 GPUs[5] is not like running 10. At 10 GPUs, failures are events — you notice them, you fix them, you move on. At 10,000 GPUs, failures are weather. They're constant, ambient, and you design around them the way a ship designer accounts for waves. You don't prevent them. You survive them.

Here's the math that changes your thinking. GPU hardware failure rates in large data centers run 1-3% at any given time. On a fleet of 10,000 GPUs, that means 100 to 300 GPUs are failing right now. Not "might fail someday." Failing right now, this second, as you read this sentence. Some have memory errors. Some have thermal throttling. Some have crashed drivers. Some just stopped responding. Every single day, the fleet regenerates — bad GPUs get pulled, repaired or replaced, and put back. The fleet is a living organism, not a machine.

Misconception: "Reliability means preventing failures." At Midjourney's scale, preventing all failures is impossible. A 99.9% reliable GPU still means 10 failures at any given moment across the fleet. Reliability means surviving failures — making the system so resilient that users never notice when a GPU dies mid-generation. The question isn't "will something break?" but "when it breaks, what happens to the user's image?"

Let's think about this concretely. A user hits "Generate" on a prompt. Their job enters the queue, gets assigned to GPU #4,721, and the diffusion process begins — 20, 30, maybe 50 denoising steps. On step 34, the GPU's memory controller throws an ECC error and the process crashes. What happens?

The Circuit Breaker Pattern

When a GPU worker fails, the orchestrator doesn't just retry on the same GPU. That would be madness — the GPU is probably still broken. Instead, it uses a circuit breaker: if a GPU fails N times within a window (say, 3 failures in 10 minutes), it's pulled from the active pool entirely. No more jobs get routed to it. A health-check daemon monitors pulled GPUs and reintroduces them only after they pass a diagnostic suite.

The failed job itself gets re-queued with high priority — it goes to the front of the line, not the back. The user sees a brief extra delay, maybe 5-10 seconds, but they get their image. They never know a GPU died.

GPU Fails Mid-Job
ECC error, OOM, driver crash, thermal shutdown
Orchestrator Detects
Heartbeat timeout (no response in 5-10 seconds)
Circuit Breaker Trips
Mark GPU as suspect, increment failure counter
Job Re-queued
Priority boost, assigned to healthy GPU, restart from checkpoint or from scratch
GPU Quarantined
If 3+ failures: pull from pool, run diagnostics, repair or replace
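The flow above maps naturally onto a sliding-window failure counter. Below is a minimal sketch of such a breaker, assuming the "3 failures in 10 minutes" threshold from the text. The class and method names are hypothetical, and the clock is injectable so the logic can be exercised without waiting.

```python
import time
from collections import deque
from typing import Callable


class GpuCircuitBreaker:
    """Quarantine a GPU after `max_failures` failures within `window_s` seconds."""

    def __init__(self, max_failures: int = 3, window_s: float = 600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.window_s = window_s
        self.clock = clock
        self._failures: dict[str, deque] = {}
        self.quarantined: set[str] = set()

    def record_failure(self, gpu_id: str) -> bool:
        """Record a failure; return True if the GPU is now quarantined."""
        now = self.clock()
        times = self._failures.setdefault(gpu_id, deque())
        times.append(now)
        # Drop failures that have fallen out of the sliding window.
        while times and now - times[0] > self.window_s:
            times.popleft()
        if len(times) >= self.max_failures:
            self.quarantined.add(gpu_id)
        return gpu_id in self.quarantined

    def is_available(self, gpu_id: str) -> bool:
        """The orchestrator would check this before routing a job."""
        return gpu_id not in self.quarantined

    def readmit(self, gpu_id: str) -> None:
        """Call once the GPU has passed its diagnostic suite."""
        self.quarantined.discard(gpu_id)
        self._failures.pop(gpu_id, None)


breaker = GpuCircuitBreaker()
for _ in range(3):
    tripped = breaker.record_failure("gpu-4721")
print(tripped, breaker.is_available("gpu-4721"))
```

Re-queueing the failed job is a separate concern from quarantining the GPU. Keeping the breaker this small makes it easy to call from whatever component detects the heartbeat timeout.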

The Cold Start Problem

When a fresh GPU comes online (or a repaired GPU returns to the pool), it can't serve jobs immediately. The diffusion model weights — we're talking 2-10 GB depending on the model version and resolution — need to be loaded from storage into GPU VRAM. This takes 30 seconds to 2 minutes depending on the model size and network speed. During that time, the GPU is consuming electricity and costing money but producing nothing.

The solution is warm pools: keep a subset of GPUs with models pre-loaded at all times, even if they're idle. When a job arrives, it goes to a warm GPU instantly. Cold GPUs are loaded in the background and added to the warm pool as capacity grows. The tradeoff is real: an idle warm A100 costs $2-3/hour in cloud fees. Twenty idle warm GPUs cost $40-60/hour. That's the price of instant responsiveness.

Worked example, daily GPU failure cost: assume 2% of the 10,000-GPU fleet fails on a given day, which is 200 GPU failures per day. Each failure wastes one in-flight job (~30 seconds of GPU time) plus cold-start recovery (~1 minute). Total wasted time: 200 × 1.5 min = 300 GPU-minutes = 5 GPU-hours. At $3/GPU-hour, that's $15/day, or about $5,500/year. Compared to $500M in revenue[10], GPU failures cost about 0.001% of revenue. The lesson: at scale, individual hardware failures are a rounding error. System-level failures are what kill you.
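The failure-cost estimate is easy to parameterize. In this sketch the 2% daily failure rate, 1.5 wasted minutes per failure, and $3/GPU-hour are the worked example's assumptions, and the function name is illustrative.

```python
def daily_failure_cost(gpus: int, daily_failure_rate: float,
                       wasted_minutes_per_failure: float,
                       dollars_per_gpu_hour: float) -> float:
    """Dollar cost per day of GPU time lost to individual hardware failures."""
    failures = gpus * daily_failure_rate
    wasted_hours = failures * wasted_minutes_per_failure / 60.0
    return wasted_hours * dollars_per_gpu_hour


per_day = daily_failure_cost(10_000, 0.02, 1.5, 3.0)
per_year = per_day * 365
share_of_revenue = per_year / 500e6

print(f"${per_day:.0f}/day, ${per_year:,.0f}/year, {share_of_revenue:.4%} of revenue")
```

Even raising the failure rate tenfold leaves the cost at a tiny fraction of revenue, which supports the point that system-level failures are the real risk.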

Graceful Degradation: Relax Mode as a Shock Absorber

Midjourney's tier system — Fast, Relax, Turbo[18] — isn't just a pricing mechanism. It's an architecture-level resilience feature. When the GPU fleet is under strain (peak hours, partial outage, or a viral prompt trend that sends request volume spiking), the system degrades gracefully. Fast and Turbo users keep their priority, and Relax users simply wait longer. No one gets an error. No one gets rejected. The queue just stretches.

Think of it like a hospital triage system. Emergency patients (Fast/Turbo) get seen first. Walk-in patients (Relax) wait. During a disaster, walk-ins might wait hours — but they still get seen. Nobody is turned away.

The Discord Dependency — The Real Risk

GPU failures are manageable. The real availability risk for Midjourney for most of its history was Discord itself. Until the web app launched with V6.1 in August 2024[13], Discord was a single point of failure. If Discord went down — API outage, rate limiting, server issues — Midjourney went down. Completely. All 20 million users[11] locked out. GPUs sitting idle. Revenue dropping to zero.

This happened multiple times. Discord rate limits throttled bot commands. Discord CDN outages meant generated images couldn't be delivered. Discord auth issues meant users couldn't even start sessions. None of these were Midjourney bugs — they were dependency failures, and Midjourney had zero control over them.

This is why they built the web app. It wasn't about a better UI. It was about survival. The web app gave them a second entry point, independent of Discord's infrastructure. Today, if Discord goes down, web app users keep generating. The dependency is halved, not eliminated — Discord is still the primary channel for millions — but the existential risk is gone.

Moderation Failure Modes

Content moderation has two failure modes, and they're in direct tension. False positives block legitimate prompts — a user trying to generate "surgery scene for medical textbook" gets rejected. This frustrates paying customers. False negatives let harmful content through — generating photorealistic deepfakes or explicit material. This creates PR disasters and legal exposure.

Midjourney runs both pre-generation text moderation and post-generation image moderation[20]. The pre-gen filter is fast (text classification, milliseconds) and catches obvious violations. The post-gen filter is slower (image classification, seconds) and catches outputs that look benign from the prompt but produce problematic images. If the post-gen filter triggers, the image is generated (GPU time already spent) but never delivered to the user. That's wasted compute — but the alternative (no post-gen filter) is worse.

What happens if moderation itself goes down? This is the nightmare scenario. Without moderation, the system must shut down entirely. You cannot serve unfiltered AI-generated images to millions of users — the legal, ethical, and reputational risk is catastrophic. Moderation is not a feature. It's load-bearing infrastructure.

Below, you can simulate what happens when various components fail. Click a component to "kill" it and watch the cascade.

Failure Cascade Simulator

Click any component to disable it (turns red). Watch how failures propagate through the system. Click the same component again to restore it, or use Reset.

What was the BIGGEST availability risk for Midjourney before 2024?

Chapter 10: The Tradeoffs — Design Decisions & Alternatives

Every architectural decision is a tradeoff. You gain something; you lose something. The trick isn't finding the "right" answer — it's understanding what you're trading away, so you can make the trade consciously. Midjourney made five major architectural decisions that shaped everything. Each one had a clear alternative, a clear cost, and a clear payoff. Let's walk through all five.

Decision 1: Discord as Primary UI

In 2021, Midjourney didn't build an app. They didn't build a website. They built a Discord bot. Users typed /imagine in a chat channel, and images appeared. That's it. No login flow, no sign-up page, no app store review, no mobile development, no web hosting.

What they gained was staggering. Discord's server system meant that when one person generated an image, everyone in the channel saw it. Images were inherently social — you could watch what other people were creating, react to it, remix it. This turned every user into a marketing channel. The result: 20 million users[11] at near-zero customer acquisition cost, making Midjourney the largest Discord server in history.

What they lost was control. Discord sets the rate limits. Discord controls the UX. Discord can change its API, its pricing, or its terms of service at any time. And critically — Discord was a single point of failure. If Discord went down, Midjourney went down. For three years, a $300M+ revenue company depended entirely on another company's infrastructure for its user interface.

Misconception: "Discord was a temporary hack they should have replaced sooner." No. Discord wasn't just convenient — it was load-bearing. The social visibility that drove viral growth doesn't exist in a private web app. When Midjourney launched its web app in August 2024[13], they kept Discord as a first-class channel. The web app isn't a replacement — it's a backup.

Decision 2: Google Cloud TPUs for Training

When V4 launched in November 2022, Midjourney moved training to Google Cloud TPU v4 pods[2] using the JAX framework. TPU v4 pods offered massive matrix-multiply throughput — better raw TFLOPS per dollar for training large diffusion models. JAX, Google's ML framework, was the natural fit for TPU hardware.

The gain: faster training for V4, V5, and V6. The ability to train V6 from scratch in 9 months[21] rather than 18 months on GPUs (rough estimate based on compute differences).

The loss: JAX lock-in. The JAX ecosystem is smaller than PyTorch's — fewer libraries, fewer Stack Overflow answers, fewer job candidates who know it. Every custom operator, every training trick, every optimization had to be built in JAX. And when they wanted to leave, they had to rewrite everything[4]. The JAX decision bought them three years of speed and cost them a massive rewrite.

Decision 3: Closed Source, No Papers, No Blog

Midjourney has published zero papers. Zero blog posts. Zero technical talks. They don't open-source their models, their training code, or their inference stack. In an industry where Stability AI, Google, Meta, and OpenAI all publish extensively, Midjourney is a black box.

What they gained: a competitive moat. Nobody can replicate their exact architecture, their training data pipeline[22], their model weights, or their inference optimizations. Competitors can study papers from other labs and build on them — but Midjourney's work stays proprietary.

What they lost: academic reputation, open-source community contributions, and a recruiting pipeline. Top ML researchers want to publish. They want their work cited. Midjourney can't offer that. But with $500M in revenue[10] and no VC dilution[8], they can offer something else: equity in a profitable company. The market validated this choice.

Decision 4: No Official API

As of 2026, Midjourney still has no official REST API for developers. Every image generation goes through either Discord or the web app. There's no curl midjourney.com/v1/generate. No API keys. No per-call billing.

What they gained: simplicity. Revenue comes through subscriptions — flat monthly fees of $10-120. No metering infrastructure, no usage-based billing, no API abuse mitigation (rate limiting, auth, key management). The billing system is Discord and Stripe. That's it.

What they lost: the developer ecosystem. Canva, Figma, Adobe, and hundreds of startups would pay for API access. Enterprise deals with SLAs and custom integrations are off the table. Third-party developers reverse-engineer the Discord API anyway, building unofficial wrappers — Midjourney gets zero revenue from this usage and no control over the experience.

Decision 5: Small Team, High Revenue per Employee

At ~192 people[9] generating $500M[10], Midjourney achieves about $2.6M revenue per employee. For engineers specifically, it's closer to $5M per engineer[25]. Compare: Stability AI had ~200 people at $100M revenue (peak). DALL-E is backed by thousands of OpenAI employees across many products.

What they gained: speed. Fewer people means fewer meetings, fewer approval chains, fewer Slack threads. David Holz can make architectural decisions in hours, not quarters. The V8 rewrite — a terrifying, company-bet decision — was made and executed without a committee.

What they lost: breadth. Video launched in June 2025, years after competitors. 3D is still nascent. They don't have enterprise sales, a developer relations team, or a research publications group. They do one thing — image generation — and they do it at world-class level. Everything else waits.

Worked example — the cost of Discord as a platform: Midjourney has stored 148+ TB of images on Discord's CDN[23]. Discord's Nitro costs are borne by users, not Midjourney. But Discord takes a cut of subscription payments through its App Directory. Let's estimate: if 10% of Midjourney's $500M revenue flows through Discord's payment system at Discord's standard 15% cut, that's $7.5M/year — the cost of Discord as a platform. Is that more or less than building and operating their own auth, payments, CDN, and social features? Almost certainly less. Discord is a good deal, even with the dependency risk.

Below, you can compare each decision side-by-side. Toggle between the five major tradeoffs to see what Midjourney chose, what they could have chosen instead, and what each path costs.

Tradeoff Comparison Matrix

Click each decision tab to compare what Midjourney chose vs. the alternative. The warm-colored side is what they picked.

| Decision | Chose | Alternative | Biggest Consequence |
| --- | --- | --- | --- |
| Platform | Discord bot | Custom app | 20M users at zero CAC, but single point of failure |
| Training HW | TPU v4 + JAX | GPU + PyTorch | Faster training, but required complete V8 rewrite |
| Openness | Closed source | Open source | $500M revenue moat, but no academic community |
| API | No public API | Developer API | Simple billing, but lost enterprise ecosystem |
| Team size | ~192 people | Scale to 1000+ | $5M/engineer, but limited to one product |
Which tradeoff had the biggest long-term technical consequence, requiring Midjourney to eventually rebuild their entire system?

Chapter 11: The Blueprint — Patterns You Can Steal

Everything we've studied in this lesson — the queue architecture, the GPU fleet management, the Discord platform strategy, the tiered pricing — these aren't unique to image generation. They're reusable architectural patterns that apply to any system where expensive async compute serves consumer users. If you're building a video encoding pipeline, a real-time 3D renderer, a scientific simulation service, or any GPU-heavy inference product, these patterns are yours to steal.

Let's extract six patterns. For each: what the pattern is, when to use it, and a concrete example beyond Midjourney.

Pattern 1: The Platform Launcher

When to use: You're building a consumer AI product with zero marketing budget and need distribution yesterday.
The pattern: Don't build your own app. Build a bot or plugin on an existing platform with millions of users — Discord, Slack, Telegram, WhatsApp. Let the platform handle auth, payments, social sharing, and push notifications. Your engineering team focuses entirely on the AI backend. The platform is your frontend.

Midjourney did this with Discord and grew to 20 million users[11] with essentially zero customer acquisition cost. The key insight: the social visibility of generations (everyone in the channel sees your images) turned every user into an unpaid marketer.

Beyond Midjourney: ChatGPT in Slack, Notion AI (embedded in existing Notion workspace), GitHub Copilot (embedded in VS Code). All piggyback on platforms where users already live.
The exit strategy: Always plan for the day you'll need your own app. Discord worked until it became a single point of failure. Build the platform integration first for growth, but budget for the native app when you hit $100M+ revenue.

Pattern 2: The GPU Job Queue

When to use: Any system where processing takes more than 5 seconds per request — ML inference, video transcoding, 3D rendering, scientific simulation, large document processing.
The pattern: Decouple submission (fast, constant time) from processing (slow, seconds to minutes). Users submit to a queue and get a job ID immediately. Workers pull from the queue at their own pace. The queue absorbs traffic spikes, handles priority ordering, and provides a natural backpressure mechanism.

Midjourney processes 2.5 million images per day[12] at 20-40 jobs/second[24]. Each generation uses 5 petaops of compute[7] and takes 10-60 seconds. Without a queue, a traffic spike would crash the GPU fleet. With a queue, it just makes users wait a bit longer.
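The core of the pattern fits in a few lines with Python's standard `queue` module. This is a single-process sketch: the `submit` and `worker` names, the in-memory `results` dict, and the bound of 1,000 are assumptions, and a real deployment would use a durable, distributed queue.

```python
import queue
import threading
import uuid

# Bounded queue: when it is full, submit() fails fast, which is the backpressure signal.
jobs: "queue.Queue[tuple[str, str]]" = queue.Queue(maxsize=1000)
results: dict[str, str] = {}


def submit(prompt: str) -> str:
    """Enqueue a job and return its ID immediately; raises queue.Full when saturated."""
    job_id = uuid.uuid4().hex
    jobs.put_nowait((job_id, prompt))
    return job_id


def worker() -> None:
    """Pull jobs at the worker's own pace; the string result stands in for GPU work."""
    while True:
        job_id, prompt = jobs.get()
        results[job_id] = f"image for: {prompt}"
        jobs.task_done()


threading.Thread(target=worker, daemon=True).start()
job = submit("a cyberpunk fox reading a newspaper")
jobs.join()  # a real client would poll or be notified instead of blocking
print(job, "->", results[job])
```

The caller gets a job ID in constant time no matter how long the work takes. The bounded queue turns overload into an explicit, fast rejection instead of a slow crash.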

Worked example — queue sizing: Peak load: 40 jobs/sec. Average generation: 30 seconds. GPU fleet: 10,000 GPUs[5]. If each GPU handles 1 job at a time, fleet throughput = 10,000 / 30 = ~333 jobs/sec. At 40 jobs/sec input, utilization = 40/333 = 12%. That seems low — but remember Fast/Relax/Turbo tiers[18], concurrency limits (3-15 jobs per user[19]), and bursty demand. The queue smooths all of this.
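The sizing estimate above is an application of Little's Law: the average number of jobs in service equals arrival rate times service time. A short sketch using the worked example's numbers (function names are illustrative):

```python
def busy_gpus(arrival_rate_per_sec: float, service_time_sec: float) -> float:
    """Little's Law: average number of jobs in service at once (one GPU per job)."""
    return arrival_rate_per_sec * service_time_sec


def fleet_utilization(arrival_rate_per_sec: float, service_time_sec: float,
                      gpus: int) -> float:
    """Average fraction of the fleet that is busy."""
    return busy_gpus(arrival_rate_per_sec, service_time_sec) / gpus


print(f"GPUs busy on average: {busy_gpus(40, 30):.0f}")
print(f"fleet utilization:    {fleet_utilization(40, 30, 10_000):.0%}")
```

This is an average, so it says nothing about bursts. The gap between average utilization and fleet size is the headroom the queue and the priority tiers draw on during spikes.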
Beyond Midjourney: AWS MediaConvert (video transcoding queue), Render queues in Blender farms, Lambda's event queue for serverless. The pattern is universal.

Pattern 3: The 90/10 Inference Split

When to use: Any ML-based product. The moment your model is in production, inference costs will dominate your budget.
The pattern: Optimize for inference cost, not training cost. Training happens once (or periodically). Inference happens on every single user request, forever. Midjourney spends 90% of compute on inference and 10% on training[6]. A 10% inference speedup saves 9% of total compute budget. A 10% training speedup saves 1%.

This is why the V8 rewrite's 5x inference speedup[17] was transformational. It didn't just make images faster — it cut the dominant cost by 80%. That's the difference between a profitable company and a VC-subsidized one.

The architectural consequence: use Flash Attention[15], multi-GPU parallelism[16], and inference-optimized model architectures (DiT[14]) even if they make training slightly harder. The ROI on inference optimization is 9x higher than training optimization.
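The 9x claim can be checked with a small model of total compute. The 90/10 split is from the text. The function name and the idea of expressing improvements as speedup factors are illustrative choices.

```python
def remaining_compute(inference_share: float, inference_speedup: float = 1.0,
                      training_speedup: float = 1.0) -> float:
    """Fraction of the original compute budget still needed after the speedups."""
    training_share = 1.0 - inference_share
    return inference_share / inference_speedup + training_share / training_speedup


saved_by_inference = 1.0 - remaining_compute(0.9, inference_speedup=5.0)
saved_by_training = 1.0 - remaining_compute(0.9, training_speedup=5.0)

print(f"5x faster inference saves {saved_by_inference:.0%} of total compute")
print(f"5x faster training saves  {saved_by_training:.0%} of total compute")
print(f"ratio: {saved_by_inference / saved_by_training:.1f}x")
```

The same speedup factor applied to inference saves nine times as much as applying it to training, which matches the 90/10 split.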

Beyond Midjourney: OpenAI spent more optimizing GPT-4 inference (KV cache, speculative decoding, quantization) than training. Google's TPU v5e was designed inference-first. The pattern scales to every ML company.

Pattern 4: The Relax Buffer

When to use: Expensive compute with utilization valleys — GPUs sitting idle at 3 AM, servers underloaded on weekdays, capacity reserved for peak that rarely peaks.
The pattern: Offer a cheaper (or free) tier that absorbs excess capacity. Give these users lower priority — they run when premium users aren't using the fleet. This transforms idle GPUs from pure cost into revenue (or at least into user acquisition).

Midjourney's Relax mode adds no per-image charge (it is included in the subscription) but runs only when Fast/Turbo GPUs have spare capacity[18]. During peak hours, Relax users wait minutes. During off-peak, they get images in seconds. Either way, capacity that would otherwise sit idle is doing useful work, which pushes fleet utilization up.
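One simple way to express "Relax runs only on spare capacity" is a two-queue scheduler with strict priority. The sketch below is an assumption about how such a policy could look, not a description of Midjourney's scheduler; `RelaxBufferScheduler` and its method names are hypothetical.

```python
from collections import deque


class RelaxBufferScheduler:
    """Strict-priority scheduler: Relax jobs run only when no premium job waits."""

    def __init__(self):
        self.premium = deque()  # Fast and Turbo jobs
        self.relax = deque()    # Relax jobs, served from leftover capacity

    def submit(self, job, tier):
        target = self.relax if tier == "relax" else self.premium
        target.append(job)

    def next_job(self):
        """Called whenever a GPU frees up; returns None if both queues are empty."""
        if self.premium:
            return self.premium.popleft()
        if self.relax:
            return self.relax.popleft()
        return None
```

During peak hours the premium queue is rarely empty, so Relax jobs wait; off-peak, `next_job()` falls through to the Relax queue almost immediately. Strict priority can starve Relax jobs indefinitely under sustained load, so a real system would likely add an aging rule or a reserved slice of capacity.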

Beyond Midjourney: AWS Spot Instances (use excess EC2 capacity at 60-90% discount), Google Preemptible VMs, airline standby tickets. The pattern: sell excess capacity cheaply rather than letting it rot.

Pattern 5: The Platform Migration Rewrite

When to use: Your framework or platform limits your hardware options, talent pool, or performance ceiling, AND the expected improvement is 3x+. Not for incremental gains.
The pattern: Rewrite from scratch when tech debt exceeds incremental improvement capacity. This is a rewrite, not a refactor: new framework, new architecture, new codebase. Keep the old system running while the new one is built. Cut over only when the new system reaches feature parity AND delivers the transformational gain.

Midjourney's V8 rewrite from JAX/TPU to PyTorch/GPU[4] took months. They shipped anyway because the gain was 5x[17]. The risk: after years of accumulated workarounds, the JAX codebase contained optimizations that were hard to replicate. The reward: a clean, modern codebase on the industry-standard framework, which is easier to hire for.

Beyond Midjourney: Twitter rewriting from Ruby on Rails to Scala/JVM (2010-2013). Facebook rewriting PHP with HHVM/Hack. Dropbox rewriting Python to Rust for the sync engine. All were "never rewrite" violations that paid off because the gain was transformational.

Pattern 6: The Petascale Consumer Product

When to use: Consumer product that requires expensive per-interaction compute — GPU rendering, ML inference, scientific simulation — where you can't eat the cost and need users to self-throttle.
The pattern: Make the compute cost visible to users. Don't hide it behind an all-you-can-eat subscription. Give users a budget (Fast hours), a way to see how much they've used, and a free-but-slow fallback (Relax). Users self-select into the right tier. Power users pay $4/hour for GPU time[18]; casual users use Relax. No one is surprised by their bill.
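A visible compute budget with a slow fallback can be modeled as a small meter. This sketch is illustrative only: `ComputeBudget`, its fields, and the per-job estimates are assumptions, not Midjourney's accounting logic.

```python
from dataclasses import dataclass


@dataclass
class ComputeBudget:
    """Tracks a user's remaining Fast GPU time, in seconds."""

    fast_seconds_left: float

    def choose_tier(self, estimated_gpu_seconds):
        """Use Fast while the budget covers the job; otherwise fall back to Relax."""
        if self.fast_seconds_left >= estimated_gpu_seconds:
            return "fast"
        return "relax"

    def charge(self, gpu_seconds_used):
        """Deduct actual GPU time after a Fast job completes (never below zero)."""
        self.fast_seconds_left = max(0.0, self.fast_seconds_left - gpu_seconds_used)

    def hours_left(self):
        """The number shown to the user so the cost stays visible."""
        return self.fast_seconds_left / 3600
```

Charging by measured GPU seconds rather than by image keeps the price aligned with the actual cost of each job, and `hours_left()` gives the user the feedback they need to throttle themselves.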

Midjourney generates $500M[10] from ~1-2.5 million daily active users[11]. Average revenue per DAU: $200-500/year. That's only possible because users understand they're buying GPU time, not "images." The mental model of compute-as-resource drives willingness to pay.

Beyond Midjourney: RunPod (GPU cloud sold by the second), Replicate (per-prediction pricing), Render farms like Sheepit (credit-based rendering). The pattern: if compute is your COGS, make it your pricing axis.

Choosing the Right Pattern

Here's a quick decision matrix. When you face a new system design problem, ask these questions:

Question | If Yes, Use This Pattern
Do I need fast distribution with zero marketing budget? | Pattern 1: Platform Launcher
Does my processing take >5 seconds per request? | Pattern 2: GPU Job Queue
Is my product ML-based and already in production? | Pattern 3: 90/10 Inference Split
Do I have expensive compute sitting idle during off-peak? | Pattern 4: Relax Buffer
Is my framework limiting my performance ceiling by 3x+? | Pattern 5: Platform Migration Rewrite
Is per-interaction compute cost too high to eat? | Pattern 6: Petascale Consumer Product
Misconception: "These patterns only apply to AI companies." Not even close. The GPU Job Queue is used by video platforms (YouTube transcoding), game studios (build farms), and financial firms (risk simulations). The Relax Buffer is how airlines sell standby seats and cloud providers sell spot instances. The Platform Launcher is how many chatbots (banking, customer service, internal tools) get distribution through Slack or Teams. These patterns are hardware-and-domain-agnostic.

References

All sources cited throughout this lesson:

  1. Google Cloud & Midjourney. “Midjourney Selects Google Cloud.” PR Newswire, March 2023. CONFIRMED
  2. David Holz quote in Google Cloud PR: “training the latest versions of our algorithms on the v4 TPUs with JAX.” CONFIRMED
  3. David Holz: inference on “huge clusters of GPUs” via Google Cloud. Same PR as [1]. CONFIRMED
  4. Midjourney V8 rewrite from JAX/TPU to PyTorch/GPU (March 2026). Analysis LIKELY
  5. David Holz: ~10,000 GPUs/servers. The Register, Interconnects CONFIRMED
  6. 90% inference / 10% training cost split. Interconnects CONFIRMED
  7. David Holz: ~5 petaops per image. The Register CONFIRMED
  8. Self-funded, profitable since mid-2022. Contrary Research CONFIRMED
  9. Team: 10 (2021) → 40 (2024) → ~192 (2026). Contrary, DemandSage CONFIRMED
  10. Revenue: $200M (2023), $500M (2025). Contrary, DemandSage CONFIRMED
  11. 20M Discord users, 1–2.5M DAU. DemandSage CONFIRMED
  12. 2.5M images generated daily. Medium LIKELY
  13. Web app launched with V6.1, August 2024. Wikipedia CONFIRMED
  14. DiT architecture confirmed by xDiT fork. Midjourney GitHub CONFIRMED
  15. Flash Attention & SageAttention. Midjourney GitHub forks CONFIRMED
  16. xDiT multi-GPU parallelism. Midjourney GitHub CONFIRMED
  17. V8: 5x faster, native 2K, <10s. Multiple sources. LIKELY
  18. Fast/Relax/Turbo modes, $4/hr GPU time. Midjourney Docs CONFIRMED
  19. Concurrency: 3–15/user, 10-job queue. Midjourney Docs CONFIRMED
  20. Dual moderation: text pre-filter + image post-filter. Multiple sources. LIKELY
  21. V6 trained from scratch over 9 months. Wikipedia CONFIRMED
  22. Training data: LAION-5B + web scraping. Contrary LIKELY
  23. 148+ TB on Discord CDN. Simon Willison CONFIRMED
  24. 20–40 jobs/sec capacity. DemandSage LIKELY
  25. $5M revenue/employee. Envzone CONFIRMED
A startup is building a product that uses GPUs to render 3D scenes. Each render takes 45 seconds. They have bursty demand (quiet at night, slammed during work hours) and their GPU fleet is 30% idle on average. Which TWO patterns would help them most?