From random flips and color jitter to Mixup, CutMix, and learned policies — every trick that turns 100 images into 100,000.
You have 100 photos of dogs and 100 photos of cats. You train a CNN. Training accuracy: 99%. Test accuracy: 55%. The model didn't learn what dogs look like — it memorized the training images.
It recognizes THIS golden retriever at THIS angle against THIS background. Show it the same dog from a different angle? Uncertain. A different golden retriever? Coin flip. A poodle? Complete failure. The model learned a lookup table, not a concept.
The fix doesn't require more data. It requires more versions of your data.
A modern CNN has millions of parameters. ResNet-50 has 25.6 million learnable weights. If you feed it 200 training images, it has 128,000 parameters per image. That's like asking someone to summarize a one-sentence tweet using a 128,000-word essay — there's so much capacity that it's easier to memorize than to generalize.
The technical term is overfitting: the model fits the training data perfectly but fails on new data because it learned noise and specifics instead of patterns and generalities.
Think of it this way. You're studying for an exam by memorizing the answer key to last year's test. You'll ace that exact test. But this year's test has different questions — and you're lost. You memorized answers instead of understanding the subject.
The simulation below makes overfitting visible. We have two classes of points (orange and teal) in 2D space. With only 10 points per class, the model has enough capacity to draw a wildly complex boundary that threads through every single training point. The boundary is perfect on training data and catastrophic on test data.
Toggle augmentation on. Now each original point has jittered copies scattered around it. The model can no longer thread a jagged boundary through exact point locations — it's forced to learn the region where each class lives. The boundary becomes smooth and generalizes.
Left: 10 points per class, no augmentation — boundary overfits. Right: same data with augmented copies — boundary smooths out. Drag the strength slider to control how far augmented copies scatter.
Notice the difference. Without augmentation, the boundary zigzags dramatically to classify every single training point. With augmentation, the cloud of points around each original forces the boundary into a smooth curve that captures the shape of each class, not the location of individual points.
Let's trace the math of how augmentation multiplies your data.
Setup: 1,000 training images. Each epoch, every image gets a random augmentation (random crop position, random flip, random color jitter). The transforms are continuous — the crop offset can be any real number, the brightness multiplier any value in [0.8, 1.2].
Epoch 1: Image #42 gets: crop at offset (3.7, 12.1), no flip, brightness ×1.07. The model sees a specific pixel pattern.
Epoch 2: Image #42 gets: crop at offset (8.2, 5.4), horizontal flip, brightness ×0.93. Different pixels entirely.
Probability of identical augmentation: Treat the crop offset as continuous over a ~30×30 pixel range, as in the setup, and the probability of picking the exact same (x, y) offset twice is essentially zero. Even if offsets snap to whole pixels (about 900 positions), the flip (2 choices) and the continuous brightness and contrast factors make an exact repeat of the full augmentation vanishingly unlikely.
| Training Setup | Images per Epoch | 100 Epochs Total | Unique Patterns Seen |
|---|---|---|---|
| No augmentation | 1,000 | 100,000 | 1,000 (same images repeated) |
| With augmentation | 1,000 | 100,000 | ~100,000 (each a unique variant) |
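The "unique patterns" column can be checked with a small sketch. It assumes the setup above (continuous crop offsets in a 30-pixel range, a fair coin for the flip, brightness in [0.8, 1.2]); the function name `sample_augmentation` is made up for this illustration and is not part of any library.

```python
import random

def sample_augmentation(rng):
    """Draw one set of augmentation parameters for a single image."""
    return (
        rng.uniform(0, 30),     # crop x offset (continuous, as in the setup)
        rng.uniform(0, 30),     # crop y offset
        rng.random() < 0.5,     # horizontal flip?
        rng.uniform(0.8, 1.2),  # brightness multiplier
    )

rng = random.Random(0)  # fixed seed so the run is repeatable
draws = [sample_augmentation(rng) for _ in range(100)]  # image #42 over 100 epochs

# Count how many of the 100 parameter sets are distinct
unique_draws = len(set(draws))
print(unique_draws, "distinct augmentations out of", len(draws))
```

With real-valued parameters a repeat would require several floating-point draws to coincide at once, which is why every epoch effectively shows a new variant.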
```python
import torchvision.transforms as T
from PIL import Image

# Load one image
img = Image.open("dog_042.jpg")

# Without augmentation: same pixels every epoch
transform_none = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
])

# With augmentation: different pixels every epoch
transform_aug = T.Compose([
    T.RandomResizedCrop(224, scale=(0.8, 1.0)),
    T.RandomHorizontalFlip(p=0.5),
    T.ColorJitter(brightness=0.2, contrast=0.2),
    T.ToTensor(),
])

# Call transform_aug(img) three times (the values in the comments are illustrative):
v1 = transform_aug(img)  # crop at random offset, maybe flipped, brightness +8%
v2 = transform_aug(img)  # different crop, not flipped, brightness -12%
v3 = transform_aug(img)  # yet another crop, flipped, brightness +3%
# Three calls, three different tensors: same dog, different pixels
```
Over the next chapters, we'll build every major augmentation technique from scratch. They fall into three families:
By the end, you'll understand exactly what torchvision.transforms does under the hood, why certain augmentations help certain tasks, and how to design an augmentation pipeline for any domain.
Let's start with the most impactful family: geometric transforms.
A cat is still a cat whether it's in the top-left corner or the bottom-right. Whether it faces left or right. Whether it's close up or far away. But a CNN trained without augmentation doesn't know this.
Remember how convolutions work. A 3×3 filter slides across the image detecting local patterns: edges, textures, curves. Because the same filter weights are reused at every position, a convolutional layer responds to cat fur whether it sits at (10, 10) or (200, 200). But its output is a feature map in which the response simply moves along with the object, and the layers that follow (strided pooling, border padding, any flattening or fully connected classifier) do depend on where those responses land. If every training photo shows the cat centered at position (112, 112), the network learns "cat features in the middle, background at the edges." Move the cat to the corner at test time, and it faces an arrangement of features it has never seen.
Geometric transforms fix this by changing WHERE pixels are: shifting, flipping, rotating, scaling, and cropping the image so the model sees every object at every possible position and orientation.
Random cropping is the single most widely used geometric augmentation. The idea is simple: resize the image slightly larger than your target size, then take a random crop of the target size.
ResNet, the architecture that won ImageNet in 2015 and is still used as a backbone today, uses a close variant of this recipe: resize each image so the shorter side is at least 256 pixels (the original paper samples the size from [256, 480]), then take a random 224×224 crop.
Why does this work? Because the crop offset changes every epoch. In epoch 1, the network sees the dog's face centered in the crop. In epoch 2, the crop captures mostly the dog's body with the face at the top edge. In epoch 3, the face is in the bottom-left corner. The network is forced to recognize "dog" regardless of where in the 224×224 window the dog appears.
Train with RandomResizedCrop(224); test with Resize(256) followed by CenterCrop(224).
With probability 0.5, flip the image left-to-right. That's it. This single operation doubles your effective dataset with zero quality cost, because mirror images are completely natural for most visual tasks.
A dog facing left is just as valid a training example as a dog facing right. A car driving right-to-left is just as real as one driving left-to-right. Horizontal flip is so effective and so safe that it's included in most image training pipelines. The exceptions are tasks where left and right carry meaning, such as reading text or digits.
Rotation applies a random rotation within a range, typically ±15°. Small rotations are natural — photos are rarely perfectly level. A slight tilt doesn't change what the object is.
But be careful with the range. Rotating a photo of a living room by 90° puts the furniture on the wall — that's not a scene a model should learn from. Rotating by 180° produces an upside-down world. For natural images, keep rotation under ±30°. For aerial or satellite images (where orientation is arbitrary), full 360° rotation is fine.
Scaling (also called zoom) randomly resizes the image within a range, like [0.8, 1.2]×. Scale < 1.0 zooms out (object gets smaller, more background visible). Scale > 1.0 zooms in (object gets larger, details more visible, edges cropped).
Let's count the variants from one 8×8 image with simple discrete transforms.
Random crop to 6×6: The crop origin (top-left corner) can be placed at row 0, 1, or 2 and column 0, 1, or 2. That's 3 × 3 = 9 possible crops. Each crop shows a slightly different 6×6 region of the original 8×8 image.
Add horizontal flip: Each of the 9 crops can be flipped or not flipped. That's 9 × 2 = 18 variants.
Add rotation (0°, 90°, 180°, 270°): Each of the 18 crop-flip combinations can be rotated 4 ways. That's 18 × 4 = 72 variants.
Add 2 scale levels (0.9×, 1.1×): 72 × 2 = 144 variants from a single image.
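The crop, flip, and rotation counts above can be verified by brute force. The sketch below is only a check of the arithmetic, not how a training pipeline works; it assumes an 8×8 image whose pixel values are all different, so that no two variants can coincide by accident. The function name `enumerate_variants` is invented for this example.

```python
import numpy as np

def enumerate_variants(img, crop=6):
    """Collect every distinct crop / flip / 90-degree-rotation variant of a square image."""
    h, w = img.shape
    seen = set()
    for top in range(h - crop + 1):          # 3 row offsets for 8 -> 6
        for left in range(w - crop + 1):     # 3 column offsets
            patch = img[top:top + crop, left:left + crop]
            for flipped in (False, True):
                view = patch[:, ::-1] if flipped else patch
                for quarter_turns in range(4):   # 0, 90, 180, 270 degrees
                    seen.add(np.rot90(view, quarter_turns).tobytes())
    return seen

img = np.arange(64).reshape(8, 8)   # 64 unique pixel values
variants = enumerate_variants(img)
print(len(variants))                # 9 crops x 2 flips x 4 rotations
print(len(variants) * 2)            # times 2 scale levels
```

Scaling is left as a multiplier because resampling at 0.9× or 1.1× produces interpolated pixels that this exact-match check is not designed to compare.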
The simulation below shows a simple pixel grid representing an image. Toggle each geometric transform and click "Augment" to apply random parameters. Click repeatedly to see the variety produced from a single original. Four augmented versions appear side by side.
Toggle transforms on/off. Click "Augment!" to generate 4 random variants of the 8×8 source grid. Each click produces different results.
Notice how crop alone produces substantially different views — the object might be centered, shifted left, or partially cut off at the right edge. Add flip and the variety doubles. Each combination creates a training example that forces the model to recognize the object regardless of position and orientation.
```python
import numpy as np
from scipy.ndimage import rotate

def random_crop(img, crop_h, crop_w):
    """Crop a random region of size (crop_h, crop_w) from img."""
    h, w = img.shape[:2]
    top = np.random.randint(0, h - crop_h + 1)    # random row offset
    left = np.random.randint(0, w - crop_w + 1)   # random col offset
    return img[top:top + crop_h, left:left + crop_w]  # simple slice

def horizontal_flip(img, p=0.5):
    """Flip left-right with probability p."""
    if np.random.random() < p:
        return img[:, ::-1]  # reverse columns
    return img

def random_rotation(img, max_angle=15):
    """Rotate by a random angle in [-max_angle, +max_angle] degrees."""
    angle = np.random.uniform(-max_angle, max_angle)
    return rotate(img, angle, reshape=False, mode='reflect')

# Stand-in image so the usage below runs; replace with a real H×W×C array
img = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)

# Usage: chain transforms
augmented = random_crop(img, 224, 224)   # crop first
augmented = horizontal_flip(augmented)   # then maybe flip
augmented = random_rotation(augmented)   # then rotate by a random small angle
```
And the standard torchvision pipeline that does the same thing:
```python
import torchvision.transforms as T

train_transform = T.Compose([
    T.RandomResizedCrop(224, scale=(0.8, 1.0)),  # random crop + scale
    T.RandomHorizontalFlip(p=0.5),               # 50% chance flip
    T.RandomRotation(degrees=15),                # ±15° rotation
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],      # ImageNet stats
                std=[0.229, 0.224, 0.225]),
])

test_transform = T.Compose([
    T.Resize(256),       # resize shorter side to 256
    T.CenterCrop(224),   # deterministic center crop
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225]),
])
```
The T.Normalize step (subtracting the ImageNet mean, dividing by the ImageNet std) always comes last, after all augmentations. This is not data augmentation; it is preprocessing. It standardizes the pixel range so the network sees consistent input scales. Augmentations modify the image; normalization standardizes the tensor.
The same dog photographed at noon, at sunset, and under fluorescent lights looks dramatically different in pixel values. The fur goes from bright golden to deep amber to washed-out yellow-green. The shadows shift, the contrast changes, the whites turn warm or cool.
If your training set contains only noon photos, the model learns that dogs have specific brightness values and color distributions. Show it a sunset photo and the pixel values are so different that the model's learned features don't fire. It's not that the model can't recognize dogs — it can't recognize dogs under different lighting.
Photometric transforms fix this by changing HOW pixels look without changing where they are: brightness, contrast, saturation, hue, blur, and noise. These simulate the infinite variety of real-world camera and lighting conditions.
Brightness adjustment is pure multiplication. Every pixel value gets multiplied by a factor. Factor > 1.0 brightens the image. Factor < 1.0 darkens it. Factor = 1.0 is unchanged.
Let's trace a single pixel through a brightness adjustment.
Pixel: [R, G, B] = [200, 100, 50]. This is a warm orange-brown — think golden retriever fur in daylight.
Brightness × 1.2: Multiply each channel by 1.2.
Result: [240, 120, 60]. Brighter, but same color ratio — still golden.
Brightness × 0.6: Simulate a shaded area.
Result: [120, 60, 30]. Much darker — same dog, just in shadow.
Contrast adjustment measures how far each pixel is from the mean and scales that distance. High contrast means pixels are spread far from the mean (vivid, punchy). Low contrast means pixels are clustered near the mean (flat, washed out).
The formula: for each pixel value v and the image mean μ:

$$v' = \mu + \text{factor} \cdot (v - \mu)$$
This is a lerp (linear interpolation) between the pixel value and the mean. Factor > 1 pushes pixels away from the mean (higher contrast). Factor < 1 pulls them toward the mean (lower contrast). Factor = 0 makes everything equal to the mean (solid gray).
Hand calculation. Pixel [200, 100, 50]. Image mean across all pixels (let's say) is μ = 117.
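Contrast × 1.3: R = (200 − 117) × 1.3 + 117 = 224.9, G = (100 − 117) × 1.3 + 117 = 94.9, B = (50 − 117) × 1.3 + 117 = 29.9. Round each to the nearest integer.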
Result: [225, 95, 30]. The bright channel (R) got brighter. The dark channels got darker. The image looks punchier, more vivid.
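Both hand calculations (brightness and contrast) can be reproduced in a few lines. This is a sketch for a single pixel; rounding to the nearest integer before the cast is an assumption made here so the results match the hand-rounded numbers, and μ = 117 is the example mean from the text.

```python
import numpy as np

pixel = np.array([200, 100, 50], dtype=np.float64)  # golden-retriever fur

def brightness(v, factor):
    """Multiply every channel by factor, round, and clip to the 8-bit range."""
    return np.clip(np.rint(v * factor), 0, 255).astype(np.uint8)

def contrast(v, factor, mean):
    """Scale each channel's distance from the mean, round, and clip."""
    return np.clip(np.rint((v - mean) * factor + mean), 0, 255).astype(np.uint8)

bright = brightness(pixel, 1.2)      # daylight, slightly overexposed
shaded = brightness(pixel, 0.6)      # same fur in shadow
punchy = contrast(pixel, 1.3, 117.0) # higher contrast around the example mean
clipped = brightness(pixel, 1.5)     # red channel would reach 300, so it clips

print(bright, shaded, punchy, clipped)
```

The last call shows why clipping matters: once a channel saturates at 255, the color ratio is no longer preserved.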
Saturation controls how vivid or muted the colors are. Saturation = 0 is grayscale. Saturation = 2 is hyper-vivid. One common implementation converts RGB to HSV (Hue-Saturation-Value), scales the S channel, and converts back; torchvision instead blends each pixel with its grayscale version, which has a similar visual effect.
Hue shifts the entire color wheel. A hue shift of +30° turns reds into oranges, oranges into yellows, and yellows into yellow-greens. This simulates different color temperatures: warm sunset light shifts hues one direction, cool fluorescent light shifts them another.
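The HSV route for saturation and hue can be sketched for a single pixel with Python's standard `colorsys` module. The helper name `adjust_hsv` and its parameters are illustrative choices, not a library API, and a real pipeline would vectorize this over the whole image.

```python
import colorsys

def adjust_hsv(rgb, sat_factor=1.0, hue_shift_deg=0.0):
    """Scale saturation and rotate hue for one 8-bit RGB pixel."""
    r, g, b = (c / 255.0 for c in rgb)
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    h = (h + hue_shift_deg / 360.0) % 1.0   # hue is a fraction of the color wheel
    s = min(1.0, s * sat_factor)            # saturation cannot exceed 1
    return tuple(round(c * 255) for c in colorsys.hsv_to_rgb(h, s, v))

gray = adjust_hsv((200, 100, 50), sat_factor=0.0)      # saturation 0: all channels equal
orange = adjust_hsv((255, 0, 0), hue_shift_deg=30.0)   # pure red pushed toward orange
print(gray, orange)
```

Note that in HSV, removing saturation leaves every channel at the brightest original channel (200 here), which is brighter than a luminance-weighted grayscale would be.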
Gaussian blur convolves the image with a Gaussian kernel, averaging each pixel with its neighbors. This simulates an out-of-focus camera, motion blur, or simply low-quality optics. Kernel sizes of 3×3 to 7×7 are typical. The effect: sharp edges become soft, fine textures become smooth. The model learns to classify objects even when details are lost.
Gaussian noise adds random values sampled from N(0, σ²) to each pixel. This simulates sensor noise in low-light conditions. High σ makes the image look grainy. The model learns to "see through" noise to find the underlying object.
Salt-and-pepper noise randomly sets pixels to either 0 (black, "pepper") or 255 (white, "salt"). This simulates dead or stuck pixels on a camera sensor, or corrupted data transmission. Even with 5% of pixels corrupted, a well-augmented model should recognize the object.
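A minimal NumPy sketch of salt-and-pepper noise follows. The function name, the `amount` parameter, and the even split between salt and pepper are assumptions made for illustration.

```python
import numpy as np

def add_salt_and_pepper(img, amount=0.05, rng=None):
    """Set roughly `amount` of the pixels to pure black or pure white."""
    rng = np.random.default_rng() if rng is None else rng
    out = img.copy()
    h, w = img.shape[:2]
    roll = rng.random((h, w))              # one draw per pixel location
    out[roll < amount / 2] = 0             # pepper: stuck-dark pixels
    out[roll > 1 - amount / 2] = 255       # salt: stuck-bright pixels
    return out

clean = np.full((64, 64, 3), 128, dtype=np.uint8)   # flat gray test image
noisy = add_salt_and_pepper(clean, amount=0.05, rng=np.random.default_rng(0))
changed_fraction = (noisy != clean).any(axis=2).mean()
print(f"{changed_fraction:.3f} of pixels corrupted")
```

Drawing one random number per location (not per channel) corrupts all three channels together, which matches the dead-pixel interpretation.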
Drag the sliders below to apply each photometric transform to a colorful pixel grid. Watch the actual RGB values change as you adjust brightness, contrast, saturation, and noise level.
Drag each slider to adjust the corresponding transform. The grid updates in real time. RGB values for the selected pixel appear below.
```python
import numpy as np
from scipy.ndimage import gaussian_filter

def adjust_brightness(img, factor):
    """Multiply all pixel values by factor. Clip to [0, 255]."""
    return np.clip(img * factor, 0, 255).astype(np.uint8)

def adjust_contrast(img, factor):
    """Scale pixel distances from mean. factor=1 is unchanged."""
    mean = img.mean()  # global mean across all channels
    return np.clip((img - mean) * factor + mean, 0, 255).astype(np.uint8)

def add_gaussian_noise(img, sigma=25):
    """Add N(0, sigma^2) noise to each pixel. Clip to [0, 255]."""
    noise = np.random.normal(0, sigma, img.shape)
    return np.clip(img + noise, 0, 255).astype(np.uint8)

def gaussian_blur(img, kernel_size=5):
    """Blur with a Gaussian kernel. Larger kernel = more blur."""
    sigma = (kernel_size - 1) / 6.0  # heuristic: kernel spans about ±3 sigma
    # Blur rows and columns only; sigma 0 on the channel axis keeps colors separate
    sigmas = (sigma, sigma, 0) if img.ndim == 3 else sigma
    return gaussian_filter(img, sigma=sigmas)

# The standard torchvision one-liner:
import torchvision.transforms as T

jitter = T.ColorJitter(
    brightness=0.2,   # ±20% brightness
    contrast=0.2,     # ±20% contrast
    saturation=0.2,   # ±20% saturation
    hue=0.1,          # shift of up to ±10% of the hue wheel
)
```
Dropout randomly zeros neurons. Weight decay penalizes large weights. Data augmentation randomly perturbs inputs. All three are regularization — they prevent the model from fitting noise in the training data. But augmentation is unique: it doesn't change the model architecture or the loss function. It changes the data.
This chapter reveals a deep and beautiful connection: augmenting your input with random noise is mathematically equivalent to training on a smoothed version of the loss function. This isn't a metaphor — it's a theorem. And it explains exactly why augmentation improves generalization.
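Without augmentation, the model minimizes the loss at each exact training point:

$$L(\theta) = \frac{1}{N}\sum_{i=1}^{N} \ell\big(f_\theta(x_i),\, y_i\big)$$

Here $f_\theta$ is the model, $\ell$ the per-example loss, and $N$ the number of training examples. The model can achieve L = 0 by memorizing each $(x_i, y_i)$ pair: fitting a function that passes exactly through every training point, no matter how jagged that function becomes.

With augmentation, the model sees $x_i + \varepsilon$ instead of $x_i$, where ε is a random perturbation (the augmentation). Now the effective loss becomes:

$$L_{\text{aug}}(\theta) = \frac{1}{N}\sum_{i=1}^{N} \mathbb{E}_{\varepsilon}\Big[\ell\big(f_\theta(x_i + \varepsilon),\, y_i\big)\Big]$$

This is the expected loss over all perturbations. The model can't just get the answer right at the exact point $x_i$. It has to get the answer right at $x_i + \varepsilon$ for every possible ε. That means it has to get the answer right in the entire neighborhood around $x_i$.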
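For intuition about why this expectation acts as a smoother, here is a sketch under a strong simplifying assumption that goes beyond the text: ε is small isotropic Gaussian noise, ε ~ N(0, σ²I). Writing g(x) for the per-example loss viewed as a function of the input (a name introduced only for this sketch), a second-order Taylor expansion gives:

$$
\mathbb{E}_{\varepsilon}\big[g(x_i + \varepsilon)\big]
\;\approx\; g(x_i) + \mathbb{E}[\varepsilon]^{\top}\nabla g(x_i) + \tfrac{1}{2}\,\mathbb{E}\big[\varepsilon^{\top}\nabla^{2} g(x_i)\,\varepsilon\big]
\;=\; g(x_i) + \frac{\sigma^{2}}{2}\,\operatorname{tr}\nabla^{2} g(x_i)
$$

The first-order term vanishes because the noise has zero mean, and the remaining term penalizes curvature of the loss around each training point. Flips and crops are not small Gaussian noise, so for those transforms this is an analogy, not a derivation.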
Let's make this concrete with a 1D curve-fitting example.
Setup: 5 training points: (-2, 4), (-1, 1), (0, 0), (1, 1), (2, 4). These lie on the parabola y = x². We want the model to learn this parabola.
Without augmentation: A degree-4 polynomial has 5 coefficients (a₄x⁴ + a₃x³ + a₂x² + a₁x + a₀). Five coefficients, five points: exactly one such polynomial passes through all 5 points, achieving training loss = 0. With perfectly clean points that polynomial is y = x² itself, but real measurements are noisy. Nudge a single label, say (1, 1) to (1, 1.4), and the interpolant still hits every point exactly. To do so it needs nonzero x³ and x⁴ terms and bends away from the parabola between and beyond the points. It has fit the noise.
With augmentation: Each point gets 10 jittered copies. Point (1, 1) spawns (0.9, 0.81), (1.1, 1.21), (1.05, 1.10), etc. Now we have 50 points, all roughly following y = x². A degree-4 polynomial has only 5 degrees of freedom, so it can no longer chase one noisy point while ignoring its neighbors. It must find the best smooth curve through the cloud, which is y ≈ x².
The augmented points didn't add new information about the function. They added constraints that prevent overfitting. More constraints, fewer solutions, smoother fit.
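The curve-fitting example can be tried with `np.polyfit`. This sketch adds one assumption of its own: the label at x = 1 is perturbed by 0.4 to stand in for measurement noise, since perfectly clean points would make the degree-4 fit coincide with the parabola. The jittered copies follow y = x², as in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = x ** 2
y[3] += 0.4   # hypothetical measurement noise on the point at x = 1

# Five points, five coefficients: the degree-4 fit interpolates, noise included
raw = np.polyfit(x, y, deg=4)

# Ten jittered copies per point, placed on the parabola as in the text
x_aug = (x[:, None] + rng.uniform(-0.2, 0.2, size=(5, 10))).ravel()
y_aug = x_aug ** 2
x_all = np.concatenate([x, x_aug])
y_all = np.concatenate([y, y_aug])
aug = np.polyfit(x_all, y_all, deg=4)   # least squares: 55 points, 5 coefficients

raw_residual = np.sum((np.polyval(raw, x) - y) ** 2)
aug_residual = np.sum((np.polyval(aug, x_all) - y_all) ** 2)

print("raw fit at x=1:", np.polyval(raw, 1.0), " coefficients:", np.round(raw, 3))
print("aug fit at x=1:", np.polyval(aug, 1.0), " coefficients:", np.round(aug, 3))
print("training residuals:", raw_residual, aug_residual)
```

The raw fit reproduces the noisy value at x = 1 exactly, while the augmented fit accepts a nonzero training residual and is pulled back toward the true value of 1.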
The smoothed loss landscape has another property: it favors flat minima over sharp minima.
A sharp minimum is one where the loss drops steeply at the exact parameter values θ* but rises quickly if you perturb θ slightly. A flat minimum is one where the loss is low for a broad region of parameter space around θ*.
Why does this matter? Because training and test data come from the same distribution but are not identical. If your model sits in a sharp minimum, the tiny distributional shift between train and test moves the effective parameters off the cliff — test loss is much higher than training loss. In a flat minimum, the same shift barely matters — the model is robust to small changes.
Augmentation penalizes sharp minima because the random perturbations ε are equivalent to slightly perturbing the input at every step. If the model is at a sharp minimum, these perturbations cause large loss spikes, pushing the model toward flatter regions where perturbations don't hurt.
Modern training uses all three regularization strategies simultaneously. They're complementary because each constrains a different thing:
| Regularizer | What It Constrains | Mechanism | Effect |
|---|---|---|---|
| Weight Decay | Parameter magnitudes | Add λ‖θ‖² to loss | Prevents large weights → smoother functions |
| Dropout | Internal representations | Randomly zero hidden units | Prevents co-adaptation → redundant features |
| Augmentation | Input sensitivity | Randomly perturb inputs | Prevents memorization → invariant features |
ResNet-50 on ImageNet uses two of the three: weight decay = 1e-4 and aggressive augmentation (random crop + flip + color jitter). Dropout is not used; ResNets rely on batch normalization instead. Remove the augmentation and test accuracy drops by roughly 2-4 percentage points. Remove weight decay and training becomes unstable. Each regularizer does work that the others can't.
The simulation below shows a 2D classification task trained with and without augmentation. On the left, no augmentation — the model overfits (training loss drops to zero, test loss stays high, boundary is jagged). On the right, with augmentation — training loss is higher (the model can't memorize anymore) but test loss is much lower (it generalizes). Drag the slider to control augmentation strength.
Left: without augmentation (overfitting). Right: with augmentation (generalizing). Drag the strength slider to see the train/test gap close. Click "Train" to run 200 steps.
Watch the key pattern. As augmentation strength increases from 0 to 1: the training loss goes UP (the model can't achieve perfect training accuracy anymore — every epoch shows different pixel patterns). But the test loss goes DOWN (the model generalizes better). The gap between training and test loss — the generalization gap — shrinks. That gap IS overfitting, and augmentation closes it.
```python
import torch
import torchvision
import torchvision.transforms as T

# Experiment: CIFAR-10 with and without augmentation

# NO augmentation: just convert to tensor and normalize
transform_none = T.Compose([
    T.ToTensor(),
    T.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])

# WITH augmentation: the standard CIFAR recipe
transform_aug = T.Compose([
    T.RandomCrop(32, padding=4),   # pad 4px, random crop back to 32
    T.RandomHorizontalFlip(),      # 50% chance
    T.ToTensor(),
    T.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])

# Train identical ResNet-18 models with each transform
# After 200 epochs:
#   No augmentation:   train_acc=99.9%, test_acc=91.2%  (gap: 8.7%)
#   With augmentation: train_acc=97.1%, test_acc=95.0%  (gap: 2.1%)
# Augmentation LOWERS train accuracy but RAISES test accuracy
# The generalization gap shrinks from 8.7% to 2.1%
```
Notice the numbers. Without augmentation, the model gets 99.9% on training data (nearly perfect memorization) but only 91.2% on test data (poor generalization). With just two augmentations — random crop and horizontal flip — the model gives up some training accuracy (97.1%) but gains nearly 4% on test accuracy (95.0%). The generalization gap drops from 8.7% to 2.1%.
That's the regularization effect in a single number. The model is worse at memorizing and better at understanding.
You've been manually picking augmentations — flip with p=0.5, rotate ±15°, jitter brightness ±20%. But how do you know those are the right settings? What if the optimal rotation is ±8° and brightness should be ±35%? You're guessing. And with 14+ possible transforms, each with its own magnitude and probability, the search space is enormous.
Three papers attacked this problem in sequence. AutoAugment searched for the answer with reinforcement learning. RandAugment said "just randomize everything." TrivialAugment said "pick one random transform per image." And the simplest approach won.
Google's AutoAugment (2018) treated augmentation policy design as a search problem. The idea: define a search space of augmentation operations, then use reinforcement learning (PPO) to find the combination that maximizes validation accuracy on a proxy task.
The search space is staggering. A "policy" consists of 5 sub-policies, each containing 2 operations. Each operation is a triple: (transform type, probability, magnitude). With 16 transform types, 11 probability levels (0.0 to 1.0 in steps of 0.1), and 10 magnitude levels, each operation has 16 × 11 × 10 = 1,760 possibilities. A sub-policy has two operations: 1,760² ≈ 3.1 million combinations. Five sub-policies: (3.1M)⁵ ≈ 2.9 × 10³² possible policies. The search cannot be exhaustive; it must be guided.
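The search-space arithmetic is easy to confirm with Python's arbitrary-precision integers. The variable names below are only labels for the quantities in the paragraph above.

```python
# Choices for a single operation: (transform type, probability, magnitude)
transform_types = 16
probability_levels = 11   # 0.0, 0.1, ..., 1.0
magnitude_levels = 10

per_operation = transform_types * probability_levels * magnitude_levels
per_sub_policy = per_operation ** 2    # two operations per sub-policy
per_policy = per_sub_policy ** 5       # five sub-policies per policy

print(per_operation)
print(per_sub_policy)
print(f"{float(per_policy):.2e}")
```

Even at a million policy evaluations per second, enumerating a space of this size would take far longer than the age of the universe, which is why the search has to be guided.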
AutoAugment used a recurrent neural network (the "controller") trained with PPO to propose policies. The controller outputs a sequence of operations, those operations are applied to training data, a child model is trained, and its validation accuracy becomes the reward signal. After thousands of trials, the controller converges on a task-specific optimal policy.
The fatal flaw: the search cost thousands of GPU hours. For CIFAR-10, AutoAugment required 15,000 child model trainings. Each child trains for a fixed number of epochs, and the controller needs thousands of reward signals to converge. This is prohibitively expensive for most teams.
RandAugment (Cubuk et al., 2020) asked a heretical question: what if the search is unnecessary? What if random selection with a shared magnitude works just as well?
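The algorithm is almost insultingly simple. For each training image: pick N transforms uniformly at random from the list, then apply them one after another, all at the same magnitude M.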
That's it. Two hyperparameters: N (how many transforms per image, typically 2–3) and M (how strong, typically 9–14). No controller network. No child model training. No search. Just a grid search over N and M — maybe 20 combinations total.
Let's trace RandAugment with N=2, M=10 on a single image.
Step 1: Random select from 14 transforms → "rotate" is chosen. Magnitude 10 out of 30 maps to rotation angle: 10/30 × 30° = 10°. The image rotates by 10° (counterclockwise under PIL's convention for positive angles).
Step 2: Random select again → "posterize" is chosen. Magnitude 10 out of 30 maps to bits per channel: round(8 − 10/30 × 4) = round(8 − 1.33) = 7 bits per channel (reducing from 8-bit to 7-bit color depth — mild posterization).
The augmented image is rotated 10° and slightly posterized. The next image in the batch gets a completely different random pair — maybe "shearX" at M=10 followed by "brightness" at M=10. Over an epoch, the model sees enormous variety despite the simple algorithm.
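The two magnitude mappings in this trace can be written as small functions. The names `rotate_degrees` and `posterize_bits`, the 30° ceiling, and the 4-bit maximum reduction mirror the numbers used in the trace; other implementations choose different ranges.

```python
MAX_MAG = 30  # magnitude scale used in the trace above

def rotate_degrees(m, max_mag=MAX_MAG, max_angle=30.0):
    """Map a magnitude to a rotation angle in degrees."""
    return m / max_mag * max_angle

def posterize_bits(m, max_mag=MAX_MAG, max_bits_removed=4):
    """Map a magnitude to the number of bits kept per channel."""
    return round(8 - m / max_mag * max_bits_removed)

angle = rotate_degrees(10)
bits = posterize_bits(10)
print(angle, bits)

# The extremes of the scale: no change at M = 0, strongest effect at M = 30
print(rotate_degrees(0), posterize_bits(0))
print(rotate_degrees(30), posterize_bits(30))
```

Sharing one magnitude scale across all transforms is what lets a single number M control the strength of the whole policy.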
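TrivialAugment (Müller & Hutter, 2021) pushed simplicity even further. For each training image: pick one transform uniformly at random, pick a magnitude uniformly at random, and apply that single transform.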
No N. No M. Zero hyperparameters. And it slightly beats RandAugment on average across benchmarks. The lesson is profound: the diversity from random selection provides sufficient regularization. Over the course of training, even though each individual image gets just one perturbation, the model collectively sees thousands of different transforms at varying strengths.
| Method | Year | Hyperparameters | Search Cost | ImageNet Top-1 |
|---|---|---|---|---|
| AutoAugment | 2018 | ~30 (policy params) | 15,000 GPU-hours | 77.6% |
| RandAugment | 2020 | 2 (N, M) | ~0 (grid search) | 77.6% |
| TrivialAugment | 2021 | 0 | 0 | 77.8% |
Read that table carefully. The method with zero hyperparameters and zero search cost matches or beats the method that took 15,000 GPU-hours to find its policy. The entire field of learned augmentation spent years discovering that random is good enough.
Click "Augment" to apply each policy to the same sample image. AutoAugment uses a fixed learned policy (same 5 transforms every time). RandAugment randomly picks N=2 transforms at magnitude M. TrivialAugment picks ONE random transform at random magnitude. Click repeatedly to see the variety each method produces.
Notice the pattern after clicking "Augment" 10+ times for each method. AutoAugment always applies the same 5 learned sub-policies, cycling through them. The augmented images look similar after a while — you can predict what's coming. RandAugment produces more variety because N=2 transforms are randomly selected each time, but the magnitude is fixed. TrivialAugment produces the most variety: one transform at a random strength each time, so you get everything from barely-perceptible brightness shifts to heavy rotations.
```python
import random
from PIL import Image, ImageOps, ImageEnhance

def rand_augment(img, N=2, M=10, max_mag=30):
    """Apply N random transforms at magnitude M."""
    # The 14 standard transforms; each entry is (name, fn(image, magnitude))
    transforms = [
        ("identity",     lambda im, m: im),
        ("autoContrast", lambda im, m: ImageOps.autocontrast(im)),
        ("equalize",     lambda im, m: ImageOps.equalize(im)),
        ("rotate",       lambda im, m: im.rotate(m / max_mag * 30)),
        ("solarize",     lambda im, m: ImageOps.solarize(im, 256 - int(m / max_mag * 256))),
        ("posterize",    lambda im, m: ImageOps.posterize(im, max(1, 8 - int(m / max_mag * 4)))),
        ("contrast",     lambda im, m: ImageEnhance.Contrast(im).enhance(1 + m / max_mag)),
        ("brightness",   lambda im, m: ImageEnhance.Brightness(im).enhance(1 + m / max_mag)),
        ("sharpness",    lambda im, m: ImageEnhance.Sharpness(im).enhance(1 + m / max_mag)),
        ("shearX",       lambda im, m: im.transform(im.size, Image.AFFINE, (1, m / max_mag * 0.3, 0, 0, 1, 0))),
        ("shearY",       lambda im, m: im.transform(im.size, Image.AFFINE, (1, 0, 0, m / max_mag * 0.3, 1, 0))),
        ("translateX",   lambda im, m: im.transform(im.size, Image.AFFINE, (1, 0, m / max_mag * im.size[0] * 0.3, 0, 1, 0))),
        ("translateY",   lambda im, m: im.transform(im.size, Image.AFFINE, (1, 0, 0, 0, 1, m / max_mag * im.size[1] * 0.3))),
        ("color",        lambda im, m: ImageEnhance.Color(im).enhance(1 + m / max_mag)),
    ]
    # Sample with replacement, so the same transform can be picked twice
    chosen = random.choices(transforms, k=N)
    for name, fn in chosen:
        img = fn(img, M)
    return img
```
```python
# TrivialAugment is even simpler
def trivial_augment(img, max_mag=30):
    """Apply ONE random transform at random magnitude."""
    transforms = [...]  # same list as above
    name, fn = random.choice(transforms)
    M = random.randint(0, max_mag)  # random magnitude!
    return fn(img, M)
```
```python
# Using torchvision's built-in (recommended for production)
from torchvision import transforms

pipeline = transforms.Compose([
    transforms.RandAugment(num_ops=2, magnitude=9),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# TrivialAugment: torchvision.transforms.TrivialAugmentWide()
pipeline_trivial = transforms.Compose([
    transforms.TrivialAugmentWide(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```
What does a 70% cat, 30% dog look like? That's a strange question — in nature, an image is either a cat or a dog, never a blend. But Mixup says: blend the pixel values. CutMix says: paste a rectangular patch of the dog onto the cat. Both create images that don't exist in the real world — and that's precisely the point.
These methods force the model to learn gradations of confidence rather than binary decisions. Instead of "this is definitely a cat," the model must learn "this is mostly a cat but partially a dog." The result is smoother decision boundaries, better calibration, and stronger generalization.
Mixup (Zhang et al., 2018) blends two training examples linearly, both their inputs and their labels.
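Step 1: Sample a mixing coefficient λ from a Beta(α, α) distribution. The parameter α controls how much mixing happens: a small α (such as 0.2) pushes λ toward 0 or 1, so most mixed images are dominated by one source; α = 1 makes λ uniform on [0, 1]; a large α concentrates λ near 0.5, an even blend.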
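Step 2: Create the mixed image and mixed label:

$$\tilde{x} = \lambda\, x_A + (1 - \lambda)\, x_B, \qquad \tilde{y} = \lambda\, y_A + (1 - \lambda)\, y_B$$

Where $x_A$, $x_B$ are two training images and $y_A$, $y_B$ are their one-hot label vectors.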
Let's trace Mixup with λ = 0.7. Image A is a cat (class 0), Image B is a dog (class 1). We have 3 classes total (cat, dog, bird).
Pixel blending. Take a single pixel at position (3, 3):
Mixed pixel:
Result pixel: [155, 135, 130] — a muted, washed-out blend of both images.
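Label blending. One-hot labels: $y_A$ = [1, 0, 0] (cat), $y_B$ = [0, 1, 0] (dog).
Mixed label: 0.7 · [1, 0, 0] + 0.3 · [0, 1, 0] = [0.7, 0.3, 0.0].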
The label is no longer a hard class — it's a distribution. The model must output 70% cat confidence, 30% dog confidence, 0% bird confidence. This is fundamentally different from standard training where the target is always [1, 0, 0] or [0, 1, 0].
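The trace can be replayed in NumPy. The two source pixel values below are hypothetical: they were picked for this sketch so that the blend reproduces the result pixel quoted above, and are not taken from the original example.

```python
import numpy as np

lam = 0.7  # mixing coefficient from the trace

# One-hot labels for the three classes (cat, dog, bird)
y_cat = np.array([1.0, 0.0, 0.0])
y_dog = np.array([0.0, 1.0, 0.0])
y_mix = lam * y_cat + (1 - lam) * y_dog

# Hypothetical RGB values at position (3, 3) in each image
pixel_cat = np.array([200.0, 150.0, 100.0])
pixel_dog = np.array([50.0, 100.0, 200.0])
pixel_mix = lam * pixel_cat + (1 - lam) * pixel_dog

print("mixed label:", y_mix)
print("mixed pixel:", pixel_mix)
```

Because the same λ weights both the pixels and the labels, the soft label always sums to 1 and stays a valid probability distribution.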
CutMix (Yun et al., 2019) replaces the linear pixel blend with a spatial one: cut a rectangular region from image B and paste it onto image A. The label is mixed proportionally to the visible area.
Step 1: Sample λ from Beta(α, α), same as Mixup.
Step 2: Generate a random rectangle whose area is (1 − λ) of the total image area. If the image is W×H pixels, the cut region has area = (1 − λ) × W × H. The rectangle's center is uniformly random; its width and height are: $r_w$ = W × √(1 − λ), $r_h$ = H × √(1 − λ).
Step 3: Paste that rectangle from image B onto image A. Everything outside the rectangle stays as image A.
Step 4: The label is λ · yA + (1 − λ) · yB — proportional to visible area.
Image size: 32×32 pixels. λ = 0.7. The cut area should be (1 − 0.7) = 0.3 of total area = 0.3 × 1024 = 307 pixels.
Rectangle dimensions: rw = 32 × √0.3 = 32 × 0.548 ≈ 17.5 → round to 18 pixels wide. rh = 32 × √0.3 ≈ 18 pixels tall. Actual area: 18 × 18 = 324 pixels (close to target of 307).
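Random center: say (20, 16). The cut rectangle spans x=[11, 29], y=[7, 25]. Everything inside that 18×18 box comes from image B (the dog). Everything outside stays as image A (the cat). The nominal label is [0.7, 0.3, 0.0], the same weighting as Mixup, but the visual effect is totally different. Because rounding made the box 324 pixels rather than 307, the label that matches the visible area is closer to [0.684, 0.316, 0.0]; implementations usually recompute λ from the actual box.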
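The box arithmetic in this example can be checked with a small helper. This is a minimal sketch: the function name `cutmix_box` is made up here, and it rounds the side lengths to the nearest pixel to match the worked numbers (other implementations may truncate instead).

```python
import math

def cutmix_box(width, height, lam, cx, cy):
    """Return the clipped cut rectangle, its area, and lambda recomputed from that area."""
    rw = round(width * math.sqrt(1 - lam))   # rectangle width
    rh = round(height * math.sqrt(1 - lam))  # rectangle height
    x1, x2 = max(0, cx - rw // 2), min(width, cx + rw // 2)
    y1, y2 = max(0, cy - rh // 2), min(height, cy + rh // 2)
    area = (x2 - x1) * (y2 - y1)
    lam_actual = 1 - area / (width * height)
    return (x1, x2, y1, y2), area, lam_actual

box, area, lam_actual = cutmix_box(32, 32, lam=0.7, cx=20, cy=16)
print(box, area, round(lam_actual, 3))  # (11, 29, 7, 25) 324 0.684
```

If the random center falls near an edge, clipping shrinks the box further, which is another reason to derive the label weights from the pasted area rather than from the sampled λ.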
The deeper insight is that Mixup and CutMix don't just add data — they change the loss function. With hard labels [1, 0, 0], the cross-entropy loss pushes the model toward infinite confidence: the optimal output under cross-entropy is a logit of positive infinity for the correct class. The model is rewarded for being maximally overconfident.
With soft labels [0.7, 0.3, 0.0], the loss has a finite optimum: it is minimized when the predicted distribution equals the target. That pushes the model toward calibrated uncertainty, where confidence scores track accuracy. Ideally, a Mixup-trained model that says "80% cat" is right about 80% of the time. A standard-trained model tends to be overconfident: when it says "95% cat" it might be right only 80% of the time, so its confidence tells you little.
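The difference between the two targets can be seen numerically. The sketch below uses made-up logits of the form [z, 0, 0] and grows z. With the hard label the loss keeps falling toward 0, while with the soft label it eventually rises again. The helper name `cross_entropy` and the choice of z values are illustrative assumptions.

```python
import numpy as np

def cross_entropy(target, logits):
    """Cross-entropy between a target distribution and softmax(logits)."""
    shifted = logits - logits.max()                      # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())  # log-softmax
    return float(-(target * log_probs).sum())

hard = np.array([1.0, 0.0, 0.0])
soft = np.array([0.7, 0.3, 0.0])

zs = (1.0, 5.0, 10.0, 20.0)
hard_losses = [cross_entropy(hard, np.array([z, 0.0, 0.0])) for z in zs]
soft_losses = [cross_entropy(soft, np.array([z, 0.0, 0.0])) for z in zs]
for z, h, s in zip(zs, hard_losses, soft_losses):
    print(f"z={z:>4}: hard={h:.4f}  soft={s:.4f}")

# For the soft target the minimum is reached when the prediction equals the target.
# A tiny probability stands in for the 0% bird entry to keep the logit finite.
matched_logits = np.log(np.array([0.7, 0.3, 1e-12]))
soft_minimum = cross_entropy(soft, matched_logits)
target_entropy = float(-(0.7 * np.log(0.7) + 0.3 * np.log(0.3)))
print(soft_minimum, target_entropy)
```

The soft-label minimum equals the entropy of the target distribution, so no amount of extra confidence can drive the loss lower.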
This is called calibration, and it matters enormously in safety-critical applications. A self-driving car's classifier shouldn't say "99.9% pedestrian" when it's actually only 70% sure.
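One common way to put a number on calibration is the expected calibration error (ECE): bin predictions by confidence and average the gap between confidence and accuracy in each bin. The sketch below is a minimal version under that definition. The toy data are made up for illustration and do not come from any real model.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy within confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(confidences[in_bin].mean() - correct[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by the fraction of samples in the bin
    return float(ece)

# Toy data (assumed): 10 predictions, 8 of them correct
correct = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
calibrated = [0.8] * 10       # says 80%, is right 80% of the time
overconfident = [0.99] * 10   # says 99%, is right 80% of the time

ece_calibrated = expected_calibration_error(calibrated, correct)
ece_overconfident = expected_calibration_error(overconfident, correct)
print(ece_calibrated, ece_overconfident)
```

A perfectly calibrated model scores 0. The overconfident toy model scores 0.19, the gap between its 99% claim and its 80% accuracy.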
Two colored pattern grids represent images from two classes. Drag the λ slider to control the mixing ratio. Toggle between Mixup (pixel blend) and CutMix (rectangular patch). Watch the blended label change in real time.
```python
import numpy as np
import torch


def mixup(x, y, alpha=0.2):
    """Mixup two batches of images and labels.

    x: (B, C, H, W) tensor of images
    y: (B, num_classes) one-hot labels
    Returns mixed images and soft labels.
    """
    # One lambda is drawn per batch
    lam = float(np.random.beta(alpha, alpha))
    # Shuffle indices to pair each image with a random partner
    idx = torch.randperm(x.size(0), device=x.device)
    x_mix = lam * x + (1 - lam) * x[idx]
    y_mix = lam * y + (1 - lam) * y[idx]
    return x_mix, y_mix


def cutmix(x, y, alpha=1.0):
    """CutMix: paste rectangular patch from one image onto another."""
    lam = float(np.random.beta(alpha, alpha))
    B, C, H, W = x.shape
    idx = torch.randperm(B, device=x.device)
    # Random rectangle with area ratio = (1 - lam)
    cut_w = int(W * np.sqrt(1 - lam))
    cut_h = int(H * np.sqrt(1 - lam))
    cx = np.random.randint(W)  # random center
    cy = np.random.randint(H)
    # Clip to image boundary
    x1 = max(0, cx - cut_w // 2)
    x2 = min(W, cx + cut_w // 2)
    y1 = max(0, cy - cut_h // 2)
    y2 = min(H, cy + cut_h // 2)
    # Paste the patch (the same box is used for every image in the batch)
    x_mix = x.clone()
    x_mix[:, :, y1:y2, x1:x2] = x[idx, :, y1:y2, x1:x2]
    # Adjust lambda to actual clipped area
    lam_actual = 1 - (x2 - x1) * (y2 - y1) / (W * H)
    y_mix = lam_actual * y + (1 - lam_actual) * y[idx]
    return x_mix, y_mix
```
```python
# Integration into training loop
# (model, optimizer, train_loader, num_classes, mixup and cutmix are defined elsewhere)
for images, labels in train_loader:
    # Convert to one-hot for soft label mixing
    labels_onehot = torch.nn.functional.one_hot(labels, num_classes).float()

    # Apply CutMix with 50% probability, Mixup otherwise
    if np.random.random() < 0.5:
        images, labels_soft = cutmix(images, labels_onehot)
    else:
        images, labels_soft = mixup(images, labels_onehot)

    optimizer.zero_grad()
    outputs = model(images)
    # Soft-label cross-entropy written out explicitly.
    # (PyTorch 1.10+ also accepts class probabilities as the target of F.cross_entropy.)
    loss = -torch.sum(labels_soft * torch.log_softmax(outputs, dim=1), dim=1).mean()
    loss.backward()
    optimizer.step()
```
Time to put it all together. You've learned geometric transforms, photometric transforms, regularization effects, learned policies, and mixing methods. Now you're the engineer: build an augmentation pipeline, apply it to data, and watch how it affects training.
This is your augmentation workbench. Toggle transforms on and off. Adjust the overall strength. Change the dataset size. Then hit "Train!" and watch two training curves unfold: one with your augmentation pipeline, one without. Your goal: close the gap between training and validation loss — that gap is overfitting, and augmentation is your weapon against it.
Toggle transforms to build your pipeline. Adjust strength and dataset size. Click "Apply Pipeline" to see augmented samples, then "Train!" to simulate training curves with vs. without your augmentation. Watch the overfitting gap change.
Things to try:
Everything so far was about images. Flip an image horizontally and it's still a cat. But text is different — "The movie was great" and "The film was excellent" mean the same thing, yet changing one word can reverse the meaning entirely. And there's one more powerful technique we haven't covered: augmenting at test time.
Text augmentation is fundamentally harder than image augmentation because language is discrete and semantic. You can rotate an image 5° and the label doesn't change. But delete the word "not" from "I do not like this movie" and you've reversed the sentiment. Every text augmentation must be label-preserving, and verifying that is much harder for text than for images.
Despite this difficulty, several effective methods exist:
Translate the text to a foreign language, then translate it back. The round-trip produces a natural paraphrase that preserves meaning but changes wording.
The translation model replaces "mat" with "carpet" — a natural synonym substitution that a rule-based system might miss. Different target languages produce different paraphrases: German might yield "The cat sat upon the rug." Back-translation produces the most natural augmentations because the translation model has learned grammar and semantics, but it requires a translation model (or API), which adds cost and latency.
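The round trip itself is only a composition of two translation calls. The sketch below keeps the translators as injected callables so it runs offline. The lookup-table "translators", the French sentence, and the function names are stand-ins invented for illustration, and a real pipeline would call a translation model or API in their place.

```python
from typing import Callable, List, Tuple

Translator = Callable[[str], str]

def back_translate(text: str, to_pivot: Translator, from_pivot: Translator) -> str:
    """Translate text into a pivot language and back to obtain a paraphrase."""
    return from_pivot(to_pivot(text))

def augment_with_pivots(text: str, pivots: List[Tuple[Translator, Translator]]) -> List[str]:
    """Return the distinct paraphrases produced by several (to_pivot, from_pivot) pairs."""
    seen = {text}
    paraphrases = []
    for to_pivot, from_pivot in pivots:
        candidate = back_translate(text, to_pivot, from_pivot)
        if candidate not in seen:  # drop round trips that changed nothing
            seen.add(candidate)
            paraphrases.append(candidate)
    return paraphrases

# Stand-in translators backed by lookup tables (assumed data, not real model output)
EN_TO_FR = {"The cat sat on the mat": "Le chat s'est assis sur le tapis"}
FR_TO_EN = {"Le chat s'est assis sur le tapis": "The cat sat on the carpet"}

def fake_en_fr(s: str) -> str:
    return EN_TO_FR.get(s, s)

def fake_fr_en(s: str) -> str:
    return FR_TO_EN.get(s, s)

def identity(s: str) -> str:
    return s

paraphrases = augment_with_pivots(
    "The cat sat on the mat",
    [(fake_en_fr, fake_fr_en), (identity, identity)],
)
print(paraphrases)  # ['The cat sat on the carpet']
```

Filtering out unchanged round trips matters in practice, because a strong translation model often returns the original sentence verbatim.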
EDA (Wei & Zou, 2019) is the "RandAugment of text" — four simple operations applied with small probabilities:
| Operation | What it does | Example |
|---|---|---|
| Synonym Replacement | Replace n random words with synonyms | "The happy dog ran quickly" → "The joyful dog ran rapidly" |
| Random Insertion | Insert a random synonym of a random word at a random position | "The happy dog ran" → "The happy glad dog ran" |
| Random Swap | Swap two random words | "I love this movie" → "I movie this love" |
| Random Deletion | Delete each word with probability p | "The happy dog ran quickly" → "The dog ran" |
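EDA scales the amount of perturbation with sentence length: for synonym replacement, random insertion, and random swap it changes n = α · l words, where l is the sentence length and α is the fraction of words to alter; random deletion uses p = α. Longer sentences therefore absorb more edits, while short ones get very few (they're more fragile: deleting 1 word from a 5-word sentence is much more destructive than deleting 1 from a 50-word sentence).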
On small datasets (500 training examples), EDA improves classification accuracy by 2–3%. On larger datasets, the improvement shrinks, similar to image augmentation.
Here's a technique that works for images, text, and almost any modality: Test-Time Augmentation (TTA). The idea is deceptively simple: at inference time, create N augmented versions of the input, run each through the model, and average the N prediction vectors.
Why would this help? Each augmentation slightly changes the input, and the model's prediction might wobble. Some augmentations push the prediction toward the correct class; others push it away. By averaging, the correct signal reinforces while the noise cancels out. It's the same principle as ensemble methods, but using a single model with data perturbation instead of multiple models.
We have a trained image classifier and one test image. We create 3 augmented versions:
Average prediction:
The final prediction is [0.683, 0.317] — still cat, and more robust than any individual prediction. The color-shifted version's uncertainty (0.55 vs 0.45) was diluted by the other two more confident predictions.
Now imagine the model was borderline on a different image. Original: [0.48, 0.52] (slight dog lean). Flipped: [0.55, 0.45] (slight cat lean). Cropped: [0.52, 0.48] (slight cat lean). Average: [0.517, 0.483] — the aggregated vote tips toward cat, correcting the original prediction that would have been wrong.
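The borderline case is easy to verify. The sketch below averages the three prediction vectors quoted above and compares the single-view decision with the averaged one.

```python
import numpy as np

# Predictions [P(cat), P(dog)] for the borderline image, taken from the example above
preds = np.array([
    [0.48, 0.52],  # original
    [0.55, 0.45],  # flipped
    [0.52, 0.48],  # cropped
])
classes = ["cat", "dog"]

avg = preds.mean(axis=0)
single_view = classes[int(preds[0].argmax())]  # decision from the original image only
tta_vote = classes[int(avg.argmax())]          # decision after averaging all views

print(np.round(avg, 3), single_view, tta_vote)  # [0.517 0.483] dog cat
```

Averaging probabilities is the simplest aggregation rule; whether it helps on a given image depends on the errors of the individual views not all pointing the same way.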
| Domain | TTA Strategy | Typical N | Accuracy Gain |
|---|---|---|---|
| Image Classification | Flip + 4 corner crops + center crop | 10 | +1–2% |
| Object Detection | Multi-scale (0.5×, 1×, 1.5×) + flip | 6 | +1–3% mAP |
| Medical Imaging | 8 rotations (0°, 45°, ..., 315°) + flip | 16 | +2–5% |
| Text Classification | Back-translation to 3 languages + original | 4 | +0.5–1% |
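The image-classification row (four corner crops plus a center crop, each mirrored, for N = 10) can be sketched without any deep learning framework. The NumPy version below is a minimal illustration: `ten_crop`, `tta_ten_crop`, and the brightness-based `toy_predict` model are hypothetical names and stand-ins, and torchvision ships a `TenCrop` transform for the same purpose.

```python
import numpy as np

def ten_crop(image, size):
    """Four corner crops + center crop, each with its horizontal mirror (10 views).

    image: array of shape (H, W, C); size: side length of the square crops.
    """
    h, w = image.shape[:2]
    if size > h or size > w:
        raise ValueError("crop size must not exceed the image size")
    top, left = (h - size) // 2, (w - size) // 2
    corners = [(0, 0), (0, w - size), (h - size, 0), (h - size, w - size), (top, left)]
    crops = [image[r:r + size, c:c + size] for r, c in corners]
    flipped = [crop[:, ::-1] for crop in crops]  # mirror along the width axis
    return np.stack(crops + flipped)

def tta_ten_crop(predict_fn, image, size):
    """Average predict_fn's class probabilities over the 10 views."""
    views = ten_crop(image, size)
    probs = np.stack([predict_fn(v) for v in views])
    return probs.mean(axis=0)

def toy_predict(view):
    """Hypothetical stand-in model: mean brightness decides cat vs. dog."""
    p_cat = float(view.mean())
    return np.array([p_cat, 1.0 - p_cat])

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))  # random "image" with values in [0, 1)
views = ten_crop(image, 24)
avg_probs = tta_ten_crop(toy_predict, image, size=24)
print(views.shape, avg_probs)
```

Because every view is deterministic, this form of TTA gives the same answer on every run, which makes evaluation reproducible at the price of ten forward passes per image.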
Top: Type a sentence and click augmentation buttons to see each text augmentation method. Bottom: A sample "image" gets augmented N ways, each runs through a simulated model, and predictions are averaged. Watch how TTA stabilizes the final prediction.
```python
import random

from nltk.corpus import wordnet  # requires a one-time nltk.download('wordnet')


def get_synonyms(word):
    """Get synonyms from WordNet."""
    syns = set()
    for ss in wordnet.synsets(word):
        for lemma in ss.lemmas():
            name = lemma.name().replace('_', ' ')
            if name.lower() != word.lower():
                syns.add(name)
    return list(syns)


def synonym_replacement(words, n=1):
    """Replace n random words with synonyms."""
    new_words = words.copy()
    # Deduplicate so a repeated word is not counted twice
    candidates = list({w for w in words if get_synonyms(w)})
    random.shuffle(candidates)
    for word in candidates[:n]:
        synonym = random.choice(get_synonyms(word))
        new_words = [synonym if w == word else w for w in new_words]
    return new_words


def random_deletion(words, p=0.1):
    """Delete each word with probability p."""
    if len(words) == 1:
        return words
    remaining = [w for w in words if random.random() > p]
    return remaining if remaining else [random.choice(words)]


def random_insertion(words, n=1):
    """Insert n random synonyms at random positions."""
    new_words = words.copy()
    # Only words that have synonyms can donate one; this avoids an endless retry loop
    donors = [w for w in words if get_synonyms(w)]
    if not donors:
        return new_words
    for _ in range(n):
        synonym = random.choice(get_synonyms(random.choice(donors)))
        new_words.insert(random.randint(0, len(new_words)), synonym)
    return new_words


def random_swap(words, n=1):
    """Swap two random positions, n times."""
    new_words = words.copy()
    if len(new_words) < 2:  # nothing to swap
        return new_words
    for _ in range(n):
        i, j = random.sample(range(len(new_words)), 2)
        new_words[i], new_words[j] = new_words[j], new_words[i]
    return new_words


def eda(sentence, alpha=0.1, n_aug=4):
    """Apply one randomly chosen EDA operation per copy, return n_aug augmented sentences."""
    words = sentence.split()
    n = max(1, int(alpha * len(words)))  # number of words to change scales with length
    augmented = []
    for _ in range(n_aug):
        op = random.choice(['sr', 'ri', 'rs', 'rd'])
        if op == 'sr':
            aug = synonym_replacement(words, n)
        elif op == 'ri':
            aug = random_insertion(words, n)
        elif op == 'rs':
            aug = random_swap(words, n)
        else:
            aug = random_deletion(words, alpha)
        augmented.append(' '.join(aug))
    return augmented
```
```python
# Test-Time Augmentation inference loop
import torch
import torchvision.transforms as T


def tta_predict(model, image, n_augments=5):
    """Average predictions over N augmented versions.

    image: (C, H, W) float tensor, assumed here to be 224x224 with values in [0, 1]
    (ColorJitter expects unnormalized inputs, so normalize inside the model if needed).
    """
    augments = [
        T.Compose([]),                  # original
        T.RandomHorizontalFlip(p=1.0),  # always flip
        T.RandomCrop(224, padding=16),  # random crop
        T.ColorJitter(brightness=0.2),  # brightness shift
        T.RandomRotation(10),           # slight rotation
    ]
    preds = []
    model.eval()
    with torch.no_grad():
        for aug in augments[:n_augments]:
            augmented = aug(image)
            logits = model(augmented.unsqueeze(0))
            probs = torch.softmax(logits, dim=1)
            preds.append(probs)
    # Average all probability vectors
    avg_pred = torch.stack(preds).mean(dim=0)
    return avg_pred  # shape: (1, num_classes)
```
Let’s race them all.
You’ve seen six strategies: no augmentation (the baseline), Geometric (crop+flip), Photometric (color jitter), RandAugment, Mixup, and CutMix. Each has strengths and sweet spots, but reading about them is one thing. Watching them compete in real time is another.
This simulation trains six identical networks on the same classification task, differing only in augmentation strategy. Drag the sliders to create the conditions where each method shines or fails. You’ll discover that no single augmentation dominates everywhere — the right choice depends on your dataset size, domain, and training budget.
Six strategies train simultaneously. Each runs its own augmentation pipeline. Find each method’s sweet spot and failure mode.
Experiment 1: Tiny dataset (100 images). Without augmentation, the model memorizes everything — training accuracy hits 99%, validation plateaus at 55%. Geometric transforms (crop + flip) close half the gap. RandAugment closes more. CutMix or Mixup close the most because they regularize in label space, not just pixel space.
Experiment 2: Large dataset (5000 images). The gap between strategies shrinks dramatically. With enough data, even plain training achieves 85%+ validation accuracy. Augmentation still helps, but the marginal gain is smaller. This confirms the rule: augmentation helps most when data is scarce.
Experiment 3: Heavy strength. Set strength to Heavy. Some methods degrade — 90° rotations and extreme color shifts create unrealistic training examples. RandAugment with heavy magnitude starts producing images that don’t belong to any class. The model can’t learn from noise.
Experiment 4: Mixup vs CutMix. On classification tasks, CutMix typically edges out Mixup because it preserves local image statistics (the model sees sharp patches, not ghostly overlays). But Mixup tends to produce better-calibrated confidence scores.
| Strategy | Best for | Fails when | Typical gain |
|---|---|---|---|
| None | Large datasets (>50k) | Small datasets | Baseline |
| Geometric | All image tasks | Text/documents (flip breaks them) | +2–4% |
| Photometric | Varying lighting conditions | When color is discriminative | +1–2% |
| RandAugment | General purpose | Heavy magnitude + small images | +2–4% |
| Mixup | Calibration-critical tasks | Object detection (ghostly blends) | +1–3% |
| CutMix | Classification + detection | Fine-grained tasks (cut removes discriminative regions) | +2–4% |
You now understand the complete data augmentation toolkit — from basic flips to learned policies to mixing methods. This chapter is your practical reference. No new concepts. Just the recipes, the decision guide, and the connections to where you go next.
| Method | What it changes | Key parameter | When to use |
|---|---|---|---|
| Random Crop | Position | Crop size, scale range | Always (the workhorse) |
| Horizontal Flip | Orientation | p=0.5 | Always (unless text/directional) |
| Rotation | Angle | ±15° typical | Natural images, medical, aerial |
| Color Jitter | Brightness, contrast, saturation, hue | ±0.2 each | Varying lighting conditions |
| Gaussian Noise | Pixel values | σ=25 typical (on a 0–255 scale) | Robustness to sensor noise |
| Gaussian Blur | Sharpness | kernel 3–7 | Robustness to focus/resolution |
| RandAugment | Random N from 14 ops | N=2, M=9–14 | General purpose (replaces manual) |
| TrivialAugment | Random 1 op at random M | None! | Zero-hyperparameter default |
| Mixup | Pixel blend + soft labels | α=0.2 | Calibration, smooth boundaries |
| CutMix | Rectangular patch + soft labels | α=1.0 | Classification + detection |
Follow the path that matches your situation:
```python
import torchvision.transforms as T

# Recipe 1: ImageNet ResNet (the classic)
train_classic = T.Compose([
    T.RandomResizedCrop(224, scale=(0.08, 1.0)),
    T.RandomHorizontalFlip(),
    T.ColorJitter(0.4, 0.4, 0.4),
    T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

# Recipe 2: Modern default (RandAugment + CutMix in training loop)
train_modern = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(),
    T.RandAugment(num_ops=2, magnitude=9),
    T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
# + CutMix/Mixup applied in the training loop on batches

# Recipe 3: Zero-effort (TrivialAugment)
train_trivial = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(),
    T.TrivialAugmentWide(),
    T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

# Test transform (deterministic; no random augmentation outside deliberate TTA)
test_transform = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
```
Data augmentation doesn’t exist in isolation. Here’s where to go next:
| Paper | Year | Contribution |
|---|---|---|
| Krizhevsky et al., “ImageNet Classification with Deep Convolutional Neural Networks” | 2012 | Popularized random crops, horizontal flips, and PCA-based color augmentation for training |
| Cubuk et al., “AutoAugment” | 2018 | RL-based search for augmentation policies |
| Zhang et al., “Mixup: Beyond Empirical Risk Minimization” | 2018 | Linear interpolation of images AND labels |
| Yun et al., “CutMix” | 2019 | Rectangular patch mixing for spatial regularization |
| Wei & Zou, “EDA: Easy Data Augmentation” | 2019 | Four simple text augmentation operations |
| Cubuk et al., “RandAugment” | 2020 | Two hyperparameters replace the entire search |
| Müller & Hutter, “TrivialAugment” | 2021 | Zero hyperparameters, matches or beats AutoAugment |