PyTorch Tensors — From NumPy to GPU Memory Mastery

Chapter 0: Why Tensors?

You want to run Llama-2 7B. The model has 7 billion parameters. Each parameter is a float32 number — 4 bytes. That's 7,000,000,000 × 4 = 28 GB just for the weights. Your GPU has 24 GB of VRAM. The model literally does not fit.

But wait — if you store each parameter as a float16 (2 bytes), that's 14 GB. int8 (1 byte)? 7 GB. Suddenly it fits with room for activations. This isn't a theoretical exercise. Understanding tensor memory at the byte level is the difference between "my model runs" and "CUDA out of memory."

A neural network is just matrix multiplications and element-wise functions. But those matrices live in RAM. GPUs have their own separate memory. And the precision you choose (float32, float16, int8) determines both memory usage AND numerical accuracy. Managing this IS the engineering challenge of deep learning.

The core idea: A tensor is a multi-dimensional array with metadata — its shape, its data type (how many bytes per element), its device (CPU or GPU), and its memory layout (how elements are arranged in RAM). Master these four properties and you master PyTorch.

Memory Budget Calculator

[Interactive figure: shows how dtype choice determines whether a model fits in GPU memory. The red line is your GPU's VRAM limit. Controls: Parameters (7B), GPU VRAM (24 GB).]

The visualization makes it visceral: dtype choice is life or death for your training run. A 70B model in float32 needs 280 GB, far more than a 24 GB consumer card or even an 80 GB A100 can hold. In int4? 35 GB. The weights alone now fit on a single A100.

Key insight: When ML engineers say "quantization," they mean reducing the bytes-per-parameter. When they say "mixed precision," they mean using float16 for most ops and float32 only where needed. Both are tensor-level decisions.
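The arithmetic behind these numbers is easy to script. The sketch below is plain Python with no PyTorch dependency. The helper names `weight_memory_gb` and `fits` are invented for illustration, it counts weights only (no activations, gradients, or optimizer state), and it uses decimal gigabytes (1 GB = 10^9 bytes) to match the figures above.

```python
# Bytes per parameter for the dtypes discussed above (int4 packs two values per byte).
BYTES_PER_PARAM = {"float32": 4.0, "float16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params, dtype):
    """Memory for the weights alone, in decimal GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def fits(num_params, dtype, vram_gb):
    """True if the weights alone fit within the given VRAM budget."""
    return weight_memory_gb(num_params, dtype) <= vram_gb


for dtype in BYTES_PER_PARAM:
    gb = weight_memory_gb(7e9, dtype)
    print(f"7B in {dtype}: {gb:.1f} GB, fits in 24 GB: {fits(7e9, dtype, 24)}")
```

A real budget needs headroom on top of this figure, so treat a "fits" result as necessary rather than sufficient.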

A model has 1 billion parameters stored as float32. How much memory does it use?

1 GB
4 GB
8 GB
32 GB

Chapter 1: Tensor Basics — Shape, Dtype, Device

A tensor is a multi-dimensional array with three essential properties: its shape (how many elements along each dimension), its dtype (how each element is stored in memory), and its device (which hardware holds the data). Every operation in PyTorch flows through these three facts.

Let's create our first tensor and interrogate its properties:

python
import torch

# Create a 2x3 tensor of zeros
x = torch.zeros(2, 3)
print(x.shape)   # torch.Size([2, 3])
print(x.dtype)   # torch.float32
print(x.device)  # cpu

# Memory usage: 2 * 3 * 4 bytes = 24 bytes
print(x.element_size())  # 4 (bytes per float32)
print(x.nelement())      # 6 (total elements)
print(x.element_size() * x.nelement())  # 24 bytes

Let's break this down. A shape of [2, 3] means 2 rows and 3 columns — 6 elements total. Each element is float32 (the default), which takes 4 bytes. So the entire tensor occupies exactly 2 × 3 × 4 = 24 bytes of contiguous memory.

Think of it this way: Shape tells you how many elements. Dtype tells you how big each element is. Multiply them and you get total memory. Device tells you where that memory lives.

Here are the most common creation functions:

python
# Common creation functions
a = torch.zeros(3, 4)           # all zeros
b = torch.ones(3, 4)            # all ones
c = torch.randn(3, 4)           # normal distribution N(0,1)
d = torch.rand(3, 4)            # uniform [0, 1)
e = torch.arange(0, 12)         # [0, 1, 2, ..., 11]
f = torch.tensor([[1,2],[3,4]]) # from Python list
g = torch.empty(3, 4)           # uninitialized (garbage values!)

# From numpy (SHARES memory by default!)
import numpy as np
arr = np.array([1.0, 2.0, 3.0])
t = torch.from_numpy(arr)  # shared memory: modify one, both change

Danger: torch.from_numpy() creates a tensor that SHARES memory with the numpy array. If you modify the numpy array, the tensor changes too. Use torch.tensor(arr) for an independent copy.

Worked example: how much memory does a batch of 32 RGB images at 224×224 resolution use?

Shape: [32, 3, 224, 224]
Elements: 32 × 3 × 224 × 224 = 4,816,896
Memory (float32): 4,816,896 × 4 bytes = 19,267,584 bytes ≈ 19.3 MB (18.4 MiB)
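The same calculation generalizes to any shape and element size. This is a small stdlib-only sketch; `tensor_bytes` is a hypothetical helper, not a PyTorch function, and it assumes dense storage with no padding.

```python
import math


def tensor_bytes(shape, bytes_per_element):
    """Bytes needed to store a dense tensor of the given shape."""
    return math.prod(shape) * bytes_per_element


batch = (32, 3, 224, 224)  # the image batch from the worked example
for name, size in [("float32", 4), ("float16", 2), ("uint8", 1)]:
    n = tensor_bytes(batch, size)
    print(f"{name}: {n:,} bytes = {n / 2**20:.1f} MiB")
```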

Tensor Size Calculator

[Interactive figure: enter dimensions and see memory usage for different dtypes. Each colored bar shows bytes consumed. Controls: Dim 0 (2), Dim 1 (3), Dim 2 (1).]

A tensor has shape [4, 256, 256] and dtype float16. How many bytes does it consume?

262,144 bytes
524,288 bytes
1,048,576 bytes
2,097,152 bytes

Chapter 2: Memory Layout — Strides Explain Everything

Your computer's memory is one-dimensional — a long tape of bytes at consecutive addresses. But tensors are multi-dimensional. How does a 2D matrix live in a 1D tape? The answer is strides.

A stride tells you how many elements to skip in memory to move one step along each dimension. For a [3, 4] tensor stored row-by-row (the default, called row-major or C-contiguous order), the strides are (4, 1). That means: to move to the next row, skip 4 elements. To move to the next column, skip 1 element.

python
import torch

x = torch.arange(12).reshape(3, 4)
print(x)
# tensor([[ 0,  1,  2,  3],
#         [ 4,  5,  6,  7],
#         [ 8,  9, 10, 11]])

print(x.stride())  # (4, 1)
# stride[0]=4: jump 4 elements to go from row i to row i+1
# stride[1]=1: jump 1 element to go from col j to col j+1

# Element at [i, j] lives at memory offset: i*stride[0] + j*stride[1]
# x[2, 1] is at offset 2*4 + 1*1 = 9 --> value is 9 ✓

The stride formula: For any index [i, j, k, ...], the memory offset is i×stride[0] + j×stride[1] + k×stride[2] + ... This single formula governs ALL of PyTorch's memory access patterns.
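The formula can be checked without PyTorch. The following pure-Python sketch derives row-major strides from a shape and applies the offset formula; `contiguous_strides` and `memory_offset` are illustrative names, and strides are counted in elements (as PyTorch does), not bytes.

```python
def contiguous_strides(shape):
    """Row-major strides (in elements) for a dense tensor of this shape."""
    strides = []
    step = 1
    # The last dimension moves 1 element; each earlier one moves a full
    # block made of all the dimensions after it.
    for size in reversed(shape):
        strides.append(step)
        step *= size
    return tuple(reversed(strides))


def memory_offset(index, strides):
    """Offset (in elements) of `index` from the start of the storage."""
    return sum(i * s for i, s in zip(index, strides))


strides = contiguous_strides((3, 4))
print(strides)                         # (4, 1)
print(memory_offset((2, 1), strides))  # 9
print(contiguous_strides((2, 3, 4)))   # (12, 4, 1)
```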

Column-major (Fortran-contiguous) is the opposite: elements in the same column are contiguous. For a [3, 4] tensor stored column-major, strides would be (1, 3) — jump 1 to move down a row, jump 3 to move right a column. PyTorch defaults to row-major. NumPy lets you choose.

Here's the key insight: transposing a tensor just swaps the strides. No data is copied.

python
x = torch.arange(12).reshape(3, 4)
print(x.stride())    # (4, 1) -- row-major

y = x.t()            # transpose
print(y.shape)       # torch.Size([4, 3])
print(y.stride())    # (1, 4) -- strides swapped!
print(y.is_contiguous())  # False

# Same underlying data, different view
print(x.data_ptr() == y.data_ptr())  # True -- same memory!

Interactive Stride Explorer

[Interactive figure: click any cell in the 2D grid to see which memory address it maps to. Change strides to see how indexing changes. Orange = selected cell. Teal = memory position. Controls: Stride[0] (row, 4), Stride[1] (col, 1).]

Why strides matter: CPUs and GPUs load data in cache lines (64 bytes typically). If your access pattern jumps randomly through memory (non-contiguous), you get cache misses. Contiguous access = fast. Non-contiguous = slow. This is why .contiguous() exists — it rearranges data for fast sequential access.

A tensor has shape [3, 4] and strides (4, 1). What memory offset does element [1, 2] map to?

5
3
6
8

Chapter 3: Views vs Copies — Shared Memory Traps

When you reshape, transpose, or slice a tensor in PyTorch, does it copy the data? Usually no. Most operations create a view — a new tensor object that points to the SAME underlying memory with different metadata (shape, strides, offset). This is fast (O(1) instead of O(n)) but dangerous: modifying the view modifies the original.
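To see why a view is cheap and why writes leak through it, here is a toy model in plain Python. `ToyTensor` is a made-up class for illustration only: it mimics the idea of "shared storage plus metadata" for 2-D tensors and is not how PyTorch is implemented internally.

```python
class ToyTensor:
    """A 2-D 'tensor': a flat Python list plus shape, strides, and offset."""

    def __init__(self, storage, shape, strides, offset=0):
        self.storage = storage  # shared between a tensor and its views
        self.shape = shape
        self.strides = strides
        self.offset = offset

    def _position(self, i, j):
        return self.offset + i * self.strides[0] + j * self.strides[1]

    def get(self, i, j):
        return self.storage[self._position(i, j)]

    def set(self, i, j, value):
        self.storage[self._position(i, j)] = value

    def t(self):
        # View: reuse the same storage, swap shape and strides. O(1).
        return ToyTensor(self.storage, self.shape[::-1], self.strides[::-1], self.offset)

    def clone(self):
        # Copy: read every element into a new row-major list. O(n).
        rows, cols = self.shape
        data = [self.get(i, j) for i in range(rows) for j in range(cols)]
        return ToyTensor(data, (rows, cols), (cols, 1))


x = ToyTensor(list(range(12)), (3, 4), (4, 1))
y = x.t()      # view: shares x's storage
z = x.clone()  # copy: owns its storage

y.set(1, 2, 999)    # write through the view
print(x.get(2, 1))  # 999: the original changed
print(z.get(2, 1))  # 9: the copy did not
```

The only difference between `t()` and `clone()` is whether a new list is allocated, which is the O(1) versus O(n) distinction in miniature.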

Operations that create views (shared memory):

python
x = torch.arange(12).reshape(3, 4)

# ALL of these share memory with x:
a = x.view(6, 2)       # reshape (must be contiguous)
b = x.reshape(6, 2)    # reshape (may copy if non-contiguous)
c = x.t()              # transpose
d = x[0]               # indexing a row
e = x[:, 1]           # indexing a column
f = x.expand(5,3,4)   # broadcast expand (stride=0!)
g = x.permute(1,0)    # dimension reorder
h = x.unsqueeze(0)    # add dimension

# Proof: modify the view, original changes
a[0, 0] = 999
print(x[0, 0])  # 999 -- x changed too!

Operations that create copies (independent memory):

python
# These allocate NEW memory:
a = x.clone()          # explicit deep copy
b = x.contiguous()     # copy only if non-contiguous
c = x + 0              # any arithmetic creates new tensor
d = x.to(torch.float16)  # dtype cast = new allocation
e = x.detach().clone()  # preferred over torch.tensor(x), which copies but warns

The view/copy trap: .reshape() returns a view when possible but silently copies when it can't (non-contiguous input). .view() is strict — it raises an error if a view isn't possible. Use .view() when you NEED shared memory, .reshape() when you don't care.

The key question: when is a tensor non-contiguous? After any operation that changes strides without moving data:

python
x = torch.arange(12).reshape(3, 4)
print(x.is_contiguous())  # True -- strides (4,1), data is sequential

y = x.t()  # transpose: shape [4,3], strides (1,4)
print(y.is_contiguous())  # False!

# Why? For y to be contiguous with shape [4,3], strides must be (3,1)
# But y has strides (1,4) -- elements aren't sequential in memory

# Fix it:
z = y.contiguous()  # copies data into new sequential memory
print(z.stride())   # (3, 1) -- now contiguous
print(z.data_ptr() == x.data_ptr())  # False -- different memory

View vs Copy Detector

Watch how operations affect memory sharing. Teal boxes share memory (view). Orange boxes have their own memory (copy). Click operations to apply them.

Performance insight: Views are O(1) — they just create new metadata. Copies are O(n) — they allocate and fill new memory. A model with billions of parameters can't afford unnecessary copies. This is why PyTorch engineers obsess over views.

You transpose a tensor and then call .view(). What happens?

It works fine and returns a view
It silently makes a copy
It raises a RuntimeError because the tensor is non-contiguous
It transposes back automatically

Chapter 4: Broadcasting — Implicit Shape Alignment

You have a [3, 4] matrix and want to add a [4] vector to each row. In raw code, you'd write a loop. But PyTorch does it automatically through broadcasting — the rules for stretching tensors to compatible shapes for element-wise operations.

The broadcasting rules are simple but must be followed exactly:

Step 1

Right-align the shapes. Pad the shorter shape with 1s on the left.

↓

Step 2

For each dimension: sizes must match OR one of them must be 1.

↓

Step 3

The dimension of size 1 is "stretched" to match the other.

python
import torch

# Example 1: [3,4] + [4] --> works!
A = torch.ones(3, 4)   # shape [3, 4]
b = torch.arange(4)    # shape [4]
# Step 1: pad b to [1, 4]
# Step 2: dim 0: 3 vs 1 ✓ (1 stretches)  dim 1: 4 vs 4 ✓ (match)
# Result: [3, 4]
result = A + b  # b is added to each row

# Example 2: [3,1] + [1,4] --> [3,4]!
col = torch.tensor([[1],[2],[3]])  # shape [3, 1]
row = torch.tensor([[10,20,30,40]])  # shape [1, 4]
# dim 0: 3 vs 1 ✓   dim 1: 1 vs 4 ✓
result = col + row  # outer sum! shape [3, 4]
# [[11,21,31,41],
#  [12,22,32,42],
#  [13,23,33,43]]

# Example 3: [3,4] + [3] --> ERROR!
# Step 1: pad to [1, 3]
# Step 2: dim 1: 4 vs 3 -- neither is 1! FAIL

Common broadcasting mistake: Trying to add a [3] vector to a [3, 4] matrix expecting it to broadcast along columns. It doesn't — PyTorch right-aligns, so [3] becomes [1, 3] and fails against dim 1=4. You need to reshape to [3, 1] first: vec.unsqueeze(1) or vec[:, None].
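The three rules fit in a few lines of plain Python. This sketch computes only the result shape (no data is touched); `broadcast_shape` is an illustrative helper written for this lesson, not a PyTorch function.

```python
from itertools import zip_longest


def broadcast_shape(a, b):
    """Return the broadcast result shape, or raise ValueError if incompatible."""
    result = []
    # Walk both shapes from the right; a missing dimension counts as size 1.
    for x, y in zip_longest(reversed(a), reversed(b), fillvalue=1):
        if x == y or y == 1:
            result.append(x)
        elif x == 1:
            result.append(y)
        else:
            raise ValueError(f"cannot broadcast {a} with {b}: {x} vs {y}")
    return tuple(reversed(result))


print(broadcast_shape((3, 4), (4,)))    # (3, 4)
print(broadcast_shape((3, 1), (1, 4)))  # (3, 4)
print(broadcast_shape((3, 4), (3, 1)))  # (3, 4): the unsqueeze(1) fix
try:
    broadcast_shape((3, 4), (3,))
except ValueError as err:
    print("error:", err)
```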

Broadcasting is lazy — it never actually copies data. Under the hood, it uses stride=0 for stretched dimensions. A [3, 1] tensor broadcast to [3, 4] has strides (1, 0) — moving along the stretched dimension doesn't advance in memory. Zero memory cost!
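The stride-0 trick can be modeled in plain Python to show that the "stretched" tensor never stores more than the original elements. `expand_strides` and `materialize` are hypothetical helpers for 2-D shapes. The sketch assumes both shapes already have the same number of dimensions.

```python
def expand_strides(shape, strides, target_shape):
    """Strides after expanding `shape` to `target_shape` (same number of dims)."""
    out = []
    for size, stride, target in zip(shape, strides, target_shape):
        if size == target:
            out.append(stride)  # dimension unchanged
        elif size == 1:
            out.append(0)       # stretched: every step lands on the same element
        else:
            raise ValueError(f"cannot expand size {size} to {target}")
    return tuple(out)


def materialize(storage, shape, strides):
    """Read a 2-D view out of flat storage using the offset formula."""
    return [
        [storage[i * strides[0] + j * strides[1]] for j in range(shape[1])]
        for i in range(shape[0])
    ]


storage = [1, 2, 3]  # the only data that exists
strides = expand_strides((3, 1), (1, 1), (3, 4))
print(strides)                                # (1, 0)
print(materialize(storage, (3, 4), strides))  # 3 rows of 4 repeated values
print(len(storage))                           # still 3 elements
```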

python
x = torch.tensor([[1],[2],[3]])  # shape [3,1], strides (1,1)
y = x.expand(3, 4)              # shape [3,4], strides (1,0)!
print(y.stride())  # (1, 0) -- stride 0 means "repeat this element"
print(y)
# tensor([[1, 1, 1, 1],
#         [2, 2, 2, 2],
#         [3, 3, 3, 3]])
# Only 3 elements in memory, but LOOKS like 12!

Broadcasting Visualizer

Watch how two tensors align and stretch for an element-wise add. Orange = tensor A. Teal = tensor B. Purple = result.

Real-world usage: Broadcasting is everywhere in deep learning. Batch normalization subtracts a per-channel mean, reshaped from [C] to [1, C, 1, 1], from a [B, C, H, W] tensor (a bare [C] vector would right-align against W, not C). Attention masking adds a [1, 1, seq, seq] mask to a [B, heads, seq, seq] score. Without broadcasting, you'd need explicit expand() calls or, worse, loops.

Can you add a tensor of shape [2, 3] to a tensor of shape [3, 2]?

Yes, the result is [3, 3]
No: after right-aligning, dim 0 is 2 vs 3 (neither is 1) and dim 1 is 3 vs 2 (neither is 1)
Yes, PyTorch transposes one automatically

Chapter 5: Dtypes & Precision — Every Bit Counts

A dtype (data type) determines how each number is encoded in binary. More bits = more precision = more memory. The four dtypes you'll encounter in deep learning are:

Dtype	Bits	Bytes	Range	Precision	Use Case
float32	32	4	±3.4×10³⁸	~7 decimal digits	Default training
float16	16	2	±65,504	~3 decimal digits	Mixed precision fwd
bfloat16	16	2	±3.4×10³⁸	~2 decimal digits	Training (Google TPU)
int8	8	1	-128 to 127	Exact integers	Quantized inference

Key insight (bfloat16 vs float16): float16 has 5 exponent bits and 10 mantissa bits. bfloat16 has 8 exponent bits and 7 mantissa bits. Same size, but bfloat16 covers the same RANGE as float32, so it doesn't overflow just past 65,504 the way float16 does. It just has less precision. For training, range usually matters more than precision, because gradients can be very large or very small.

For normal values, the floating-point format is: (-1)^sign × 2^(exponent − bias) × (1 + mantissa), where the mantissa is a fraction in [0, 1). More exponent bits = bigger range. More mantissa bits = finer granularity between representable numbers.

python
import torch

# Dtype casting
x = torch.randn(3, 3)                # float32 by default
x16 = x.half()                        # float16
xbf = x.bfloat16()                    # bfloat16
x8 = x.to(torch.int8)                 # int8 (truncates!)

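# float16 overflow danger:
big = torch.tensor(70000.0)
print(big.half())      # tensor(inf, dtype=torch.float16) -- overflow! 70000 > 65504
print(big.bfloat16())  # tensor(70144., dtype=torch.bfloat16) -- in range, but rounded to a multiple of 512

# Mixed precision training pattern (sketch: model, inputs, target,
# criterion, and optimizer are assumed to be defined elsewhere, on the GPU):
scaler = torch.cuda.amp.GradScaler()   # rescales the loss to protect small gradients
model = model.float()                  # master weights stay in float32
with torch.autocast(device_type="cuda", dtype=torch.float16):
    # autocast runs matmul-like ops in float16; no manual .half() is needed
    output = model(inputs)
    loss = criterion(output, target)
# Scaling the loss keeps small float16 gradients from underflowing to zero
scaler.scale(loss).backward()
scaler.step(optimizer)   # unscales gradients, then updates the float32 weights
scaler.update()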

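The precision trap: Adding a small number to a much larger one in float16 can lose the small one entirely. Near 1.0, adjacent float16 values are about 0.001 apart, so 1.0 + 0.0001 rounds straight back to 1.0, and a weight update of that size silently vanishes. The same rounding bites long sums: once the running total is large, each new small term contributes nothing. This is why mixed-precision training keeps master weights (and optimizer state) in float32 and uses reduced precision only for the bulk of the forward and backward compute.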
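These rounding effects can be reproduced with only the standard library. The sketch below uses `struct`'s half-precision format for float16 and emulates bfloat16 by rounding the float32 bit pattern to its top 16 bits. The helper names are invented, the bfloat16 emulation is an assumption about round-to-nearest-even behavior, and NaN inputs are not handled.

```python
import math
import struct


def to_float16(x):
    """Round a Python float to the nearest float16 value (inf on overflow)."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


def to_bfloat16(x):
    """Round to bfloat16 by keeping the top 16 bits of the float32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest (ties to even), then clear the low 16 bits.
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


print(to_float16(70000.0))   # inf: beyond float16's range
print(to_bfloat16(70000.0))  # 70144.0: in range, but coarse
print(to_float16(1.0001))    # 1.0: the small increment is rounded away
print(to_float16(0.1), to_bfloat16(0.1))
```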

Precision Comparison: Sine Wave in Different Dtypes

[Interactive figure: the same sine wave stored at different precisions. Watch where float16 clips (amplitude > 65504) and where the int8 staircase appears. Controls: Amplitude (100), Frequency (3).]

Notice: at amplitude=100, all dtypes look the same. At amplitude=70000, float16 clips to infinity while bfloat16 tracks correctly (with visible stairstepping from low precision). At any amplitude, int8 looks like a staircase because it can only represent 256 distinct values.

Why is bfloat16 preferred over float16 for training?

It has the same exponent range as float32, so gradients don't overflow
It's faster to compute with
It uses less memory than float16
It's more precise than float16

Chapter 6: GPU Tensors — The Transfer Bottleneck

Your CPU has RAM (system memory). Your GPU has VRAM (video memory). These are physically separate pools of memory connected by a PCIe bus (or NVLink on high-end systems). Moving data between them is one of the most common performance bottlenecks in deep learning.

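A PCIe 4.0 x16 bus transfers at ~25 GB/s in practice. Sounds fast? An A100 GPU computes float16 matmuls at up to 312 TFLOPS. For a [4096, 4096] float16 matrix (32 MiB), the transfer takes about 1.3 ms, while multiplying it by another matrix of the same size (2 × 4096³ ≈ 1.4 × 10¹¹ floating-point operations) takes about 0.44 ms at peak throughput. The transfer is roughly 3x slower than the computation it feeds.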
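A back-of-the-envelope version of this comparison, as a plain-Python sketch. The helper names are invented here, and the 25 GB/s and 312 TFLOPS figures are the ones quoted above. The matmul cost assumes the standard 2·n³ operation count for multiplying two n×n matrices, running at peak throughput, which real kernels rarely reach.

```python
def transfer_ms(num_bytes, bandwidth_gb_per_s):
    """Time to move num_bytes across a link, in milliseconds."""
    return num_bytes / (bandwidth_gb_per_s * 1e9) * 1e3


def matmul_ms(n, peak_tflops):
    """Time for an n x n by n x n matmul at peak throughput, in milliseconds."""
    flops = 2 * n**3  # one multiply and one add per inner-product term
    return flops / (peak_tflops * 1e12) * 1e3


n = 4096
matrix_bytes = n * n * 2  # float16 = 2 bytes per element
t_copy = transfer_ms(matrix_bytes, 25)
t_math = matmul_ms(n, 312)
print(f"transfer: {t_copy:.2f} ms, matmul: {t_math:.2f} ms, ratio: {t_copy / t_math:.1f}x")
```

Because transfer time grows with n² and matmul time with n³, small matrices are even more transfer-bound than this example.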

python
import torch

# Move tensor to GPU
x = torch.randn(1000, 1000)
x_gpu = x.to('cuda')        # copies to GPU memory
x_gpu = x.cuda()            # same thing
x_gpu = x.to('cuda:0')     # specific GPU

# Create directly on GPU (no transfer!)
y = torch.randn(1000, 1000, device='cuda')

# Move back to CPU
x_cpu = x_gpu.cpu()
x_cpu = x_gpu.to('cpu')

# DANGER: can't mix devices!
# x + x_gpu  --> RuntimeError: tensors on different devices

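The silent killer: By default, .to('cuda') and .cpu() are blocking copies: the CPU waits until the transfer finishes, and .cpu() also has to wait for the GPU work that produces the tensor. If you do this carelessly inside a training loop, the GPU sits idle waiting for data. Keep data that fits on the GPU resident there, or use pin_memory=True in your DataLoader together with non_blocking=True so transfers can overlap with compute.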

Pinned memory is a special CPU memory region that the GPU can access directly via DMA (Direct Memory Access). Normal CPU memory might be swapped to disk, so the driver must first copy it to a pinned staging area. If YOU pin it upfront, the driver skips that step:

python
# Pinned memory for faster transfers
from torch.utils.data import DataLoader

x = torch.randn(1000, 1000).pin_memory()
x_gpu = x.to('cuda', non_blocking=True)  # async transfer!

# DataLoader with pinned memory (dataset is assumed to be defined elsewhere)
loader = DataLoader(dataset, batch_size=32,
                    pin_memory=True,   # allocate batches in pinned RAM
                    num_workers=4)    # parallel data loading

# In training loop:
for batch, labels in loader:
    batch = batch.to('cuda', non_blocking=True)
    labels = labels.to('cuda', non_blocking=True)
    # GPU compute overlaps with next batch's transfer

CUDA streams: A GPU can execute multiple operations concurrently using streams. The default stream serializes everything. You can create additional streams to overlap data transfer with computation — while batch N is computing, batch N+1 is transferring. This is how production training pipelines saturate GPU utilization.

CPU↔GPU Transfer Timeline

[Interactive figure: see where time is spent, transfer vs compute. Increase tensor size to see transfer dominate. Enable pinned memory to see overlap. Controls: Tensor size (32 MB).]

Why should you create tensors directly on GPU with device='cuda' instead of creating on CPU and calling .cuda()?

It uses less GPU memory
It avoids a CPU→GPU transfer entirely
It enables mixed precision automatically

Chapter 7: Tensor Playground — Build & Inspect

Time to put it all together. This interactive playground lets you create tensors, apply operations, and see exactly what happens to shape, strides, memory, and contiguity at each step. Think of it as a visual debugger for tensor operations.

How to use: Start by picking a tensor shape. Then chain operations — each one shows the new shape, strides, memory usage, and whether the result is a view or a copy. Try to predict the output before clicking!

Interactive Tensor Playground

Build a chain of tensor operations. Watch shape, strides, contiguity, and memory change with each step.

Experiments to try in the playground:

Experiment 1 (the transpose trap): Start with [4,4]. Transpose. Check strides: they swap. Now try Reshape as a view. It fails (non-contiguous), just as .view() would in PyTorch. Click Contiguous first, THEN Reshape. Notice it became a copy.

Experiment 2 — Broadcasting memory: Start with [3,1,4]. Click Expand. Watch strides — the expanded dimension gets stride=0. The tensor LOOKS bigger but uses the same memory. Now Clone it — memory jumps because it materializes the expanded data.

Experiment 3 — Dtype impact: Start with [6,8] in float32 (192 bytes). Switch to float16 (96 bytes). Switch to int8 (48 bytes). Same shape, 4× less memory. This is quantization in a nutshell.

The playground shows you the exact internal state PyTorch maintains for every tensor: a data pointer, shape tuple, strides tuple, dtype, and device. Every operation is just a transformation of this metadata — or, for copies, allocation of new memory + metadata.

Chapter 8: Sparse & Quantized — Beyond Dense Floats

Dense tensors waste memory when most elements are zero. A [10000, 10000] matrix with only 1% non-zero values stores 100 million floats — but only 1 million matter. Sparse tensors store only the non-zero elements and their coordinates.

PyTorch supports several sparse formats. The two you'll meet most often are:

Format	Storage	Best For	Memory
COO (Coordinate)	List of (row, col, value) tuples	Construction, random access	3 × nnz
CSR (Compressed Sparse Row)	Row pointers + col indices + values	Row-wise operations, matmul	2 × nnz + nrows + 1
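To make the two layouts concrete, here is a pure-Python sketch that converts a small dense matrix (a list of rows) into each format. `dense_to_coo` and `dense_to_csr` are illustrative helpers, not PyTorch functions, and they return plain lists rather than tensors.

```python
def dense_to_coo(matrix):
    """Return (row_indices, col_indices, values) for the non-zero entries."""
    rows, cols, values = [], [], []
    for i, row in enumerate(matrix):
        for j, v in enumerate(row):
            if v != 0:
                rows.append(i)
                cols.append(j)
                values.append(v)
    return rows, cols, values


def dense_to_csr(matrix):
    """Return (row_pointers, col_indices, values).

    Row i's entries live in values[row_pointers[i]:row_pointers[i + 1]].
    """
    crow, cols, values = [0], [], []
    for row in matrix:
        for j, v in enumerate(row):
            if v != 0:
                cols.append(j)
                values.append(v)
        crow.append(len(values))  # running count of non-zeros so far
    return crow, cols, values


dense = [
    [0.0, 0.0, 3.0],
    [4.0, 0.0, 0.0],
    [0.0, 5.0, 0.0],
]
print(dense_to_coo(dense))  # ([0, 1, 2], [2, 0, 1], [3.0, 4.0, 5.0])
print(dense_to_csr(dense))  # ([0, 1, 2, 3], [2, 0, 1], [3.0, 4.0, 5.0])
```

CSR replaces the per-entry row index with one pointer per row, which is why it stores fewer numbers once rows hold more than one non-zero each.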

python
import torch

# COO format: specify indices and values
indices = torch.tensor([[0, 1, 2],    # row indices
                        [2, 0, 1]])   # col indices
values = torch.tensor([3.0, 4.0, 5.0])
sparse = torch.sparse_coo_tensor(indices, values, (3, 3))

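# Dense version would use 9 floats = 36 bytes
# COO stores 6 indices + 3 values. With 4-byte indices that is also 36 bytes:
# break-even, so COO only wins above ~67% sparsity. PyTorch stores COO
# indices as int64, so this tensor really takes 6*8 + 3*4 = 60 bytes,
# which pushes the break-even point to ~80% sparsity.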

# CSR format (better for computation)
crow = torch.tensor([0, 1, 2, 3])   # row pointers
col = torch.tensor([2, 0, 1])      # column indices
vals = torch.tensor([3.0, 4.0, 5.0])
csr = torch.sparse_csr_tensor(crow, col, vals, (3, 3))

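When to use sparse: Graph adjacency matrices (99%+ zeros). Embedding layers with rare tokens (sparse gradients). Activation maps after ReLU, when they are sparse enough (often 50-90% zeros). A causal attention mask is only ~50% zeros, which is below break-even for COO. Rule of thumb: COO wins when more than ~67% of elements are zero, counting one stored number per index and per value; with PyTorch's 8-byte indices and 4-byte values the threshold is closer to 80%.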
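The break-even point follows directly from bytes stored per non-zero. This plain-Python sketch compares dense and COO storage for 2-D matrices. The helper names are invented, the defaults assume int64 indices and float32 values, and any fixed per-tensor overhead is ignored.

```python
def dense_bytes(shape, value_bytes=4):
    """Bytes for a dense 2-D matrix."""
    rows, cols = shape
    return rows * cols * value_bytes


def coo_bytes(nnz, ndim=2, index_bytes=8, value_bytes=4):
    """Bytes for COO: one index per dimension plus one value per non-zero."""
    return nnz * (ndim * index_bytes + value_bytes)


def break_even_sparsity(ndim=2, index_bytes=8, value_bytes=4):
    """Fraction of zeros above which COO is smaller than dense."""
    density = value_bytes / (ndim * index_bytes + value_bytes)
    return 1 - density


print(dense_bytes((3, 3)))                 # 36
print(coo_bytes(3, index_bytes=4))         # 36
print(coo_bytes(3))                        # 60
print(break_even_sparsity(index_bytes=4))  # about 0.67
print(break_even_sparsity())               # about 0.80
```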

Quantized tensors represent floating-point values using integers plus a scale factor. The idea: if your weights range from -1.5 to +1.5, you can map this to int8 (-128 to +127) with scale = 3.0/255:

q = round(x / scale) + zero_point
x ≈ (q - zero_point) × scale

python
import torch

# Manual quantization
x = torch.randn(4, 4)  # float32 weights
scale = (x.max() - x.min()) / 255
zero_point = int((-x.min() / scale).round())
q = (x / scale + zero_point).round().clamp(0, 255).to(torch.uint8)

# Dequantize
x_approx = (q.float() - zero_point) * scale
print((x - x_approx).abs().max())  # max error ~ scale/2

# PyTorch quantization API
x_q = torch.quantize_per_tensor(x, scale=0.01, zero_point=128,
                                 dtype=torch.quint8)
print(x_q.int_repr())  # the underlying uint8 values (quint8 storage)

Quantization types: Per-tensor uses one scale for the whole tensor (simplest, least accurate). Per-channel uses one scale per output channel (better for conv weights). Per-group (GPTQ, AWQ) uses one scale per N elements (best accuracy, used in LLM quantization). The finer the granularity, the better the approximation — but the more scale factors you store.
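A small pure-Python experiment shows why granularity matters. It applies the affine formula q = round(x / scale) + zero_point to two rows with very different ranges, once with a shared scale and once with a per-row scale. The function names and the sample values are made up for illustration, and this is not how GPTQ or AWQ choose their scales.

```python
def quantize(values, num_levels=256):
    """Affine-quantize a list of floats to integers in [0, num_levels - 1]."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (num_levels - 1) or 1.0  # guard against a constant input
    zero_point = round(-lo / scale)
    q = [min(num_levels - 1, max(0, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    return [(qi - zero_point) * scale for qi in q]


def max_error(values, approx):
    return max(abs(a - b) for a, b in zip(values, approx))


rows = [
    [-1.5, -0.3, 0.2, 1.5],        # wide-range channel
    [-0.01, 0.002, 0.004, 0.01],   # narrow-range channel
]

# Per-tensor: one scale shared by both rows.
flat = [v for row in rows for v in row]
q, s, z = quantize(flat)
err_per_tensor = max_error(rows[1], dequantize(q, s, z)[4:])

# Per-channel: the narrow row gets its own scale.
q1, s1, z1 = quantize(rows[1])
err_per_channel = max_error(rows[1], dequantize(q1, s1, z1))

print(f"narrow row, per-tensor error:  {err_per_tensor:.6f}")
print(f"narrow row, per-channel error: {err_per_channel:.6f}")
```

With a shared scale, the narrow row's values collapse onto two or three integer levels; its own scale spreads them across the full range.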

Sparsity vs Memory Savings

[Interactive figure: see how memory usage compares between dense and sparse formats as sparsity increases. The crossover point is where sparse wins. Controls: Sparsity (50%), Matrix size N×N (100).]

A 1000×1000 matrix has 95% zeros. Approximately how much memory does the COO sparse format save compared to dense float32?

50% savings
75% savings
~85% savings (COO stores ~3 values per non-zero vs 1 per element dense)
95% savings

Chapter 9: Mastery & Connections

You now understand tensors at the byte level — shape, dtype, device, strides, views, copies, broadcasting, and memory formats. This is the foundation every PyTorch operation is built on. Let's consolidate with reference tables and connections to the broader ML engineering stack.

Cheat Sheet: Creation Functions

Function	Description	Notes
`torch.zeros()`	All zeros	Safe initialization
`torch.ones()`	All ones	Mask initialization
`torch.randn()`	Normal N(0,1)	Weight initialization
`torch.rand()`	Uniform [0,1)	Dropout masks
`torch.empty()`	Uninitialized	Fastest (but dangerous!)
`torch.arange()`	Sequence	Position encodings
`torch.eye()`	Identity matrix	Initializations
`torch.from_numpy()`	From numpy	SHARES memory!
`torch.tensor()`	From data	Always copies

Cheat Sheet: Memory Rules

Rule	Details
View operations	reshape (when possible), view, transpose, permute, slice, unsqueeze, expand
Copy operations	clone, contiguous (if needed), to(dtype), arithmetic ops
Contiguous check	Each stride must equal the product of the sizes of all later dimensions
Stride=0 means	Broadcast expansion (no extra memory)
Memory = elements × bytes_per_element	For dense storage. Stride-0 expanded views and sparse tensors occupy less than their shape suggests.

Cheat Sheet: Broadcasting Rules

Step	Rule
1. Align	Right-align shapes, pad shorter with 1s on left
2. Compare	Each dim must match OR be 1
3. Expand	Dims of size 1 stretch to match (stride=0, no copy)

Where This Goes Next

Autograd: Every tensor can track its computational history (requires_grad=True). When you call .backward(), PyTorch traverses this graph to compute gradients. Understanding tensor memory is prerequisite to understanding autograd — gradients are tensors too, and they double your memory usage.

Training loops: A training step is: load batch (CPU→GPU transfer), forward pass (GPU matmuls in float16), loss computation, backward pass (float32 gradients), optimizer step (update float32 master weights). Every concept from this lesson appears in that one sentence.

Topic	Builds On	Where
Autograd & Backprop	Views, memory, dtype	Next lesson
Mixed Precision Training	float16, bfloat16, GPU transfer	Training loop lesson
Model Parallelism	Device placement, memory budgets	Distributed lesson
Quantization (GPTQ, AWQ)	int8, per-channel scaling	Inference lesson
Flash Attention	Memory layout, contiguity	Transformer lesson

Closing thought: "The purpose of abstraction is not to be vague, but to create a new semantic level in which one can be absolutely precise." — Edsger Dijkstra. Tensors are PyTorch's fundamental abstraction. Master them, and every higher-level concept (modules, optimizers, distributed training) becomes a composition of tensor operations you fully understand.

You have a [B, C, H, W] activation tensor in float32 on GPU. You want to save 50% memory without changing the computation. What do you do?

Move it to CPU
Cast to float16 or bfloat16 (2 bytes instead of 4)
Transpose it
Use a sparse format
