The Complete Beginner's Path

PyTorch Tensors & Computation

From numpy arrays to GPU tensors, broadcasting, views vs copies, memory layout, and dtype precision — the byte-level engineering that makes deep learning possible.

Prerequisites: Basic Python + Familiarity with arrays/lists. That's it.
10 Chapters · 10+ Simulations · 0 Assumed Knowledge

Chapter 0: Why Tensors?

You want to run Llama-2 7B. The model has 7 billion parameters. Each parameter is a float32 number — 4 bytes. That's 7,000,000,000 × 4 = 28 GB just for the weights. Your GPU has 24 GB of VRAM. The model literally does not fit.

But wait — if you store each parameter as a float16 (2 bytes), that's 14 GB. int8 (1 byte)? 7 GB. Suddenly it fits with room for activations. This isn't a theoretical exercise. Understanding tensor memory at the byte level is the difference between "my model runs" and "CUDA out of memory."
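
The arithmetic above is easy to script. The following is a minimal sketch in plain Python; the function name `weight_memory_gb` and the constants are illustrative choices, and it counts weights only (no activations, gradients, or optimizer state).

```python
# Bytes needed to store one parameter in each dtype.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}

def weight_memory_gb(num_params, dtype):
    """Memory for the weights alone, in decimal gigabytes (1 GB = 10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

NUM_PARAMS = 7_000_000_000   # Llama-2 7B, rounded as in the text
VRAM_GB = 24                 # the example GPU from the text

for dtype in BYTES_PER_PARAM:
    gb = weight_memory_gb(NUM_PARAMS, dtype)
    verdict = "fits" if gb <= VRAM_GB else "does NOT fit"
    print(f"{dtype:>8}: {gb:5.1f} GB -> {verdict} in {VRAM_GB} GB")
```

Changing `NUM_PARAMS` or `VRAM_GB` reproduces the same comparison for other models and cards.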

A neural network is just matrix multiplications and element-wise functions. But those matrices live in RAM. GPUs have their own separate memory. And the precision you choose (float32, float16, int8) determines both memory usage AND numerical accuracy. Managing this IS the engineering challenge of deep learning.

The core idea: A tensor is a multi-dimensional array with metadata — its shape, its data type (how many bytes per element), its device (CPU or GPU), and its memory layout (how elements are arranged in RAM). Master these four properties and you master PyTorch.
Memory Budget Calculator

See how dtype choice determines whether a model fits in GPU memory. The red line is your GPU's VRAM limit.

Controls: Parameters = 7B, GPU VRAM = 24 GB

The visualization makes it visceral: dtype choice can decide whether your training run starts at all. A 70B model in float32 needs 280 GB, several times the 80 GB of the largest A100. In int4? 35 GB. The weights now fit on a single A100.

Key insight: When ML engineers say "quantization," they mean reducing the bytes-per-parameter. When they say "mixed precision," they mean using float16 for most ops and float32 only where needed. Both are tensor-level decisions.
A model has 1 billion parameters stored as float32. How much memory does it use?

Chapter 1: Tensor Basics — Shape, Dtype, Device

A tensor is a multi-dimensional array with three essential properties: its shape (how many elements along each dimension), its dtype (how each element is stored in memory), and its device (which hardware holds the data). Every operation in PyTorch flows through these three facts.

Let's create our first tensor and interrogate its properties:

python
import torch

# Create a 2x3 tensor of zeros
x = torch.zeros(2, 3)
print(x.shape)   # torch.Size([2, 3])
print(x.dtype)   # torch.float32
print(x.device)  # cpu

# Memory usage: 2 * 3 * 4 bytes = 24 bytes
print(x.element_size())  # 4 (bytes per float32)
print(x.nelement())      # 6 (total elements)
print(x.element_size() * x.nelement())  # 24 bytes

Let's break this down. A shape of [2, 3] means 2 rows and 3 columns — 6 elements total. Each element is float32 (the default), which takes 4 bytes. So the entire tensor occupies exactly 2 × 3 × 4 = 24 bytes of contiguous memory.

Think of it this way: Shape tells you how many elements. Dtype tells you how big each element is. Multiply them and you get total memory. Device tells you where that memory lives.

Here are the most common creation functions:

python
# Common creation functions
a = torch.zeros(3, 4)           # all zeros
b = torch.ones(3, 4)            # all ones
c = torch.randn(3, 4)           # normal distribution N(0,1)
d = torch.rand(3, 4)            # uniform [0, 1)
e = torch.arange(0, 12)         # [0, 1, 2, ..., 11]
f = torch.tensor([[1,2],[3,4]]) # from Python list
g = torch.empty(3, 4)           # uninitialized (garbage values!)

# From numpy (SHARES memory with the array!)
import numpy as np
arr = np.array([1.0, 2.0, 3.0])
t = torch.from_numpy(arr)  # shared memory: modify one, both change
Danger: torch.from_numpy() creates a tensor that SHARES memory with the numpy array. If you modify the numpy array, the tensor changes too. Use torch.tensor(arr) for an independent copy.
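
A short check makes the difference concrete. This is a small sketch using only `torch.from_numpy` and `torch.tensor` as described above; the variable names `shared` and `copied` are illustrative.

```python
import numpy as np
import torch

arr = np.array([1.0, 2.0, 3.0])
shared = torch.from_numpy(arr)   # tensor over the same buffer as arr
copied = torch.tensor(arr)       # independent copy of the data

arr[0] = 100.0                   # mutate the NumPy array in place

print(shared)  # the first element follows the change
print(copied)  # the first element is still 1.0
```

Note that both tensors inherit NumPy's default float64 dtype rather than PyTorch's float32 default.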

Worked example: how much memory does a batch of 32 RGB images at 224×224 resolution use?

Shape: [32, 3, 224, 224]
Elements: 32 × 3 × 224 × 224 = 4,816,896
Memory (float32): 4,816,896 × 4 bytes = 19,267,584 bytes ≈ 19.3 MB (18.4 MiB)
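
The same numbers can be read off a real tensor. A minimal sketch, assuming the batch is allocated with `torch.empty` (its values do not matter for a size calculation):

```python
import torch

batch = torch.empty(32, 3, 224, 224)   # float32 by default; contents are irrelevant here
num_bytes = batch.nelement() * batch.element_size()

print(batch.nelement())    # total element count
print(num_bytes)           # total bytes
print(num_bytes / 1e6)     # decimal megabytes (MB)
print(num_bytes / 2**20)   # binary mebibytes (MiB)

# The same shape in float16 needs half as many bytes.
half_bytes = batch.nelement() * torch.finfo(torch.float16).bits // 8
print(half_bytes)
```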
Tensor Size Calculator

Enter dimensions and see memory usage for different dtypes. Each colored bar shows bytes consumed.

Controls: Dim 0 = 2, Dim 1 = 3, Dim 2 = 1
A tensor has shape [4, 256, 256] and dtype float16. How many bytes does it consume?

Chapter 2: Memory Layout — Strides Explain Everything

Your computer's memory is one-dimensional — a long tape of bytes at consecutive addresses. But tensors are multi-dimensional. How does a 2D matrix live in a 1D tape? The answer is strides.

A stride tells you how many elements to skip in memory to move one step along each dimension. For a [3, 4] tensor stored row-by-row (the default, called row-major or C-contiguous order), the strides are (4, 1). That means: to move to the next row, skip 4 elements. To move to the next column, skip 1 element.

python
import torch

x = torch.arange(12).reshape(3, 4)
print(x)
# tensor([[ 0,  1,  2,  3],
#         [ 4,  5,  6,  7],
#         [ 8,  9, 10, 11]])

print(x.stride())  # (4, 1)
# stride[0]=4: jump 4 elements to go from row i to row i+1
# stride[1]=1: jump 1 element to go from col j to col j+1

# Element at [i, j] lives at memory offset: i*stride[0] + j*stride[1]
# x[2, 1] is at offset 2*4 + 1*1 = 9 --> value is 9 ✓
The stride formula: For any index [i, j, k, ...], the memory offset is i×stride[0] + j×stride[1] + k×stride[2] + ... This single formula governs ALL of PyTorch's memory access patterns.
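
The formula generalizes to any number of dimensions. Below is a small sketch that applies it to a 3D tensor and checks the result against the tensor's flattened memory; `memory_offset` is a hypothetical helper name, not a PyTorch function.

```python
import torch

def memory_offset(index, strides):
    """Offset (in elements) of a multi-dimensional index, given per-dimension strides."""
    return sum(i * s for i, s in zip(index, strides))

x = torch.arange(24).reshape(2, 3, 4)   # contiguous, strides (12, 4, 1)
flat = x.flatten()                      # the same elements in memory order

for idx in [(0, 0, 0), (1, 2, 3), (1, 0, 2)]:
    off = memory_offset(idx, x.stride())
    print(idx, "->", off, "value:", flat[off].item())
```

Because `x` was built with `arange`, each value equals its own memory offset, which makes the mapping easy to verify by eye.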

Column-major (Fortran-contiguous) is the opposite: elements in the same column are contiguous. For a [3, 4] tensor stored column-major, strides would be (1, 3) — jump 1 to move down a row, jump 3 to move right a column. PyTorch defaults to row-major. NumPy lets you choose.
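
PyTorch has no layout flag for dense column-major storage, but the same arrangement can be produced with strides. The sketch below is one way to do it (transpose, materialize, transpose back); it is an illustration, not the only route.

```python
import torch

x = torch.arange(12).reshape(3, 4)     # row-major, strides (4, 1)

# Transpose, copy into that transposed layout, then transpose back:
# same shape and values as x, but each column is now contiguous in memory.
col_major = x.t().contiguous().t()

print(col_major.shape)             # torch.Size([3, 4])
print(col_major.stride())          # (1, 3)
print(torch.equal(x, col_major))   # True: identical values
print(col_major.is_contiguous())   # False: is_contiguous() means row-major
```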

Here's the key insight: transposing a tensor just swaps the strides. No data is copied.

python
x = torch.arange(12).reshape(3, 4)
print(x.stride())    # (4, 1) -- row-major

y = x.t()            # transpose
print(y.shape)       # torch.Size([4, 3])
print(y.stride())    # (1, 4) -- strides swapped!
print(y.is_contiguous())  # False

# Same underlying data, different view
print(x.data_ptr() == y.data_ptr())  # True -- same memory!
Interactive Stride Explorer

Click any cell in the 2D grid to see which memory address it maps to. Change strides to see how indexing changes. Orange = selected cell. Teal = memory position.

Controls: Stride[0] (row) = 4, Stride[1] (col) = 1
Why strides matter: CPUs and GPUs fetch memory in fixed-size chunks called cache lines (typically 64 bytes on CPUs and 128 bytes on NVIDIA GPUs). If your access pattern jumps around in memory (non-contiguous), most of each fetched line is wasted and you pay for cache misses. Contiguous access = fast. Non-contiguous = slow. This is why .contiguous() exists: it copies the data into sequential order so later reads are fast.
A tensor has shape [3, 4] and strides (4, 1). What memory offset does element [1, 2] map to?

Chapter 3: Views vs Copies — Shared Memory Traps

When you reshape, transpose, or slice a tensor in PyTorch, does it copy the data? Usually no. Most operations create a view — a new tensor object that points to the SAME underlying memory with different metadata (shape, strides, offset). This is fast (O(1) instead of O(n)) but dangerous: modifying the view modifies the original.

Operations that create views (shared memory):

python
x = torch.arange(12).reshape(3, 4)

# ALL of these share memory with x:
a = x.view(6, 2)       # reshape (must be contiguous)
b = x.reshape(6, 2)    # reshape (may copy if non-contiguous)
c = x.t()              # transpose
d = x[0]               # indexing a row
e = x[:, 1]           # indexing a column
f = x.expand(5,3,4)   # broadcast expand (stride=0!)
g = x.permute(1,0)    # dimension reorder
h = x.unsqueeze(0)    # add dimension

# Proof: modify the view, original changes
a[0, 0] = 999
print(x[0, 0])  # tensor(999) -- x changed too!

Operations that create copies (independent memory):

python
# These return tensors with their own memory:
a = x.clone()          # explicit deep copy
b = x.contiguous()     # copies only if x is non-contiguous (otherwise returns x itself)
c = x + 0              # arithmetic allocates a new result tensor
d = x.to(torch.float16)  # dtype cast = new allocation
e = x.clone().detach()  # preferred over torch.tensor(x), which also copies but emits a UserWarning
The view/copy trap: .reshape() returns a view when possible but silently copies when it can't (non-contiguous input). .view() is strict — it raises an error if a view isn't possible. Use .view() when you NEED shared memory, .reshape() when you don't care.
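
The difference is easy to observe on a transposed tensor. A minimal sketch, using `data_ptr()` comparisons to tell a view from a copy:

```python
import torch

x = torch.arange(12).reshape(3, 4)
y = x.t()                      # non-contiguous view of x

try:
    y.view(12)                 # strict: no copy allowed
    view_failed = False
except RuntimeError as err:
    view_failed = True
    print("view() refused:", err)

z = y.reshape(12)              # succeeds, but only by copying
print(z.data_ptr() == x.data_ptr())   # False: z owns new memory

w = x.reshape(12)              # contiguous input: reshape returns a view
print(w.data_ptr() == x.data_ptr())   # True: same memory as x
```

If a later in-place write is supposed to reach the original tensor, the `.view()` failure is the useful outcome: it surfaces the problem instead of silently writing into a copy.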

The key question: when is a tensor non-contiguous? After any operation that changes strides without moving data:

python
x = torch.arange(12).reshape(3, 4)
print(x.is_contiguous())  # True -- strides (4,1), data is sequential

y = x.t()  # transpose: shape [4,3], strides (1,4)
print(y.is_contiguous())  # False!

# Why? For y to be contiguous with shape [4,3], strides must be (3,1)
# But y has strides (1,4) -- elements aren't sequential in memory

# Fix it:
z = y.contiguous()  # copies data into new sequential memory
print(z.stride())   # (3, 1) -- now contiguous
print(z.data_ptr() == x.data_ptr())  # False -- different memory
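
The rule "strides must be (3, 1) for shape [4, 3]" follows from a simple recurrence. The sketch below computes row-major strides for any shape in plain Python; both function names are hypothetical helpers, and the check deliberately ignores the special case of size-1 dimensions.

```python
def contiguous_strides(shape):
    """Row-major strides: each stride is the product of all later dimension sizes."""
    strides = []
    running = 1
    for size in reversed(shape):
        strides.append(running)
        running *= size
    return tuple(reversed(strides))

def is_row_major(shape, strides):
    # Simplification: PyTorch additionally ignores the stride of size-1 dimensions.
    return tuple(strides) == contiguous_strides(shape)

print(contiguous_strides((3, 4)))      # (4, 1)
print(contiguous_strides((4, 3)))      # (3, 1)
print(contiguous_strides((2, 3, 4)))   # (12, 4, 1)
print(is_row_major((4, 3), (1, 4)))    # False: the transposed tensor y
```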
View vs Copy Detector

Watch how operations affect memory sharing. Teal boxes share memory (view). Orange boxes have their own memory (copy). Click operations to apply them.

Performance insight: Views are O(1) — they just create new metadata. Copies are O(n) — they allocate and fill new memory. A model with billions of parameters can't afford unnecessary copies. This is why PyTorch engineers obsess over views.
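
A rough view-or-copy check can be built from `data_ptr()` alone. The helper below is a heuristic sketch with a made-up name: it assumes the base tensor is contiguous and still alive, and only tests whether the other tensor's first element lies inside the base tensor's buffer.

```python
import torch

def shares_memory_with(base, t):
    """Heuristic: does t's first element live inside base's buffer?
    Assumes base is contiguous and has not been freed."""
    start = base.data_ptr()
    end = start + base.nelement() * base.element_size()
    return start <= t.data_ptr() < end

x = torch.arange(12).reshape(3, 4)

candidates = {
    "x[1]": x[1],
    "x.t()": x.t(),
    "x[:, 2]": x[:, 2],
    "x.clone()": x.clone(),
    "x + 0": x + 0,
    "x.t().contiguous()": x.t().contiguous(),
}
for name, t in candidates.items():
    kind = "view" if shares_memory_with(x, t) else "copy"
    print(f"{name:>20}: {kind}")
```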
You transpose a tensor and then call .view(). What happens?

Chapter 4: Broadcasting — Implicit Shape Alignment

You have a [3, 4] matrix and want to add a [4] vector to each row. In raw code, you'd write a loop. But PyTorch does it automatically through broadcasting — the rules for stretching tensors to compatible shapes for element-wise operations.

The broadcasting rules are simple but must be followed exactly:

Step 1: Right-align the shapes. Pad the shorter shape with 1s on the left.
Step 2: For each dimension, sizes must match OR one of them must be 1.
Step 3: The dimension of size 1 is "stretched" to match the other.

python
import torch

# Example 1: [3,4] + [4] --> works!
A = torch.ones(3, 4)   # shape [3, 4]
b = torch.arange(4)    # shape [4]
# Step 1: pad b to [1, 4]
# Step 2: dim 0: 3 vs 1 ✓ (1 stretches)  dim 1: 4 vs 4 ✓ (match)
# Result: [3, 4]
result = A + b  # b is added to each row

# Example 2: [3,1] + [1,4] --> [3,4]!
col = torch.tensor([[1],[2],[3]])  # shape [3, 1]
row = torch.tensor([[10,20,30,40]])  # shape [1, 4]
# dim 0: 3 vs 1 ✓   dim 1: 1 vs 4 ✓
result = col + row  # outer sum! shape [3, 4]
# [[11,21,31,41],
#  [12,22,32,42],
#  [13,23,33,43]]

# Example 3: [3,4] + [3] --> ERROR!
# Step 1: pad to [1, 3]
# Step 2: dim 1: 4 vs 3 -- neither is 1! FAIL
Common broadcasting mistake: Trying to add a [3] vector to a [3, 4] matrix expecting it to broadcast along columns. It doesn't — PyTorch right-aligns, so [3] becomes [1, 3] and fails against dim 1=4. You need to reshape to [3, 1] first: vec.unsqueeze(1) or vec[:, None].
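
The three steps translate almost line for line into code. The following is a pure-Python sketch of the shape logic only (no data is touched); `broadcast_shape` is a hypothetical helper written for illustration.

```python
def broadcast_shape(shape_a, shape_b):
    """Apply the three broadcasting steps to two shapes; raise ValueError if incompatible."""
    # Step 1: right-align by padding the shorter shape with 1s on the left.
    ndim = max(len(shape_a), len(shape_b))
    a = (1,) * (ndim - len(shape_a)) + tuple(shape_a)
    b = (1,) * (ndim - len(shape_b)) + tuple(shape_b)

    result = []
    for dim, (m, n) in enumerate(zip(a, b)):
        # Step 2: sizes must match, or one of them must be 1.
        if m != n and m != 1 and n != 1:
            raise ValueError(f"dim {dim}: {m} vs {n} cannot broadcast")
        # Step 3: the size-1 side stretches to the other size.
        result.append(n if m == 1 else m)
    return tuple(result)

print(broadcast_shape((3, 4), (4,)))     # (3, 4)
print(broadcast_shape((3, 1), (1, 4)))   # (3, 4)
print(broadcast_shape((3, 4), (3, 1)))   # (3, 4): the unsqueeze(1) fix
```

PyTorch ships the same check as `torch.broadcast_shapes`, which is handy for validating shapes before running an expensive operation.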

Broadcasting is lazy — it never actually copies data. Under the hood, it uses stride=0 for stretched dimensions. A [3, 1] tensor broadcast to [3, 4] has strides (1, 0) — moving along the stretched dimension doesn't advance in memory. Zero memory cost!

python
x = torch.tensor([[1],[2],[3]])  # shape [3,1], strides (1,1)
y = x.expand(3, 4)              # shape [3,4], strides (1,0)!
print(y.stride())  # (1, 0) -- stride 0 means "repeat this element"
print(y)
# tensor([[1, 1, 1, 1],
#         [2, 2, 2, 2],
#         [3, 3, 3, 3]])
# Only 3 elements in memory, but LOOKS like 12!
Broadcasting Visualizer

Watch how two tensors align and stretch for an element-wise add. Orange = tensor A. Teal = tensor B. Purple = result.

Real-world usage: Broadcasting is everywhere in deep learning. Batch normalization subtracts a per-channel mean, reshaped from [C] to [1, C, 1, 1], from a [B, C, H, W] tensor (a bare [C] vector would right-align against W and fail). Attention masking adds a [1, 1, seq, seq] mask to a [B, heads, seq, seq] score. Without broadcasting, you'd need explicit expand() calls, or worse, loops.
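
Both patterns fit in a few lines. This is a sketch with made-up tensor sizes; it shows the per-channel mean being reshaped so that it lines up with the channel dimension, and a causal mask broadcasting over batch and heads.

```python
import torch

torch.manual_seed(0)
x = torch.randn(8, 3, 16, 16)                 # [B, C, H, W]

mean = x.mean(dim=(0, 2, 3))                  # shape [C]: one mean per channel
centered = x - mean.view(1, -1, 1, 1)         # [1, C, 1, 1] broadcasts over B, H, W

print(mean.shape)                             # torch.Size([3])
print(centered.shape)                         # torch.Size([8, 3, 16, 16])
print(centered.mean(dim=(0, 2, 3)))           # close to zero for every channel

# Causal attention mask: [1, 1, seq, seq] added to [B, heads, seq, seq] scores.
scores = torch.randn(2, 4, 5, 5)
mask = torch.full((5, 5), float("-inf")).triu(diagonal=1)[None, None]
masked = scores + mask                        # broadcasts over batch and heads
print(mask.shape, masked.shape)
```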
Can you add a tensor of shape [2, 3] to a tensor of shape [3, 2]?

Chapter 5: Dtypes & Precision — Every Bit Counts

A dtype (data type) determines how each number is encoded in binary. More bits = more precision = more memory. The four dtypes you'll encounter in deep learning are:

| Dtype | Bits | Bytes | Range | Precision | Use Case |
|---|---|---|---|---|---|
| float32 | 32 | 4 | ±3.4×10^38 | ~7 decimal digits | Default training |
| float16 | 16 | 2 | ±65,504 | ~3 decimal digits | Mixed precision fwd |
| bfloat16 | 16 | 2 | ±3.4×10^38 | ~2 decimal digits | Training (Google TPU) |
| int8 | 8 | 1 | -128 to 127 | Exact integers | Quantized inference |

Key insight (bfloat16 vs float16): float16 has 5 exponent bits and 10 mantissa bits. bfloat16 has 8 exponent bits and 7 mantissa bits. Same size, but bfloat16 covers the same RANGE as float32, so it does not overflow once values pass float16's maximum of 65,504. It just has less precision. For training, range usually matters more than precision, because gradients can be very large or very small.
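
These limits do not have to be memorized: `torch.finfo` and `torch.iinfo` report them. A short sketch that prints the width, largest finite value, and machine epsilon for each floating-point dtype discussed here:

```python
import torch

for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    # max: largest finite value; eps: gap between 1.0 and the next representable number
    print(f"{str(dtype):>15}  bits={info.bits:2d}  max={info.max:.3e}  eps={info.eps:.3e}")

int8_info = torch.iinfo(torch.int8)
print(int8_info.min, int8_info.max)   # -128 127
```

The `eps` column is where the mantissa width shows up: bfloat16's gap near 1.0 is eight times larger than float16's.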

The floating-point format is: (-1)^sign × 2^exponent × (1 + mantissa). More exponent bits = bigger range. More mantissa bits = finer granularity between representable numbers.
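
To see the three fields in an actual number, the standard library can pack a value as IEEE float16 and expose its bits. The sketch below makes one detail explicit that the formula glosses over: the stored exponent carries a fixed bias (15 for float16), and the mantissa is the fraction after an implicit leading 1. It handles normal numbers only, and the helper names are illustrative.

```python
import struct

def decode_float16(value):
    """Split a number, stored as IEEE float16, into (sign, exponent bits, mantissa bits)."""
    (bits,) = struct.unpack("<H", struct.pack("<e", value))
    sign = bits >> 15
    exponent = (bits >> 10) & 0x1F     # 5 exponent bits
    mantissa = bits & 0x3FF            # 10 mantissa bits
    return sign, exponent, mantissa

def rebuild(sign, exponent, mantissa):
    # Normal numbers only: subtract the bias of 15, add the implicit leading 1.
    return (-1) ** sign * 2.0 ** (exponent - 15) * (1 + mantissa / 1024)

for v in (1.0, 1.5, -6.25, 65504.0):
    s, e, m = decode_float16(v)
    print(v, "->", (s, e, m), "->", rebuild(s, e, m))
```

The last value, 65504, uses the largest normal exponent and an all-ones mantissa, which is why it is float16's maximum.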

python
import torch

# Dtype casting
x = torch.randn(3, 3)                # float32 by default
x16 = x.half()                        # float16
xbf = x.bfloat16()                    # bfloat16
x8 = x.to(torch.int8)                 # int8 (truncates!)

# float16 overflow danger:
big = torch.tensor(70000.0)
print(big.half())      # tensor(inf, dtype=torch.float16) -- overflow! 70000 > 65504
print(big.bfloat16())  # tensor(70144., dtype=torch.bfloat16) -- in range, but rounded to the nearest multiple of 512

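# Mixed precision training pattern (sketch: model, inputs, target,
# criterion, and optimizer are assumed to exist and live on the GPU)
scaler = torch.amp.GradScaler("cuda")  # older releases: torch.cuda.amp.GradScaler()
with torch.autocast(device_type="cuda", dtype=torch.float16):
    # Eligible ops (matmul, conv) run in float16, often ~2x faster on GPU;
    # the weights themselves stay in float32
    output = model(inputs)
    loss = criterion(output, target)
# Scale the loss so small float16 gradients do not underflow to zero
scaler.scale(loss).backward()
scaler.step(optimizer)  # unscales the gradients, then updates the float32 weights
scaler.update()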
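The precision trap: When you accumulate many small values in float16, rounding error builds up and eventually swallows the updates entirely. Once a running sum reaches 4.0, adding 0.001 changes nothing, because neighbouring float16 values there are about 0.004 apart. This is why mixed-precision training keeps master weights in float32: the optimizer (Adam, SGD) applies its small updates to the float32 copy, and only the forward and backward computation runs in reduced precision.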
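
The effect is easy to reproduce on the CPU. The sketch below adds the same small step 10,000 times to a float16 accumulator and to a float32 accumulator; the step size and iteration count are arbitrary choices for illustration.

```python
import torch

step = torch.tensor(0.001, dtype=torch.float16)

# One addition: near 4.0, neighbouring float16 values are about 0.0039 apart,
# so a step of 0.001 is rounded away.
four = torch.tensor(4.0, dtype=torch.float16)
print((four + step) == four)          # tensor(True)

# Repeated addition in float16 versus float32.
acc16 = torch.tensor(0.0, dtype=torch.float16)
acc32 = torch.tensor(0.0, dtype=torch.float32)
for _ in range(10_000):
    acc16 = acc16 + step              # result is rounded to float16 every time
    acc32 = acc32 + step.float()      # same step, float32 accumulator

print(acc16.item())   # stalls far below the true total of about 10
print(acc32.item())   # close to 10
```

The float16 accumulator stops growing once the step is smaller than half the spacing between representable values, which is the failure that a float32 master copy avoids.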
Precision Comparison: Sine Wave in Different Dtypes

The same sine wave stored at different precisions. Watch where float16 clips (amplitude > 65504) and where int8 staircase appears.

Controls: Amplitude = 100, Frequency = 3

Notice: at amplitude=100, all dtypes look the same. At amplitude=70000, float16 clips to infinity while bfloat16 tracks correctly (with visible stairstepping from low precision). At any amplitude, int8 looks like a staircase because it can only represent 256 distinct values.

Why is bfloat16 preferred over float16 for training?

Chapter 6: GPU Tensors — The Transfer Bottleneck

Your CPU has RAM (system memory). Your GPU has VRAM (video memory). These are physically separate pools of memory connected by a PCIe bus (or NVLink on high-end systems). Moving data between them is the #1 performance bottleneck in deep learning.

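A PCIe 4.0 x16 bus transfers at ~25 GB/s. Sounds fast? An A100 GPU computes float16 matmuls at up to 312 TFLOPS. For a [4096, 4096] float16 matrix (32 MiB), the transfer takes about 1.3 ms, while multiplying two such matrices (2 × 4096³ ≈ 1.4×10^11 floating-point operations) takes about 0.44 ms at peak throughput. The transfer is roughly 3x slower than the computation it feeds.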
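
A back-of-the-envelope version of this comparison, in plain Python. Both rates are the round figures quoted above, the matmul is assumed to multiply two square matrices of the stated size, and real kernels rarely reach peak throughput, so treat the result as an order-of-magnitude estimate.

```python
# Round figures from the text; real hardware will differ.
PCIE_BYTES_PER_S = 25e9        # ~25 GB/s over PCIe 4.0 x16
A100_FLOPS = 312e12            # peak float16 tensor-core throughput

n = 4096
matrix_bytes = n * n * 2                 # float16 = 2 bytes per element
matmul_flops = 2 * n ** 3                # n*n outputs, each needing n multiplies and n adds

transfer_ms = matrix_bytes / PCIE_BYTES_PER_S * 1e3
matmul_ms = matmul_flops / A100_FLOPS * 1e3

print(f"matrix size : {matrix_bytes / 2**20:.0f} MiB")
print(f"transfer    : {transfer_ms:.2f} ms")
print(f"matmul      : {matmul_ms:.2f} ms")
print(f"ratio       : {transfer_ms / matmul_ms:.1f}x")
```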

python
import torch

# Move tensor to GPU
x = torch.randn(1000, 1000)
x_gpu = x.to('cuda')        # copies to GPU memory
x_gpu = x.cuda()            # same thing
x_gpu = x.to('cuda:0')     # specific GPU

# Create directly on GPU (no transfer!)
y = torch.randn(1000, 1000, device='cuda')

# Move back to CPU
x_cpu = x_gpu.cpu()
x_cpu = x_gpu.to('cpu')

# DANGER: can't mix devices!
# x + x_gpu  --> RuntimeError: tensors on different devices
The silent killer: By default, .to('cuda') and .cpu() are synchronous: the CPU thread blocks until the copy finishes, and .cpu() additionally waits for pending GPU work. Scatter these calls through a training loop and the GPU sits idle while it waits for data. Load anything that fits onto the GPU ONCE, and for per-batch transfers use pin_memory=True in your DataLoader together with non_blocking=True.
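
A common way to keep device handling tidy is to pick the device once and create or move everything relative to it. The sketch below falls back to the CPU when no GPU is present, so it runs anywhere; the tensor sizes and the three-step loop are placeholders.

```python
import torch

# Falls back to the CPU so the sketch also runs on machines without a GPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

weights = torch.randn(256, 256, device=device)   # created in place, no host copy

for step in range(3):
    batch = torch.randn(32, 256)                  # stands in for data loaded on the CPU
    batch = batch.to(device)                      # one transfer per batch
    out = batch @ weights                         # every operand is on the same device

result = out.sum().item()                         # .item() copies one scalar back to the CPU
print(out.device, result)
```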

Pinned memory is a special CPU memory region that the GPU can access directly via DMA (Direct Memory Access). Normal CPU memory might be swapped to disk, so the driver must first copy it to a pinned staging area. If YOU pin it upfront, the driver skips that step:

python
# Pinned memory for faster transfers
x = torch.randn(1000, 1000).pin_memory()
x_gpu = x.to('cuda', non_blocking=True)  # async transfer!

# DataLoader with pinned memory (dataset is assumed to be an existing Dataset)
from torch.utils.data import DataLoader
loader = DataLoader(dataset, batch_size=32,
                    pin_memory=True,   # allocate batches in pinned RAM
                    num_workers=4)    # parallel data loading

# In training loop:
for batch, labels in loader:
    batch = batch.to('cuda', non_blocking=True)
    labels = labels.to('cuda', non_blocking=True)
    # GPU compute overlaps with next batch's transfer
CUDA streams: A GPU can execute multiple operations concurrently using streams. The default stream serializes everything. You can create additional streams to overlap data transfer with computation — while batch N is computing, batch N+1 is transferring. This is how production training pipelines saturate GPU utilization.
CPU↔GPU Transfer Timeline

See where time is spent: transfer vs compute. Increase tensor size to see transfer dominate. Enable pinned memory to see overlap.

Controls: Tensor size = 32 MB
Why should you create tensors directly on GPU with device='cuda' instead of creating on CPU and calling .cuda()?

Chapter 7: Tensor Playground — Build & Inspect

Time to put it all together. This interactive playground lets you create tensors, apply operations, and see exactly what happens to shape, strides, memory, and contiguity at each step. Think of it as a visual debugger for tensor operations.

How to use: Start by picking a tensor shape. Then chain operations — each one shows the new shape, strides, memory usage, and whether the result is a view or a copy. Try to predict the output before clicking!
Interactive Tensor Playground

Build a chain of tensor operations. Watch shape, strides, contiguity, and memory change with each step.

Experiments to try in the playground:

Experiment 1 (the transpose trap): Start with [4,4]. Transpose. Check strides: they swap. Now try Reshape. It fails (non-contiguous), just as .view() does in PyTorch; PyTorch's own .reshape() would silently copy instead. Click Contiguous first, THEN Reshape. Notice it became a copy.
Experiment 2 — Broadcasting memory: Start with [3,1,4]. Click Expand. Watch strides — the expanded dimension gets stride=0. The tensor LOOKS bigger but uses the same memory. Now Clone it — memory jumps because it materializes the expanded data.
Experiment 3 — Dtype impact: Start with [6,8] in float32 (192 bytes). Switch to float16 (96 bytes). Switch to int8 (48 bytes). Same shape, 4× less memory. This is quantization in a nutshell.

The playground shows you the exact internal state PyTorch maintains for every tensor: a data pointer, shape tuple, strides tuple, dtype, and device. Every operation is just a transformation of this metadata — or, for copies, allocation of new memory + metadata.
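
The same inspection can be done outside the playground. Below is a small sketch of a metadata dump; `describe` is a hypothetical helper, and the steps loosely mirror the three experiments above.

```python
import torch

def describe(t):
    """Collect the metadata PyTorch keeps for a tensor."""
    return {
        "shape": tuple(t.shape),
        "strides": t.stride(),
        "dtype": t.dtype,
        "device": str(t.device),
        "contiguous": t.is_contiguous(),
        "data_ptr": t.data_ptr(),
        "bytes_if_dense": t.nelement() * t.element_size(),
    }

x = torch.zeros(4, 4)
steps = {
    "start": x,
    "transpose": x.t(),
    "contiguous": x.t().contiguous(),
    "expand [3,1,4] -> [3,5,4]": torch.zeros(3, 1, 4).expand(3, 5, 4),
    "half [6,8]": torch.zeros(6, 8).half(),
}
for name, t in steps.items():
    print(name, describe(t))
```

Note that `bytes_if_dense` is what the tensor would occupy if materialized; for the expanded tensor the actual storage is smaller because one dimension has stride 0.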

Chapter 8: Sparse & Quantized — Beyond Dense Floats

Dense tensors waste memory when most elements are zero. A [10000, 10000] matrix with only 1% non-zero values stores 100 million floats — but only 1 million matter. Sparse tensors store only the non-zero elements and their coordinates.

The two sparse formats you will meet most often in PyTorch are:

| Format | Storage | Best For | Memory (stored numbers) |
|---|---|---|---|
| COO (Coordinate) | List of (row, col, value) tuples | Construction, random access | 3 × nnz |
| CSR (Compressed Sparse Row) | Row pointers + col indices + values | Row-wise operations, matmul | 2 × nnz + nrows + 1 |

python
import torch

# COO format: specify indices and values
indices = torch.tensor([[0, 1, 2],    # row indices
                        [2, 0, 1]])   # col indices
values = torch.tensor([3.0, 4.0, 5.0])
sparse = torch.sparse_coo_tensor(indices, values, (3, 3))

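# Dense version: 9 float32 values = 36 bytes
# Sparse COO: 6 indices + 3 values. PyTorch stores indices as int64,
# so that is 6*8 + 3*4 = 60 bytes: larger than dense for this tiny matrix!
# Counting stored numbers (3 per non-zero), sparse wins above ~67% sparsity;
# in bytes, with int64 indices and float32 values, break-even is ~80%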

# CSR format (better for computation)
crow = torch.tensor([0, 1, 2, 3])   # row pointers
col = torch.tensor([2, 0, 1])      # column indices
vals = torch.tensor([3.0, 4.0, 5.0])
csr = torch.sparse_csr_tensor(crow, col, vals, (3, 3))
When to use sparse: Attention masks (causal mask is 50% zeros). Graph adjacency matrices (99%+ zeros). Embedding layers with rare tokens. Activation maps after ReLU (often 50-90% zeros). Rule of thumb: sparse format wins when >67% of elements are zero.
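
The break-even point can be measured rather than estimated. The sketch below sums the bytes held by a COO tensor's index and value buffers and compares them with the dense size; `coo_bytes` and `dense_bytes` are hypothetical helpers, and the result reflects PyTorch's int64 indices rather than an idealized count of stored numbers.

```python
import torch

def coo_bytes(sparse):
    """Bytes held by the index and value buffers of a COO tensor."""
    s = sparse.coalesce()
    idx, val = s.indices(), s.values()
    return idx.nelement() * idx.element_size() + val.nelement() * val.element_size()

def dense_bytes(t):
    return t.nelement() * t.element_size()

# A 3x3 float32 matrix with 3 non-zeros out of 9 elements.
small = torch.zeros(3, 3)
small[0, 2], small[1, 0], small[2, 1] = 3.0, 4.0, 5.0
print(dense_bytes(small), coo_bytes(small.to_sparse()))   # 36 vs 60

# A 1000x1000 matrix with one non-zero per row (99.9% zeros).
big = torch.zeros(1000, 1000)
big[torch.arange(1000), torch.arange(1000)] = 1.0
print(dense_bytes(big), coo_bytes(big.to_sparse()))       # 4000000 vs 20000
```

Each non-zero costs 20 bytes here (two int64 indices plus one float32 value) against 4 bytes per dense element, so in bytes the sparse form only pays off once fewer than about one element in five is non-zero.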

Quantized tensors represent floating-point values using integers plus a scale factor. The idea: if your weights range from -1.5 to +1.5, you can map this to int8 (-128 to +127) with scale = 3.0/255:

q = round(x / scale) + zero_point
x ≈ (q - zero_point) × scale
python
import torch

# Manual quantization
x = torch.randn(4, 4)  # float32 weights
scale = (x.max() - x.min()) / 255
zero_point = int((-x.min() / scale).round())
q = (x / scale + zero_point).round().clamp(0, 255).to(torch.uint8)

# Dequantize
x_approx = (q.float() - zero_point) * scale
print((x - x_approx).abs().max())  # max error ~ scale/2

# PyTorch quantization API
x_q = torch.quantize_per_tensor(x, scale=0.01, zero_point=128,
                                 dtype=torch.quint8)
print(x_q.int_repr())  # the underlying uint8 values (quint8 is unsigned)
Quantization types: Per-tensor uses one scale for the whole tensor (simplest, least accurate). Per-channel uses one scale per output channel (better for conv weights). Per-group (GPTQ, AWQ) uses one scale per N elements (best accuracy, used in LLM quantization). The finer the granularity, the better the approximation — but the more scale factors you store.
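
The accuracy gap between per-tensor and per-channel scaling shows up clearly when channels have very different magnitudes. The sketch below uses a simplified symmetric int8 scheme (no zero point) on made-up weights; `fake_quantize` is a hypothetical helper, not the PyTorch quantization API.

```python
import torch

def fake_quantize(w, scale):
    """Symmetric int8 round trip: quantize with the given scale, then dequantize."""
    q = (w / scale).round().clamp(-127, 127)
    return q * scale

torch.manual_seed(0)
# Four output channels whose magnitudes differ by orders of magnitude (made-up data).
channel_scale = torch.tensor([0.01, 0.1, 1.0, 10.0]).unsqueeze(1)
w = torch.randn(4, 64) * channel_scale

per_tensor_scale = w.abs().max() / 127                        # one scalar for everything
per_channel_scale = w.abs().amax(dim=1, keepdim=True) / 127   # shape [4, 1], one per row

err_tensor = (w - fake_quantize(w, per_tensor_scale)).abs().mean()
err_channel = (w - fake_quantize(w, per_channel_scale)).abs().mean()

print(f"per-tensor  mean abs error: {err_tensor.item():.5f}")
print(f"per-channel mean abs error: {err_channel.item():.5f}")
```

With a single scale, the largest channel dictates the step size and the small channels collapse toward zero; a [4, 1] scale tensor broadcasts across each row and avoids that.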
Sparsity vs Memory Savings

See how memory usage compares between dense and sparse formats as sparsity increases. The crossover point is where sparse wins.

Controls: Sparsity = 50%, Matrix size (N×N) = 100
A 1000×1000 matrix has 95% zeros. Approximately how much memory does the COO sparse format save compared to dense float32?

Chapter 9: Mastery & Connections

You now understand tensors at the byte level — shape, dtype, device, strides, views, copies, broadcasting, and memory formats. This is the foundation every PyTorch operation is built on. Let's consolidate with reference tables and connections to the broader ML engineering stack.

Cheat Sheet: Creation Functions

| Function | Description | Notes |
|---|---|---|
| torch.zeros() | All zeros | Safe initialization |
| torch.ones() | All ones | Mask initialization |
| torch.randn() | Normal N(0,1) | Weight initialization |
| torch.rand() | Uniform [0,1) | Dropout masks |
| torch.empty() | Uninitialized | Fastest (but dangerous!) |
| torch.arange() | Sequence | Position encodings |
| torch.eye() | Identity matrix | Initializations |
| torch.from_numpy() | From numpy | SHARES memory! |
| torch.tensor() | From data | Always copies |

Cheat Sheet: Memory Rules

| Rule | Details |
|---|---|
| View operations | view, reshape (when strides allow), transpose, permute, slice, unsqueeze, expand |
| Copy operations | clone, contiguous (if needed), to(dtype), arithmetic ops |
| Contiguous check | Each stride must equal the product of the sizes of all later dimensions |
| Stride=0 means | Broadcast expansion (no extra memory) |
| Memory = elements × bytes_per_element | Holds for dense, materialized tensors; expanded views and sparse tensors store less |

Cheat Sheet: Broadcasting Rules

| Step | Rule |
|---|---|
| 1. Align | Right-align shapes, pad shorter with 1s on left |
| 2. Compare | Each dim must match OR be 1 |
| 3. Expand | Dims of size 1 stretch to match (stride=0, no copy) |

Where This Goes Next

Autograd: Every tensor can track its computational history (requires_grad=True). When you call .backward(), PyTorch traverses this graph to compute gradients. Understanding tensor memory is a prerequisite to understanding autograd: gradients are tensors too, with the same shape and dtype as their parameters, so they roughly double the memory taken by trainable weights.
Training loops: A training step is: load batch (CPU→GPU transfer), forward pass (GPU matmuls in float16), loss computation, backward pass (float32 gradients), optimizer step (update float32 master weights). Every concept from this lesson appears in that one sentence.
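
As a small preview of the autograd point, the sketch below runs one backward pass on a single weight matrix and compares the size of the gradient with the size of the parameter. The shapes and the loss are arbitrary placeholders.

```python
import torch

w = torch.randn(1000, 1000, requires_grad=True)   # 4,000,000 bytes of float32
x = torch.randn(32, 1000)

loss = (x @ w).pow(2).mean()
loss.backward()                                    # fills w.grad

param_bytes = w.nelement() * w.element_size()
grad_bytes = w.grad.nelement() * w.grad.element_size()

print(w.grad.shape, w.grad.dtype)   # same shape and dtype as w
print(param_bytes, grad_bytes)      # equal: the gradient doubles the footprint of w
```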

| Topic | Builds On | Where |
|---|---|---|
| Autograd & Backprop | Views, memory, dtype | Next lesson |
| Mixed Precision Training | float16, bfloat16, GPU transfer | Training loop lesson |
| Model Parallelism | Device placement, memory budgets | Distributed lesson |
| Quantization (GPTQ, AWQ) | int8, per-channel scaling | Inference lesson |
| Flash Attention | Memory layout, contiguity | Transformer lesson |

Closing thought: "The purpose of abstraction is not to be vague, but to create a new semantic level in which one can be absolutely precise." — Edsger Dijkstra. Tensors are PyTorch's fundamental abstraction. Master them, and every higher-level concept (modules, optimizers, distributed training) becomes a composition of tensor operations you fully understand.
You have a [B, C, H, W] activation tensor in float32 on GPU. You want to save 50% memory without changing the computation. What do you do?