From numpy arrays to GPU tensors, broadcasting, views vs copies, memory layout, and dtype precision — the byte-level engineering that makes deep learning possible.
You want to run Llama-2 7B. The model has 7 billion parameters. Each parameter is a float32 number — 4 bytes. That's 7,000,000,000 × 4 = 28 GB just for the weights. Your GPU has 24 GB of VRAM. The model literally does not fit.
But wait — if you store each parameter as a float16 (2 bytes), that's 14 GB. int8 (1 byte)? 7 GB. Suddenly it fits with room for activations. This isn't a theoretical exercise. Understanding tensor memory at the byte level is the difference between "my model runs" and "CUDA out of memory."
A neural network is just matrix multiplications and element-wise functions. But those matrices live in RAM. GPUs have their own separate memory. And the precision you choose (float32, float16, int8) determines both memory usage AND numerical accuracy. Managing this IS the engineering challenge of deep learning.
See how dtype choice determines whether a model fits in GPU memory. The red line is your GPU's VRAM limit.
The visualization makes it visceral: dtype choice is life or death for your training run. A 70B model in float32 needs 280 GB, far more than a single 80 GB A100 or H100 can hold. In int4? 35 GB. The weights fit on one A100.
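The arithmetic behind these numbers is easy to reproduce. The sketch below is a plain-Python estimate of weight storage only (no activations, gradients, or optimizer state); the helper name `weight_memory_gb` is made up for illustration, and it assumes decimal gigabytes and that int4 packs two values per byte.

```python
# Bytes needed per parameter for each storage format.
# Assumption: int4 packs two values into one byte.
BYTES_PER_PARAM = {"float32": 4.0, "float16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for name, params in [("7B", 7e9), ("70B", 70e9)]:
    for dtype in BYTES_PER_PARAM:
        print(f"{name} in {dtype}: {weight_memory_gb(params, dtype):.1f} GB")
```

Comparing each result against your GPU's VRAM gives a quick first answer to "will the weights fit?" before you download anything.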
A tensor is a multi-dimensional array with three essential properties: its shape (how many elements along each dimension), its dtype (how each element is stored in memory), and its device (which hardware holds the data). Every operation in PyTorch flows through these three facts.
Let's create our first tensor and interrogate its properties:
```python
import torch

# Create a 2x3 tensor of zeros
x = torch.zeros(2, 3)
print(x.shape)     # torch.Size([2, 3])
print(x.dtype)     # torch.float32
print(x.device)    # cpu

# Memory usage: 2 * 3 * 4 bytes = 24 bytes
print(x.element_size())                  # 4 (bytes per float32)
print(x.nelement())                      # 6 (total elements)
print(x.element_size() * x.nelement())   # 24 bytes
```
Let's break this down. A shape of [2, 3] means 2 rows and 3 columns — 6 elements total. Each element is float32 (the default), which takes 4 bytes. So the entire tensor occupies exactly 2 × 3 × 4 = 24 bytes of contiguous memory.
Here are the most common creation functions:
```python
import torch
import numpy as np

# Common creation functions
a = torch.zeros(3, 4)             # all zeros
b = torch.ones(3, 4)              # all ones
c = torch.randn(3, 4)             # normal distribution N(0,1)
d = torch.rand(3, 4)              # uniform [0, 1)
e = torch.arange(0, 12)           # [0, 1, 2, ..., 11]
f = torch.tensor([[1,2],[3,4]])   # from Python list
g = torch.empty(3, 4)             # uninitialized (garbage values!)

# From numpy (SHARES memory by default!)
arr = np.array([1.0, 2.0, 3.0])
t = torch.from_numpy(arr)         # shared memory: modify one, both change
```
torch.from_numpy() creates a tensor that SHARES memory with the numpy array. If you modify the numpy array, the tensor changes too. Use torch.tensor(arr) for an independent copy.

Worked example: how much memory does a batch of 32 RGB images at 224×224 resolution use?
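One way to answer the worked example is to multiply the element count by the bytes per element. The sketch below does this in plain Python; the helper `tensor_bytes` and the `(batch, channels, height, width)` layout are assumptions for illustration.

```python
from math import prod


def tensor_bytes(shape, bytes_per_element):
    """Bytes used by a dense tensor: number of elements times element size."""
    return prod(shape) * bytes_per_element


batch_shape = (32, 3, 224, 224)    # batch, channels (RGB), height, width
n_elements = prod(batch_shape)     # 32 * 3 * 224 * 224 = 4,816,896

for dtype, size in [("float32", 4), ("float16", 2), ("uint8", 1)]:
    nbytes = tensor_bytes(batch_shape, size)
    print(f"{dtype}: {nbytes:,} bytes ({nbytes / 2**20:.1f} MiB)")
```

In float32 the batch takes 19,267,584 bytes, a little over 18 MiB. Stored as raw uint8 pixels it is a quarter of that, which is why images are usually kept as uint8 until the last moment.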
Enter dimensions and see memory usage for different dtypes. Each colored bar shows bytes consumed.
Your computer's memory is one-dimensional — a long tape of bytes at consecutive addresses. But tensors are multi-dimensional. How does a 2D matrix live in a 1D tape? The answer is strides.
A stride tells you how many elements to skip in memory to move one step along each dimension. For a [3, 4] tensor stored row-by-row (the default, called row-major or C-contiguous order), the strides are (4, 1). That means: to move to the next row, skip 4 elements. To move to the next column, skip 1 element.
```python
import torch

x = torch.arange(12).reshape(3, 4)
print(x)
# tensor([[ 0,  1,  2,  3],
#         [ 4,  5,  6,  7],
#         [ 8,  9, 10, 11]])
print(x.stride())   # (4, 1)
# stride[0]=4: jump 4 elements to go from row i to row i+1
# stride[1]=1: jump 1 element to go from col j to col j+1

# Element at [i, j] lives at memory offset: i*stride[0] + j*stride[1]
# x[2, 1] is at offset 2*4 + 1*1 = 9 --> value is 9 ✓
```
Column-major (Fortran-contiguous) is the opposite: elements in the same column are contiguous. For a [3, 4] tensor stored column-major, strides would be (1, 3) — jump 1 to move down a row, jump 3 to move right a column. PyTorch defaults to row-major. NumPy lets you choose.
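To make the two layouts concrete, here is a minimal plain-Python sketch that derives strides from a shape and turns an index into a memory offset. The function names are invented for this example and are not part of PyTorch or NumPy.

```python
def row_major_strides(shape):
    """C order: the last dimension varies fastest."""
    strides, step = [], 1
    for size in reversed(shape):
        strides.append(step)
        step *= size
    return tuple(reversed(strides))


def column_major_strides(shape):
    """Fortran order: the first dimension varies fastest."""
    strides, step = [], 1
    for size in shape:
        strides.append(step)
        step *= size
    return tuple(strides)


def offset(index, strides):
    """Position of an element in flat memory, counted in elements."""
    return sum(i * s for i, s in zip(index, strides))


shape = (3, 4)
print(row_major_strides(shape))                       # (4, 1)
print(column_major_strides(shape))                    # (1, 3)
print(offset((2, 1), row_major_strides(shape)))       # 9
print(offset((2, 1), column_major_strides(shape)))    # 5
```

The same logical element `[2, 1]` sits at offset 9 in row-major order and offset 5 in column-major order; only the strides differ.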
Here's the key insight: transposing a tensor just swaps the strides. No data is copied.
```python
x = torch.arange(12).reshape(3, 4)
print(x.stride())          # (4, 1) -- row-major

y = x.t()                  # transpose
print(y.shape)             # torch.Size([4, 3])
print(y.stride())          # (1, 4) -- strides swapped!
print(y.is_contiguous())   # False

# Same underlying data, different view
print(x.data_ptr() == y.data_ptr())   # True -- same memory!
```
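The idea that a transpose is only a metadata change can be imitated without PyTorch. The toy `View` class below is a hypothetical stand-in, not how PyTorch is implemented internally: it keeps a reference to one flat Python list plus a shape, strides, and an offset.

```python
class View:
    """A window onto shared flat storage, described only by metadata."""

    def __init__(self, storage, shape, strides, offset=0):
        self.storage = storage    # shared flat list, never copied
        self.shape = shape
        self.strides = strides
        self.offset = offset

    def _position(self, index):
        return self.offset + sum(i * s for i, s in zip(index, self.strides))

    def __getitem__(self, index):
        return self.storage[self._position(index)]

    def __setitem__(self, index, value):
        self.storage[self._position(index)] = value

    def t(self):
        # 2D transpose: reverse shape and strides, keep the same storage.
        return View(self.storage, self.shape[::-1], self.strides[::-1], self.offset)


storage = list(range(12))
x = View(storage, (3, 4), (4, 1))
y = x.t()

y[1, 2] = 999        # write through the transposed view
print(x[2, 1])       # 999
print(y.strides)     # (1, 4)
```

Because `x` and `y` point at the same list, the write through `y` is visible through `x`, which is the behavior the `data_ptr()` comparison demonstrates for real tensors.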
Click any cell in the 2D grid to see which memory address it maps to. Change strides to see how indexing changes. Orange = selected cell. Teal = memory position.
This is why .contiguous() exists: it rearranges data for fast sequential access.

When you reshape, transpose, or slice a tensor in PyTorch, does it copy the data? Usually no. Most operations create a view: a new tensor object that points to the SAME underlying memory with different metadata (shape, strides, offset). This is fast (O(1) instead of O(n)) but dangerous: modifying the view modifies the original.
Operations that create views (shared memory):
```python
x = torch.arange(12).reshape(3, 4)

# ALL of these share memory with x:
a = x.view(6, 2)        # reshape (must be contiguous)
b = x.reshape(6, 2)     # reshape (may copy if non-contiguous)
c = x.t()               # transpose
d = x[0]                # indexing a row
e = x[:, 1]             # indexing a column
f = x.expand(5,3,4)     # broadcast expand (stride=0!)
g = x.permute(1,0)      # dimension reorder
h = x.unsqueeze(0)      # add dimension

# Proof: modify the view, original changes
a[0, 0] = 999
print(x[0, 0])          # 999 -- x changed too!
```
Operations that create copies (independent memory):
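```python
import torch

x = torch.arange(12).reshape(3, 4)

# These allocate NEW memory:
a = x.clone()             # explicit deep copy
b = x.contiguous()        # copy only if non-contiguous (x is contiguous, so no copy here)
c = x + 0                 # any arithmetic creates new tensor
d = x.to(torch.float16)   # dtype cast = new allocation
e = torch.tensor(x)       # new tensor from data (works, but PyTorch warns
                          # and recommends x.clone().detach() instead)
```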
.reshape() returns a view when possible but silently copies when it can't (non-contiguous input). .view() is strict: it raises an error if a view isn't possible. Use .view() when you NEED shared memory, .reshape() when you don't care.

The key question: when is a tensor non-contiguous? After any operation that changes strides without moving data:
```python
x = torch.arange(12).reshape(3, 4)
print(x.is_contiguous())   # True -- strides (4,1), data is sequential

y = x.t()                  # transpose: shape [4,3], strides (1,4)
print(y.is_contiguous())   # False!
# Why? For y to be contiguous with shape [4,3], strides must be (3,1)
# But y has strides (1,4) -- elements aren't sequential in memory

# Fix it:
z = y.contiguous()         # copies data into new sequential memory
print(z.stride())          # (3, 1) -- now contiguous
print(z.data_ptr() == x.data_ptr())   # False -- different memory
```
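The contiguity test itself is a small loop over shape and strides. The sketch below is a simplified plain-Python version with a made-up function name; it assumes row-major layout and skips size-1 dimensions, whose stride cannot affect the element order.

```python
def is_contiguous(shape, strides):
    """True if the strides describe a dense row-major layout for this shape."""
    expected = 1
    # Walk from the last dimension to the first.
    for size, stride in zip(reversed(shape), reversed(strides)):
        if size != 1 and stride != expected:
            return False
        expected *= size    # the next dimension out must skip this many elements
    return True


print(is_contiguous((3, 4), (4, 1)))   # True:  the original tensor
print(is_contiguous((4, 3), (1, 4)))   # False: the transposed view
print(is_contiguous((4, 3), (3, 1)))   # True:  after copying into row-major order
```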
Watch how operations affect memory sharing. Teal boxes share memory (view). Orange boxes have their own memory (copy). Click operations to apply them.
You have a [3, 4] matrix and want to add a [4] vector to each row. In raw code, you'd write a loop. But PyTorch does it automatically through broadcasting — the rules for stretching tensors to compatible shapes for element-wise operations.
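The broadcasting rules are simple but must be followed exactly: right-align the two shapes, pad the shorter one with 1s on the left, then compare dimension by dimension. Each pair must either match or contain a 1, and any dimension of size 1 stretches to match the other.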
```python
import torch

# Example 1: [3,4] + [4] --> works!
A = torch.ones(3, 4)   # shape [3, 4]
b = torch.arange(4)    # shape [4]
# Step 1: pad b to [1, 4]
# Step 2: dim 0: 3 vs 1 ✓ (1 stretches)   dim 1: 4 vs 4 ✓ (match)
# Result: [3, 4]
result = A + b         # b is added to each row

# Example 2: [3,1] + [1,4] --> [3,4]!
col = torch.tensor([[1],[2],[3]])      # shape [3, 1]
row = torch.tensor([[10,20,30,40]])    # shape [1, 4]
# dim 0: 3 vs 1 ✓   dim 1: 1 vs 4 ✓
result = col + row     # outer sum! shape [3, 4]
# [[11,21,31,41],
#  [12,22,32,42],
#  [13,23,33,43]]

# Example 3: [3,4] + [3] --> ERROR!
# Step 1: pad to [1, 3]
# Step 2: dim 1: 4 vs 3 -- neither is 1! FAIL
```
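The shape check in these examples can be written as a short function. This is a plain-Python sketch of the rule with an invented name (`broadcast_shape`); it only computes the result shape and does not touch any data.

```python
def broadcast_shape(a, b):
    """Result shape of broadcasting shapes a and b, or ValueError if incompatible."""
    ndim = max(len(a), len(b))
    # Right-align by padding the shorter shape with 1s on the left.
    a = (1,) * (ndim - len(a)) + tuple(a)
    b = (1,) * (ndim - len(b)) + tuple(b)

    result = []
    for da, db in zip(a, b):
        if da == db or db == 1:
            result.append(da)
        elif da == 1:
            result.append(db)
        else:
            raise ValueError(f"cannot broadcast: {da} vs {db}")
    return tuple(result)


print(broadcast_shape((3, 4), (4,)))      # (3, 4)
print(broadcast_shape((3, 1), (1, 4)))    # (3, 4)
print(broadcast_shape((3, 4), (3, 1)))    # (3, 4)

try:
    broadcast_shape((3, 4), (3,))         # Example 3: 4 vs 3, neither is 1
except ValueError as err:
    print("error:", err)
```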
To make Example 3 work, give the vector shape [3, 1] with vec.unsqueeze(1) or vec[:, None].

Broadcasting is lazy: it never actually copies data. Under the hood, it uses stride=0 for stretched dimensions. A [3, 1] tensor broadcast to [3, 4] has strides (1, 0), so moving along the stretched dimension doesn't advance in memory. Zero memory cost!
```python
x = torch.tensor([[1],[2],[3]])   # shape [3,1], strides (1,1)
y = x.expand(3, 4)                # shape [3,4], strides (1,0)!
print(y.stride())                 # (1, 0) -- stride 0 means "repeat this element"
print(y)
# tensor([[1, 1, 1, 1],
#         [2, 2, 2, 2],
#         [3, 3, 3, 3]])
# Only 3 elements in memory, but LOOKS like 12!
```
Watch how two tensors align and stretch for an element-wise add. Orange = tensor A. Teal = tensor B. Purple = result.
Without broadcasting, you would need explicit expand() calls or, worse, loops.

A dtype (data type) determines how each number is encoded in binary. More bits = more precision = more memory. The four dtypes you'll encounter in deep learning are:
| Dtype | Bits | Bytes | Range | Precision | Use Case |
|---|---|---|---|---|---|
| float32 | 32 | 4 | $\pm 3.4 \times 10^{38}$ | ~7 decimal digits | Default training |
| float16 | 16 | 2 | ±65,504 | ~3 decimal digits | Mixed precision fwd |
| bfloat16 | 16 | 2 | $\pm 3.4 \times 10^{38}$ | ~2 decimal digits | Training (Google TPU) |
| int8 | 8 | 1 | -128 to 127 | Exact integers | Quantized inference |
The floating-point format is: $(-1)^{\text{sign}} \times 2^{\text{exponent}} \times (1 + \text{mantissa})$. More exponent bits = bigger range. More mantissa bits = finer granularity between representable numbers.
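To see the three fields in a real number, the sketch below pulls apart the bits of a float32 using only the standard library. It assumes the IEEE 754 single-precision layout (1 sign bit, 8 exponent bits stored with a bias of 127, 23 mantissa bits) and only handles normal numbers; the function names are made up for illustration.

```python
import struct


def decode_float32(x):
    """Split a float32 into its raw (sign, exponent, mantissa) bit fields."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    sign = bits >> 31                  # 1 bit
    exponent = (bits >> 23) & 0xFF     # 8 bits, stored with a bias of 127
    mantissa = bits & 0x7FFFFF         # 23 bits, the fraction after the implicit 1
    return sign, exponent, mantissa


def rebuild(sign, exponent, mantissa):
    """Apply (-1)^sign * 2^(exponent - 127) * (1 + mantissa / 2^23)."""
    return (-1) ** sign * 2.0 ** (exponent - 127) * (1 + mantissa / 2**23)


fields = decode_float32(6.5)
print(fields)              # (0, 129, 5242880): 6.5 = +1.625 * 2^2
print(rebuild(*fields))    # 6.5
```

float16 and bfloat16 use the same scheme with fewer bits: 5 exponent and 10 mantissa bits for float16, 8 exponent and 7 mantissa bits for bfloat16.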
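```python
import torch

# Dtype casting
x = torch.randn(3, 3)      # float32 by default
x16 = x.half()             # float16
xbf = x.bfloat16()         # bfloat16
x8 = x.to(torch.int8)      # int8 (truncates!)

# float16 overflow danger:
big = torch.tensor(70000.0)
print(big.half())          # tensor(inf, dtype=torch.float16) -- overflow! 70000 > 65504
print(big.bfloat16())      # tensor(70144., dtype=torch.bfloat16) -- in range, but rounded

# Mixed precision training pattern (needs a CUDA GPU).
# The tiny model and random data are stand-ins so the pattern runs end to end.
if torch.cuda.is_available():
    model = torch.nn.Linear(16, 4).cuda()        # master weights stay in float32
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    criterion = torch.nn.MSELoss()
    scaler = torch.cuda.amp.GradScaler()         # scales the loss so float16 grads don't underflow
    inputs = torch.randn(8, 16, device='cuda')
    target = torch.randn(8, 4, device='cuda')

    with torch.autocast(device_type='cuda', dtype=torch.float16):
        # Forward pass: autocast runs matmuls in float16 (no manual .half() needed)
        output = model(inputs)
        loss = criterion(output, target)

    # Backward on the scaled loss; gradients are stored in float32 like the weights
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```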
The same sine wave stored at different precisions. Watch where float16 clips (amplitude > 65504) and where int8 staircase appears.
Notice: at amplitude=100, all dtypes look the same. At amplitude=70000, float16 clips to infinity while bfloat16 tracks correctly (with visible stairstepping from low precision). At any amplitude, int8 looks like a staircase because it can only represent 256 distinct values.
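The different behavior at 70000 can be checked with the standard library alone. The sketch below assumes that bfloat16 is the top 16 bits of a float32 (rounded to nearest, ties to even) and uses Python's `struct` half-precision format for float16. The helper names are hypothetical, and the bfloat16 helper ignores NaN and infinity handling.

```python
import struct


def to_bfloat16(x):
    """Round a float32 value to the nearest bfloat16 (normal, finite inputs only)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest, ties to even, on the 16 low bits that get dropped.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = ((bits + rounding_bias) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def fits_float16(x):
    """True if x can be stored as a finite float16."""
    try:
        struct.pack("<e", x)    # 'e' is IEEE 754 half precision
        return True
    except (OverflowError, struct.error):
        return False


print(fits_float16(65504.0))    # True: the largest finite float16
print(fits_float16(70000.0))    # False: overflows
print(to_bfloat16(70000.0))     # 70144.0: in range, but snapped to a coarse grid
```

Near 70000, neighboring bfloat16 values are 512 apart, which is the stairstepping visible in the plot.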
Your CPU has RAM (system memory). Your GPU has VRAM (video memory). These are physically separate pools of memory connected by a PCIe bus (or NVLink on high-end systems). Moving data between them is one of the most common performance bottlenecks in deep learning.
A PCIe 4.0 x16 bus transfers at ~25 GB/s. Sounds fast? An A100 GPU computes matmul at 312 TFLOPS. For a [4096, 4096] float16 matrix (32 MB), transfer takes about 1.3 ms, while multiplying two such matrices (2 × 4096³ ≈ 137 GFLOPs) takes about 0.44 ms at peak throughput. Moving a single operand takes roughly 3x longer than the computation itself.
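A back-of-the-envelope version of this comparison is sketched below. It assumes the bandwidth and peak-throughput figures quoted above, counts a square matmul as 2n³ floating-point operations, and ignores latency and real-world efficiency, so treat the output as an order-of-magnitude estimate.

```python
PCIE_BYTES_PER_S = 25e9    # ~25 GB/s, PCIe 4.0 x16 (figure quoted above)
PEAK_FLOPS = 312e12        # A100 float16 matmul peak (figure quoted above)

n = 4096
matrix_bytes = n * n * 2                       # float16 = 2 bytes per element
transfer_ms = matrix_bytes / PCIE_BYTES_PER_S * 1e3

matmul_flops = 2 * n**3                        # one multiply + one add per inner-product term
compute_ms = matmul_flops / PEAK_FLOPS * 1e3

print(f"matrix size: {matrix_bytes / 2**20:.0f} MiB")
print(f"transfer:    {transfer_ms:.2f} ms")
print(f"matmul:      {compute_ms:.2f} ms")
print(f"ratio:       {transfer_ms / compute_ms:.1f}x")
```

Under these assumptions a single host-to-device copy costs several times more than the multiplication it feeds, which is why keeping data resident on the GPU matters.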
```python
import torch

# Move tensor to GPU
x = torch.randn(1000, 1000)
x_gpu = x.to('cuda')      # copies to GPU memory
x_gpu = x.cuda()          # same thing
x_gpu = x.to('cuda:0')    # specific GPU

# Create directly on GPU (no transfer!)
y = torch.randn(1000, 1000, device='cuda')

# Move back to CPU
x_cpu = x_gpu.cpu()
x_cpu = x_gpu.to('cpu')

# DANGER: can't mix devices!
# x + x_gpu --> RuntimeError: tensors on different devices
```
Every .to('cuda') and .cpu() call triggers a synchronous transfer. If you do this inside a training loop (e.g., moving labels to GPU each batch), you stall the GPU waiting for data. Pre-load everything to GPU ONCE, or use pin_memory=True in your DataLoader.

Pinned memory is a special CPU memory region that the GPU can access directly via DMA (Direct Memory Access). Normal CPU memory might be swapped to disk, so the driver must first copy it to a pinned staging area. If YOU pin it upfront, the driver skips that step:
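```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset so the example runs: 256 random samples with integer labels
dataset = TensorDataset(torch.randn(256, 10), torch.randint(0, 2, (256,)))

if torch.cuda.is_available():   # pinning and 'cuda' transfers need a GPU
    # Pinned memory for faster transfers
    x = torch.randn(1000, 1000).pin_memory()
    x_gpu = x.to('cuda', non_blocking=True)   # async transfer!

    # DataLoader with pinned memory
    loader = DataLoader(dataset, batch_size=32,
                        pin_memory=True,   # allocate batches in pinned RAM
                        num_workers=4)     # parallel data loading

    # In training loop:
    for batch, labels in loader:
        batch = batch.to('cuda', non_blocking=True)
        labels = labels.to('cuda', non_blocking=True)
        # GPU compute overlaps with next batch's transfer
```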
See where time is spent: transfer vs compute. Increase tensor size to see transfer dominate. Enable pinned memory to see overlap.
Why create a tensor with device='cuda' instead of creating on CPU and calling .cuda()?

Time to put it all together. This interactive playground lets you create tensors, apply operations, and see exactly what happens to shape, strides, memory, and contiguity at each step. Think of it as a visual debugger for tensor operations.
Build a chain of tensor operations. Watch shape, strides, contiguity, and memory change with each step.
Experiments to try in the playground:
The playground shows you the exact internal state PyTorch maintains for every tensor: a data pointer, shape tuple, strides tuple, dtype, and device. Every operation is just a transformation of this metadata — or, for copies, allocation of new memory + metadata.
Dense tensors waste memory when most elements are zero. A [10000, 10000] matrix with only 1% non-zero values stores 100 million floats — but only 1 million matter. Sparse tensors store only the non-zero elements and their coordinates.
The two sparse formats you'll meet most often in PyTorch are:
| Format | Storage | Best For | Memory |
|---|---|---|---|
| COO (Coordinate) | List of (row, col, value) tuples | Construction, random access | 3 × nnz |
| CSR (Compressed Sparse Row) | Row pointers + col indices + values | Row-wise operations, matmul | 2 × nnz + nrows |
```python
import torch

# COO format: specify indices and values
indices = torch.tensor([[0, 1, 2],    # row indices
                        [2, 0, 1]])   # col indices
values = torch.tensor([3.0, 4.0, 5.0])
sparse = torch.sparse_coo_tensor(indices, values, (3, 3))

# Dense version would use 9 floats = 36 bytes
# Sparse uses 6 int64 indices (48 bytes) + 3 floats (12 bytes) = 60 bytes
# Sparse loses here! At 20 bytes per non-zero vs 4 bytes per dense element,
# 2D COO only wins when sparsity > ~80%

# CSR format (better for computation)
crow = torch.tensor([0, 1, 2, 3])   # row pointers
col = torch.tensor([2, 0, 1])       # column indices
vals = torch.tensor([3.0, 4.0, 5.0])
csr = torch.sparse_csr_tensor(crow, col, vals, (3, 3))
```
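The break-even point depends on how many bytes each index takes. The estimate below is a plain-Python sketch that assumes float32 values and int64 indices (8 bytes each) and counts only the index and value arrays, ignoring any per-tensor overhead. The function names are made up for illustration.

```python
def dense_bytes(rows, cols, value_bytes=4):
    return rows * cols * value_bytes


def coo_bytes(nnz, ndim=2, index_bytes=8, value_bytes=4):
    # One index per dimension plus one value for every non-zero.
    return nnz * (ndim * index_bytes + value_bytes)


def csr_bytes(nnz, rows, index_bytes=8, value_bytes=4):
    # rows + 1 row pointers, then one column index and one value per non-zero.
    return (rows + 1) * index_bytes + nnz * (index_bytes + value_bytes)


rows = cols = 10_000
nnz = rows * cols // 100    # 1% non-zero

print(f"dense: {dense_bytes(rows, cols):,} bytes")
print(f"COO:   {coo_bytes(nnz):,} bytes")
print(f"CSR:   {csr_bytes(nnz, rows):,} bytes")
```

For the 1%-dense [10000, 10000] matrix this gives 400 MB dense against 20 MB in COO and about 12 MB in CSR. For the tiny 3×3 example, `coo_bytes(3)` is 60 bytes against 36 bytes dense, so sparse storage only pays off once most entries are zero.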
Quantized tensors represent floating-point values using integers plus a scale factor. The idea: if your weights range from -1.5 to +1.5, you can map this to int8 (-128 to +127) with scale = 3.0/255:
```python
import torch

# Manual quantization
x = torch.randn(4, 4)   # float32 weights
scale = (x.max() - x.min()) / 255
zero_point = int((-x.min() / scale).round())
q = (x / scale + zero_point).round().clamp(0, 255).to(torch.uint8)

# Dequantize
x_approx = (q.float() - zero_point) * scale
print((x - x_approx).abs().max())   # max error ~ scale/2

# PyTorch quantization API
# With scale=0.01 and zero_point=128, values outside [-1.28, 1.27] saturate
x_q = torch.quantize_per_tensor(x, scale=0.01, zero_point=128, dtype=torch.quint8)
print(x_q.int_repr())               # the underlying uint8 values
```
See how memory usage compares between dense and sparse formats as sparsity increases. The crossover point is where sparse wins.
You now understand tensors at the byte level — shape, dtype, device, strides, views, copies, broadcasting, and memory formats. This is the foundation every PyTorch operation is built on. Let's consolidate with reference tables and connections to the broader ML engineering stack.
| Function | Description | Notes |
|---|---|---|
| `torch.zeros()` | All zeros | Safe initialization |
| `torch.ones()` | All ones | Mask initialization |
| `torch.randn()` | Normal N(0,1) | Weight initialization |
| `torch.rand()` | Uniform [0,1) | Dropout masks |
| `torch.empty()` | Uninitialized | Fastest (but dangerous!) |
| `torch.arange()` | Sequence | Position encodings |
| `torch.eye()` | Identity matrix | Initializations |
| `torch.from_numpy()` | From numpy | SHARES memory! |
| `torch.tensor()` | From data | Always copies |
| Rule | Details |
|---|---|
| View operations | reshape (when possible), view, transpose, permute, slice, unsqueeze, expand |
| Copy operations | clone, contiguous (if needed), to(dtype), arithmetic ops |
| Contiguous check | Each stride must equal the product of the sizes of all later dimensions |
| Stride=0 means | Broadcast expansion (no extra memory) |
| Memory = elements × bytes_per_element | Holds for dense, contiguous tensors. Expanded (stride=0) views and sparse tensors use less. |
| Step | Rule |
|---|---|
| 1. Align | Right-align shapes, pad shorter with 1s on left |
| 2. Compare | Each dim must match OR be 1 |
| 3. Expand | Dims of size 1 stretch to match (stride=0, no copy) |
Autograd records the operations applied to a tensor in a computation graph (when requires_grad=True). When you call .backward(), PyTorch traverses this graph to compute gradients. Understanding tensor memory is a prerequisite to understanding autograd: gradients are tensors too, and they double the memory used by your parameters.

| Topic | Builds On | Where |
|---|---|---|
| Autograd & Backprop | Views, memory, dtype | Next lesson |
| Mixed Precision Training | float16, bfloat16, GPU transfer | Training loop lesson |
| Model Parallelism | Device placement, memory budgets | Distributed lesson |
| Quantization (GPTQ, AWQ) | int8, per-channel scaling | Inference lesson |
| Flash Attention | Memory layout, contiguity | Transformer lesson |