How images and text merge into a single intelligence — from the pixel-level mechanics of vision transformers to CLIP's contrastive magic, multimodal fusion architectures, and the spatial grounding that lets models see and reason. Built for engineers who want to understand, not just use.

Vision Foundations & Image Representations
How images become tensors — pixels, channels, convolutional feature hierarchies, receptive fields, feature pyramids, and the representations that vision models learn.
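A minimal sketch of that pipeline, assuming PyTorch: a random RGB image stands in for real pixel data, and three stride-2 convolution stages show how spatial resolution shrinks while channel depth and receptive field grow, producing the multi-scale maps a feature pyramid is built from.

```python
# Sketch (PyTorch assumed): an RGB image as a tensor, plus a tiny conv feature hierarchy.
import torch
import torch.nn as nn

# A fake 224x224 RGB image: values in [0, 1], layout (channels, height, width).
image = torch.rand(3, 224, 224)
batch = image.unsqueeze(0)  # models expect a batch dimension: (N, C, H, W)

# Three conv stages; each stride-2 stage halves spatial resolution and widens channels,
# so deeper feature maps cover larger receptive fields with coarser spatial detail.
stages = nn.ModuleList([
    nn.Sequential(nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU()),
    nn.Sequential(nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU()),
    nn.Sequential(nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU()),
])

pyramid = []  # multi-scale features, the raw material for a feature pyramid
x = batch
for stage in stages:
    x = stage(x)
    pyramid.append(x)

for level, feat in enumerate(pyramid):
    print(f"level {level}: {tuple(feat.shape)}")  # (1, 32, 112, 112) ... (1, 128, 28, 28)
```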
Patch embeddings, the ViT architecture, CLS tokens vs mean pooling, DINOv2's self-supervised features, SigLIP, and how vision transformers scale.
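A sketch of the patch-embedding step and the two pooling choices, assuming PyTorch; the 224/16/768 numbers are simply the standard ViT-B/16 configuration, and the transformer encoder blocks themselves are elided.

```python
# Sketch (PyTorch assumed): ViT-style patch embedding, then CLS-token vs mean pooling.
import torch
import torch.nn as nn

image_size, patch_size, dim = 224, 16, 768
num_patches = (image_size // patch_size) ** 2  # 14 * 14 = 196 patches

# Patch embedding is just a strided convolution: one linear projection per 16x16 patch.
patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
cls_token = nn.Parameter(torch.zeros(1, 1, dim))
pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))

images = torch.rand(2, 3, image_size, image_size)
x = patch_embed(images)                   # (2, 768, 14, 14)
x = x.flatten(2).transpose(1, 2)          # (2, 196, 768): a sequence of patch tokens
x = torch.cat([cls_token.expand(2, -1, -1), x], dim=1) + pos_embed  # prepend CLS

# ... transformer encoder blocks would run here ...

cls_embedding = x[:, 0]                   # option 1: the CLS token summarizes the image
mean_embedding = x[:, 1:].mean(dim=1)     # option 2: mean-pool the patch tokens
print(cls_embedding.shape, mean_embedding.shape)  # torch.Size([2, 768]) for both
```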
Contrastive objectives, the InfoNCE loss, CLIP's dual-encoder training on 400M image-text pairs, zero-shot classification, and the shared embedding space.
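The symmetric InfoNCE loss is compact enough to show directly. This PyTorch sketch stubs out both encoders with random features and uses 1/0.07 as the temperature, the value CLIP initializes its learned scale to.

```python
# Sketch (PyTorch assumed): CLIP's symmetric InfoNCE objective over a batch of pairs.
import torch
import torch.nn.functional as F

batch, dim = 8, 512
image_feats = F.normalize(torch.randn(batch, dim), dim=-1)  # image-encoder outputs (stubbed)
text_feats = F.normalize(torch.randn(batch, dim), dim=-1)   # text-encoder outputs (stubbed)
logit_scale = 1 / 0.07  # CLIP learns this scale; 1/0.07 is its initial value

# Cosine-similarity logits: row i should match column i, the true image-text pair.
logits_per_image = logit_scale * image_feats @ text_feats.t()
logits_per_text = logits_per_image.t()

targets = torch.arange(batch)
loss = (F.cross_entropy(logits_per_image, targets) +
        F.cross_entropy(logits_per_text, targets)) / 2
print(loss.item())
```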
How vision connects to language — cross-attention, projection layers, early vs late fusion, Flamingo's gated cross-attention, and the LLaVA connector designs.
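A sketch of the projection-layer route, assuming PyTorch; the dimensions are illustrative (CLIP ViT-L/14-sized features into a 4096-dim LLM), and Flamingo's cross-attention alternative is noted in a comment.

```python
# Sketch (PyTorch assumed): a LLaVA-style connector. An MLP projects frozen vision-encoder
# patch features into the LLM's token-embedding space so image tokens can be spliced
# directly into the text sequence (late fusion). Dimensions are illustrative.
import torch
import torch.nn as nn

vision_dim, llm_dim, num_patches = 1024, 4096, 576

projector = nn.Sequential(            # LLaVA-1.5 uses a two-layer MLP like this
    nn.Linear(vision_dim, llm_dim),
    nn.GELU(),
    nn.Linear(llm_dim, llm_dim),
)

patch_feats = torch.randn(1, num_patches, vision_dim)       # from a frozen CLIP/SigLIP ViT
visual_tokens = projector(patch_feats)                      # (1, 576, 4096)

text_tokens = torch.randn(1, 32, llm_dim)                   # embedded prompt tokens
llm_input = torch.cat([visual_tokens, text_tokens], dim=1)  # concatenate, feed the LLM
print(llm_input.shape)

# Flamingo instead keeps modalities separate and injects vision through gated
# cross-attention layers inside the LLM, with a tanh gate initialized at zero.
```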
LLaVA's two-stage recipe, visual conversation formats, instruction-following datasets, GPT-4V-assisted data generation, and the leap from classification to chat.
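For concreteness, here is an illustrative instruction-tuning record in the conversation format LLaVA popularized; treat the exact field names and file path as assumptions rather than the canonical schema.

```python
# An illustrative LLaVA-style instruction-tuning record: a multi-turn conversation tied to
# one image via an <image> placeholder. Field names and the path are assumptions.
sample = {
    "id": "000000123456",
    "image": "coco/train2017/000000123456.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is unusual about this scene?"},
        {"from": "gpt", "value": "A man is ironing clothes on a board attached to the roof of a moving taxi."},
        {"from": "human", "value": "What risks does this pose?"},
        {"from": "gpt", "value": "He could fall, drop the iron, or distract other drivers."},
    ],
}

# Stage 1 (alignment): train only the projector on image-caption pairs.
# Stage 2 (instruction tuning): unfreeze the LLM and train on conversations like the above.
```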
Multi-stage pre-training, data curation and filtering, resolution scaling, interleaved image-text training, RLHF for VLMs, and scaling laws for multimodal models.
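One way to make the staging concrete is a plain config sketch; every stage name, data mix, and hyperparameter below is a hypothetical example, not a recipe from any particular model.

```python
# A hypothetical multi-stage VLM pre-training schedule as plain config dicts; all values
# are illustrative assumptions, not drawn from any single published recipe.
stages = [
    {
        "name": "alignment",                 # connector-only warm-up on captions
        "trainable": ["projector"],
        "data": {"image_text_pairs": 1.0},
        "image_resolution": 224,
        "lr": 1e-3,
    },
    {
        "name": "multimodal_pretrain",       # interleaved image-text documents
        "trainable": ["projector", "llm"],
        "data": {"interleaved_docs": 0.6, "image_text_pairs": 0.3, "ocr": 0.1},
        "image_resolution": 336,             # resolution scaled up in later stages
        "lr": 2e-5,
    },
    {
        "name": "sft_and_rlhf",              # instruction tuning, then preference optimization
        "trainable": ["projector", "llm"],
        "data": {"instructions": 0.8, "preference_pairs": 0.2},
        "image_resolution": 672,
        "lr": 1e-5,
    },
]

for stage in stages:
    print(stage["name"], stage["image_resolution"], sorted(stage["data"]))
```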
Referring expressions, bounding box prediction, object grounding, spatial relationship understanding, region-level features, and visually-grounded question answering.
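A common convention is to serialize boxes as text with coordinates normalized to a fixed grid; the sketch below assumes a 0-1000 grid and a `<box>` tag template, both of which vary from model to model.

```python
# Sketch of one grounding convention: emit bounding boxes as text with coordinates
# normalized to a 0-1000 grid. The exact tag format is an assumption for illustration.
def box_to_text(box, width, height, bins=1000):
    """Convert a pixel-space (x1, y1, x2, y2) box to a normalized text span."""
    x1, y1, x2, y2 = box
    nx1, ny1 = round(x1 / width * bins), round(y1 / height * bins)
    nx2, ny2 = round(x2 / width * bins), round(y2 / height * bins)
    return f"<box>({nx1},{ny1}),({nx2},{ny2})</box>"

# Referring expression: "the dog on the left" answered with a grounded span.
image_size = (640, 480)
dog_box = (32, 210, 290, 460)
print("the dog on the left " + box_to_text(dog_box, *image_size))
# the dog on the left <box>(50,438),(453,958)</box>
```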
Document and chart understanding, video VLMs, GUI agents, medical imaging, OCR and structured extraction, and the path toward unified multimodal intelligence.
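As a small example of structured extraction, a prompt can pin the model to a JSON schema; the schema and wording here are illustrative assumptions, not any specific model's API.

```python
# An illustrative structured-extraction prompt for a document-understanding VLM: the model
# reads an invoice image and returns fields as JSON. Schema and wording are assumptions.
import json

schema = {
    "vendor": "string",
    "invoice_number": "string",
    "date": "YYYY-MM-DD",
    "line_items": [{"description": "string", "quantity": "int", "unit_price": "float"}],
    "total": "float",
}

prompt = (
    "Read the attached invoice image and return only JSON matching this schema:\n"
    + json.dumps(schema, indent=2)
)
print(prompt)
```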