EE269 Lecture 26 — Mert Pilanci, Stanford

Transformers for Signal Prediction

Turn a continuous waveform into tokens, then let GPT-style autoregression predict the future.

Prerequisites: Attention & Transformers (Lecture 25) + Quantization basics. That's it.
8 Chapters · 8+ Simulations · 0 Assumed Knowledge

Chapter 0: Why Predict Signals?

You are monitoring a patient's heart rate. The sensor gives you a waveform — a continuous signal over time. Can you predict what happens next? If you could, you'd catch arrhythmias before they happen. You could compress audio by transmitting only the parts the predictor gets wrong. You could denoise by favoring likely continuations over noisy ones.

Signal prediction is one of the oldest problems in engineering. Traditional approaches — linear predictive coding, Wiener filters, ARMA models — assume the signal is generated by a linear process. They work beautifully on speech and simple time series. But what about signals with complex, nonlinear, long-range dependencies?

Here's the twist: GPT predicts the next word in a sentence. A word is just a token — a discrete symbol from a vocabulary. What if we turned our continuous signal into discrete tokens? Then we could use the exact same transformer architecture that powers ChatGPT to predict signals.

The core idea: Quantize a continuous signal into discrete levels, treat each level as a "word," and train a GPT-style transformer to predict the next token. The transformer learns the statistical structure of the signal — not from hand-crafted equations, but from data.

In this lesson we'll build every piece: how to turn a signal into tokens, why causal masking matters, how positional encoding gives the model a sense of time, and how teacher forcing trains the whole thing efficiently. The payoff: you'll draw a signal and watch a transformer predict its continuation.

The Prediction Problem

A noisy signal (orange) with its clean continuation (teal). Can we learn to predict the future from the past? Click New Signal to see different patterns.

Why can't we directly apply GPT to a continuous analog signal?

Chapter 1: Signal Tokenization

In NLP, tokenization splits "the cat sat" into ["the", "cat", "sat"] — discrete symbols from a vocabulary of ~50,000 words. For signals, we need to do something analogous: convert a continuous amplitude into a discrete token ID from a finite vocabulary of B levels.

Uniform Quantization

The simplest approach is uniform quantization. Given a signal with minimum value xmin and maximum value xmax, we divide the range into B evenly spaced levels. The step size (distance between adjacent levels) is:

Δ = (xmax − xmin) / (B − 1)

To quantize a sample x, we find the nearest level:

token(x) = round( (x − xmin) / Δ )

This gives us an integer in {0, 1, ..., B−1} — exactly like a word index in a vocabulary of size B. To reconstruct the signal value from a token:

x̂ = xmin + token · Δ

Vocabulary size = quantization resolution. With B = 256 levels (8-bit), the discretization is fine enough for many signals (for comparison, CD audio uses 16 bits, or 65,536 levels). With B = 16, the "words" are coarse and the reconstructed signal sounds choppy. The transformer's vocabulary size is literally the number of quantization levels. More levels = more faithful signal, but also a larger softmax over the vocabulary at each prediction step.

Worked Example

Signal samples: [0.2, 0.7, 0.4, 0.9, 0.1]. With B = 8 levels over [0, 1]:

Δ = (1 − 0) / (8 − 1) = 1/7 ≈ 0.143
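Sample x | (x − xmin) / Δ | Token (round) | Reconstructed x̂ | Error (x̂ − x)
-------- | -------------- | ------------- | ---------------- | -------------
0.200    | 1.40           | 1             | 0.143            | −0.057
0.700    | 4.90           | 5             | 0.714            | +0.014
0.400    | 2.80           | 3             | 0.429            | +0.029
0.900    | 6.30           | 6             | 0.857            | −0.043
0.100    | 0.70           | 1             | 0.143            | +0.043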

Notice that every error has magnitude at most Δ/2 ≈ 0.071 (the largest one here is 0.057). This bound sets the quantization noise floor. More levels means smaller Δ, less noise, but a larger token vocabulary for the transformer to handle.
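
The worked example can be reproduced in a few lines. This is a minimal sketch in plain Python; the function names `quantize` and `dequantize` and the clamping of out-of-range samples are our own choices for illustration, not something the lecture specifies.

```python
def quantize(x, x_min, x_max, num_levels):
    """Map a sample to the index of the nearest of num_levels uniform levels."""
    step = (x_max - x_min) / (num_levels - 1)
    # Note: Python's round() sends exact .5 ties to the nearest even integer.
    token = round((x - x_min) / step)
    # Assumption: clamp so samples outside [x_min, x_max] still get a valid token.
    return max(0, min(num_levels - 1, token))


def dequantize(token, x_min, x_max, num_levels):
    """Reconstruct the amplitude that a token stands for."""
    step = (x_max - x_min) / (num_levels - 1)
    return x_min + token * step


samples = [0.2, 0.7, 0.4, 0.9, 0.1]
tokens = [quantize(x, 0.0, 1.0, 8) for x in samples]
recon = [dequantize(t, 0.0, 1.0, 8) for t in tokens]
errors = [r - x for r, x in zip(recon, samples)]

print(tokens)  # [1, 5, 3, 6, 1], matching the table
print([round(e, 3) for e in errors])
```

Every entry of `errors` stays within Δ/2 of zero, which is the bound discussed above.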

Signal Tokenization Explorer

Adjust the vocabulary size B to see how quantization resolution affects the tokenized signal. The staircase is the reconstructed signal from tokens.

Think of it this way: a song is continuous air pressure. A CD player quantizes it to 65,536 levels (16-bit). We're doing the same thing, but with far fewer levels, and then treating the resulting integers as a "language" for the transformer to learn.

With B = 32 levels over the range [−1, 1], what is the quantization step size Δ?

Chapter 2: Causal Masking

In the previous lecture, we learned that self-attention lets every token attend to every other token. But there's a problem for prediction: if we want to predict token 5, we can't let the model see token 5 during training. That would be cheating — predicting the future using the future.

The solution is a causal mask (also called an autoregressive mask). It's a simple rule: token at position i can only attend to tokens at positions ≤ i. Position 3 sees tokens 0, 1, 2, 3. It cannot see tokens 4, 5, 6, ...

How It Works

The attention score matrix is n × n, where entry (i, j) measures how much token i attends to token j. The causal mask sets all entries where j > i to −∞ before the softmax. Since e^(−∞) = 0, those positions get zero attention weight.

Mask(i, j) = 0 if j ≤ i,  −∞ if j > i
Attention = softmax( (Q K^T + Mask) / √d_k ) · V

The result is a lower-triangular attention matrix. Each row sums to 1 (from softmax), but all weight is concentrated on current and past tokens.
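
To see the lower-triangular structure concretely, here is a small single-head sketch in plain Python. It uses nested lists instead of a tensor library, and the toy Q, K, V values are arbitrary numbers chosen for illustration.

```python
import math


def causal_mask(n):
    """n x n additive mask: 0 where j <= i, -inf where j > i."""
    return [[0.0 if j <= i else -math.inf for j in range(n)] for i in range(n)]


def softmax(row):
    """Numerically stable softmax; math.exp(-inf) evaluates to 0.0."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(Q, K, V):
    """Single-head attention: softmax((Q K^T + Mask) / sqrt(d_k)) V."""
    n, d_k = len(Q), len(Q[0])
    mask = causal_mask(n)
    weights = []
    for i in range(n):
        scores = [
            (sum(q * k for q, k in zip(Q[i], K[j])) + mask[i][j]) / math.sqrt(d_k)
            for j in range(n)
        ]
        weights.append(softmax(scores))
    out = [
        [sum(weights[i][j] * V[j][c] for j in range(n)) for c in range(len(V[0]))]
        for i in range(n)
    ]
    return out, weights


# Toy inputs (arbitrary values): 4 positions, d_k = 2.
Q = K = V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
out, W = causal_attention(Q, K, V)
for row in W:
    print([round(w, 3) for w in row])
```

Each printed row sums to 1 and has zeros to the right of the diagonal. Position 0 can only attend to itself, so its output is exactly V[0].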

Causal masking is what makes a transformer autoregressive. Without the mask, it's a bidirectional encoder (like BERT) — good for understanding, but can't generate. With the mask, it's a decoder (like GPT) — trained to predict what comes next, one token at a time. For signal prediction, we always use the causal mask.

Why Not Just Use Past Tokens as Input?

You might wonder: why not just feed the model tokens [0, ..., i−1] when predicting token i? The causal mask is clever because it lets us compute all predictions simultaneously during training. Position 0 predicts token 1. Position 1 predicts token 2. Position n−1 predicts token n. All in one forward pass. This is called teacher forcing (we'll explore it in Chapter 4).

Causal Mask Visualizer

Click a token position to see what it can attend to. Bright cells = allowed, dark cells = masked (−∞). Compare with the full (unmasked) attention.

In a causally masked transformer with 6 tokens, how many attention entries (out of 36) are non-zero?

Chapter 3: Positional Encoding

Self-attention is permutation-equivariant: shuffle the input tokens, and the output shuffles the same way. The attention mechanism doesn't know that token 3 comes after token 2 — it treats the sequence like a set. For NLP, word order matters ("dog bites man" ≠ "man bites dog"). For signals, temporal order is everything.

The fix is to add a positional encoding to each token embedding. The model receives: (what this token is) + (where this token is in time).

Sinusoidal Encoding

The original Transformer paper (Vaswani et al., 2017) uses sine and cosine functions at different frequencies:

PE(pos, 2k) = sin(pos / 10000^(2k/d))
PE(pos, 2k+1) = cos(pos / 10000^(2k/d))

where pos is the position index and k is the dimension index. Each dimension oscillates at a different frequency. Low dimensions change rapidly (high frequency); high dimensions change slowly (low frequency). Together, they give each position a unique fingerprint.

Why sines and cosines? Two reasons. First, each position gets a unique vector: no two positions share the same encoding. Second, relative positions can be expressed as linear transformations of these encodings: PE(pos + m) is obtained from PE(pos) by rotating each (sin, cos) pair by an angle that depends only on the offset m and that pair's frequency, not on pos. This means the model can learn to attend to "3 steps ago" regardless of absolute position.
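
The rotation property is easy to check numerically. The sketch below computes the encoding for one position and then shifts it by rotating each (sin, cos) pair. The helper name `shift` is ours, and d is assumed to be even.

```python
import math


def positional_encoding(pos, d):
    """Sinusoidal encoding of one position; d is assumed even."""
    pe = []
    for k in range(d // 2):
        freq = 1.0 / (10000 ** (2 * k / d))
        pe.append(math.sin(pos * freq))  # dimension 2k
        pe.append(math.cos(pos * freq))  # dimension 2k + 1
    return pe


def shift(pe, offset, d):
    """Rotate each (sin, cos) pair by offset * freq, giving PE(pos + offset)."""
    out = []
    for k in range(d // 2):
        angle = offset / (10000 ** (2 * k / d))
        s, c = pe[2 * k], pe[2 * k + 1]
        out.append(s * math.cos(angle) + c * math.sin(angle))  # sin(a + b)
        out.append(c * math.cos(angle) - s * math.sin(angle))  # cos(a + b)
    return out


d = 8
# The same rotation (offset 3) works from any starting position.
gap_a = max(abs(u - v) for u, v in zip(shift(positional_encoding(5, d), 3, d),
                                       positional_encoding(8, d)))
gap_b = max(abs(u - v) for u, v in zip(shift(positional_encoding(20, d), 3, d),
                                       positional_encoding(23, d)))
print(gap_a, gap_b)  # both are at floating-point rounding level
```

Note that `shift` never looks at the absolute position, only at the offset. That is the sense in which "3 steps ago" is the same operation everywhere in the sequence.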

For Signals: Position = Time Step

In NLP, position 0 is the first word. In signal prediction, position 0 is the first sample (time t=0). Position 1 is the sample at t = Ts (one sampling period later). The positional encoding gives the transformer a clock — it knows not just what each token value is, but when in the sequence it occurred.

Signal sample x[n]
Continuous value at time n
↓ quantize
Token ID ∈ {0, ..., B−1}
Discrete vocabulary index
↓ embedding lookup
Token embedding ∈ R^d
Learned dense vector
↓ + PE(n)
Input to transformer ∈ R^d
What + Where
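
The pipeline above can be sketched end to end. In this illustration the embedding table is filled with random numbers as a stand-in for learned weights, and B = 8, d = 4 are toy sizes picked for readability; none of these values come from the lecture.

```python
import math
import random

B, D = 8, 4  # toy vocabulary size and embedding width (assumed values)

# Stand-in for a learned embedding table: one D-dimensional row per token.
rng = random.Random(0)
embedding_table = [[rng.gauss(0.0, 1.0) for _ in range(D)] for _ in range(B)]


def quantize(x, x_min=0.0, x_max=1.0, num_levels=B):
    """Uniform quantization to a token ID in {0, ..., num_levels - 1}."""
    step = (x_max - x_min) / (num_levels - 1)
    return max(0, min(num_levels - 1, round((x - x_min) / step)))


def positional_encoding(pos, d=D):
    """Sinusoidal encoding of one position (d assumed even)."""
    pe = []
    for k in range(d // 2):
        freq = 1.0 / (10000 ** (2 * k / d))
        pe += [math.sin(pos * freq), math.cos(pos * freq)]
    return pe


def transformer_input(samples):
    """One row per time step: token embedding + positional encoding."""
    rows = []
    for n, x in enumerate(samples):
        emb = embedding_table[quantize(x)]          # what
        pe = positional_encoding(n)                 # where
        rows.append([e + p for e, p in zip(emb, pe)])
    return rows


X = transformer_input([0.2, 0.7, 0.4, 0.9, 0.1])
print(len(X), len(X[0]))  # 5 rows, 4 columns
```

Samples 0.2 and 0.1 both map to token 1, so they share an embedding row, yet their input rows differ because they sit at different time steps.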

Learned vs. Fixed Positional Encodings

Some models (GPT-2, GPT-3) learn the positional encoding as a trainable parameter matrix instead of using fixed sinusoids. Both work. For signals with a fixed maximum context length, learned encodings are common. For variable-length sequences or extrapolation to longer signals, sinusoidal encodings generalize better.

Sinusoidal Positional Encoding

Each row is a position (time step). Each column is a dimension. Color intensity = encoding value. Notice: low dimensions (left) oscillate fast, high dimensions (right) oscillate slowly.

Why does a signal-prediction transformer need positional encoding?

Chapter 4: Teacher Forcing

We now have all the ingredients: tokenized signals, causal masking, and positional encoding. How do we actually train this model? The answer is teacher forcing — one of the most elegant tricks in sequence modeling.

The Training Setup

Given a sequence of tokens [t_0, t_1, ..., t_{N−1}], the training target is simple: predict the next token at every position.

Input (sees)          | Target (predicts)
--------------------- | -----------------
[t_0]                 | t_1
[t_0, t_1]            | t_2
[t_0, t_1, t_2]       | t_3
[t_0, ..., t_{N−2}]   | t_{N−1}

Thanks to the causal mask, we get all N−1 predictions in one forward pass. Position i's output only depends on positions 0..i, so the prediction at position i is a valid prediction of token i+1. No sequential loop needed during training.
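
In code, the table amounts to shifting the token sequence by one. The sketch below uses the tokens from the Chapter 1 worked example; the helper names are ours.

```python
def teacher_forcing_pairs(tokens):
    """Inputs are all tokens but the last; targets are the same tokens shifted by one."""
    return tokens[:-1], tokens[1:]


def visible_context(inputs, i):
    """What the causal mask lets position i see: inputs 0..i."""
    return inputs[: i + 1]


tokens = [1, 5, 3, 6, 1]
inputs, targets = teacher_forcing_pairs(tokens)

# One row per position, all produced by a single forward pass in a real model.
for i, target in enumerate(targets):
    print(visible_context(inputs, i), "->", target)
```

A real model never materializes these prefixes separately. It receives `inputs` once, and the causal mask ensures the output at position i depends only on `inputs[0..i]`.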

The Loss Function

At each position i, the model outputs a probability distribution over all B tokens (via softmax). The loss is cross-entropy between the predicted distribution and the true next token:

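L = −(1/N) ∑_{i=0}^{N−1} log P(t_{i+1} | t_0, ..., t_i)

Here the sequence is indexed t_0, ..., t_N (N+1 tokens), so the average runs over N predictions. This is identical to the language model loss in GPT. The model learns to assign high probability to the actual next signal token.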
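
Here is a minimal sketch of that loss computed from raw logits, in plain Python. The logits are made up for illustration: a model that knows nothing (all logits equal) should score exactly log B, which gives a useful sanity check.

```python
import math


def log_softmax(logits):
    """Stable log-softmax over one row of B logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def next_token_loss(logits, targets):
    """Mean negative log-likelihood; logits[i] scores the token after position i."""
    total = 0.0
    for row, target in zip(logits, targets):
        total -= log_softmax(row)[target]
    return total / len(targets)


B = 8
targets = [5, 3, 6, 1]

# An untrained-looking model: every token equally likely at every position.
uniform_logits = [[0.0] * B for _ in targets]
uniform_loss = next_token_loss(uniform_logits, targets)

# A better model: extra logit mass on the true next token.
sharp_logits = [[5.0 if t == target else 0.0 for t in range(B)] for target in targets]
sharp_loss = next_token_loss(sharp_logits, targets)

print(uniform_loss, math.log(B), sharp_loss)
```

The uniform model's loss equals log 8, and concentrating probability on the correct tokens drives the loss toward zero.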

Why "teacher" forcing? During training, the model always sees the ground truth tokens as input — even if its own prediction at the previous step was wrong. The "teacher" (the real signal) forces the correct context. This prevents error accumulation during training. At inference time, we don't have a teacher — we feed the model's own predictions back in (autoregressive generation). This train-test mismatch is a known issue called exposure bias.

Data Flow During Training

Signal [x0, ..., xN]
N+1 continuous samples
↓ quantize
Tokens [t0, ..., tN]
N+1 integers in {0..B−1}
↓ embed + positional encoding
Transformer input (N+1 × d)
Each row = token embed + PE
↓ causal masked self-attention × L layers
Logits (N+1 × B)
Score for each vocabulary token at each position
↓ cross-entropy loss vs [t1, ..., tN]
Loss scalar
Backpropagate and update weights

Teacher Forcing Simulator

Watch how the model sees ground truth inputs (orange) and makes predictions (teal) at each position. Slide through training steps to see predictions improve.

During teacher forcing, what does the model receive as input at position i?

Chapter 5: Autoregressive Generation

Training is done. The model has learned to predict the next signal token given all past tokens. Now we switch to inference mode: generating new signal samples by feeding predictions back into the model.

The Generation Loop

Given a context (a sequence of observed signal tokens), generation works one step at a time:

Context [t_0, ..., t_k]
Known past signal tokens
↓ forward pass through transformer
P(t_{k+1} | t_0, ..., t_k)
Distribution over B possible next tokens
↓ sample or argmax
t̂_{k+1}
Predicted next token
↓ append to context
[t_0, ..., t_k, t̂_{k+1}]
Context grows by 1
↻ repeat for next position
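
The loop itself is only a few lines. In the sketch below, `toy_model` is a hypothetical stand-in for a trained transformer's forward pass: it simply favors continuing the most recent slope. Only the structure of `generate` is the point here.

```python
import random


def toy_model(context, num_levels):
    """Hypothetical stand-in for a trained model (NOT a transformer).

    Returns P(next token | context) as num_levels probabilities,
    peaked at a linear extrapolation of the last two tokens.
    """
    last = context[-1]
    prev = context[-2] if len(context) > 1 else last
    guess = max(0, min(num_levels - 1, 2 * last - prev))
    weights = [1.0 / (1 + (t - guess) ** 2) for t in range(num_levels)]
    total = sum(weights)
    return [w / total for w in weights]


def generate(context, steps, num_levels, model=toy_model, greedy=True, seed=0):
    """Autoregressive generation: predict, pick a token, append, repeat."""
    rng = random.Random(seed)
    tokens = list(context)
    for _ in range(steps):
        probs = model(tokens, num_levels)               # forward pass
        if greedy:
            nxt = max(range(num_levels), key=probs.__getitem__)
        else:
            nxt = rng.choices(range(num_levels), weights=probs)[0]
        tokens.append(nxt)                              # context grows by 1
    return tokens


greedy_out = generate([2, 3, 4], steps=3, num_levels=16)
sampled_out = generate([2, 3, 4], steps=3, num_levels=16, greedy=False)
print(greedy_out)  # [2, 3, 4, 5, 6, 7]: the toy model continues the ramp
```

Unlike training, this loop is inherently sequential: each new token must exist before the next forward pass can run.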

Sampling Strategies

The model outputs a probability distribution over B tokens. How do we pick the next token?

Strategy             | How                                                             | Properties
-------------------- | --------------------------------------------------------------- | ----------
Greedy (argmax)      | Pick the highest-probability token                              | Deterministic, safe, but can be repetitive
Temperature sampling | Divide logits by T, then sample from softmax                    | T<1: more peaked (conservative). T>1: flatter (creative)
Top-k                | Keep only the k highest-probability tokens, renormalize, sample | Prevents sampling very unlikely tokens

Temperature for signals. For speech synthesis, low temperature (0.7–0.9) preserves the signal's natural dynamics. High temperature (1.2+) adds variety but can create unrealistic jumps. For music generation, higher temperature creates more surprising compositions. The right temperature depends on your application.
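
The three strategies differ only in how they turn logits into a choice. A minimal sketch in plain Python follows; the logits are made-up numbers, and the function names are ours.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Softmax of logits / T; smaller T gives a more peaked distribution."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy(logits):
    """Pick the highest-scoring token."""
    return max(range(len(logits)), key=logits.__getitem__)


def sample_temperature(logits, temperature, rng):
    """Sample from softmax(logits / T)."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs)[0]


def sample_top_k(logits, k, rng, temperature=1.0):
    """Keep the k best tokens, renormalize over them, then sample."""
    top = sorted(range(len(logits)), key=logits.__getitem__, reverse=True)[:k]
    probs = softmax([logits[i] for i in top], temperature)
    return top[rng.choices(range(len(top)), weights=probs)[0]]


logits = [0.1, 2.0, 0.5, 1.5]  # illustrative scores for a 4-token vocabulary
rng = random.Random(0)

print(greedy(logits))  # 1
print([round(max(softmax(logits, T)), 3) for T in (0.5, 1.0, 2.0)])
top2_draws = {sample_top_k(logits, 2, rng) for _ in range(200)}
print(top2_draws)  # only tokens 1 and 3 can ever appear
```

Taking the softmax over just the top-k logits is the same as renormalizing the top-k probabilities, since the discarded terms only change the normalizing constant.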

The Exposure Bias Problem

During training, the model always saw ground truth context. During generation, it sees its own predictions — including mistakes. One wrong prediction shifts the context, leading to more errors. This is exposure bias. For short predictions (5–20 steps), it's usually fine. For long generations (hundreds of steps), the signal can drift into unrealistic territory.

Mitigation strategies include scheduled sampling (gradually replacing teacher tokens with model predictions during training) and, at decoding time, beam search (maintaining multiple candidate sequences so that one poor choice is less likely to derail the continuation).

Autoregressive Generation

Watch a model generate signal tokens one at a time. The orange is the known context; teal is generated. Adjust temperature to see how it affects the continuation.

What does lowering the sampling temperature (T < 1) do to the generated signal?

Chapter 6: AR Models vs Transformers

Autoregressive signal prediction isn't new. Classical AR(p) models have been used since the 1920s. So what does the transformer bring to the table?

Classical AR(p) Model

An AR model of order p describes each sample as a linear combination of the past p samples plus noise:

x[n] = ∑_{k=1}^{p} a_k · x[n−k] + ε[n]

where a_1, ..., a_p are fixed coefficients and ε is white noise. Because ε[n] cannot be predicted from the past, the prediction of the next sample is x̂[n] = ∑_{k=1}^{p} a_k · x[n−k]. This is elegant and efficient, but it assumes the signal is generated by a linear process. It works for speech (which is well-modeled as a linear filter driven by noise), but struggles with complex, nonlinear patterns.
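
A quick illustration of linear prediction at its best: a pure sinusoid cos(ωn) obeys the recursion x[n] = 2cos(ω)·x[n−1] − x[n−2] exactly, so an AR(2) predictor with a_1 = 2cos(ω), a_2 = −1 makes no error on it. The sketch below plugs in those known coefficients rather than fitting them from data; ω = 0.3 is an arbitrary choice.

```python
import math


def ar_predict(history, coeffs):
    """x_hat[n] = sum over k = 1..p of a_k * x[n-k], where history[-1] is x[n-1]."""
    return sum(a * history[-k] for k, a in enumerate(coeffs, start=1))


omega = 0.3  # arbitrary frequency in radians per sample
x = [math.cos(omega * n) for n in range(50)]
coeffs = [2 * math.cos(omega), -1.0]  # a_1, a_2 from the trig identity

# One-step-ahead predictions for n = 2, ..., 49, each using the true past.
pred = [ar_predict(x[:n], coeffs) for n in range(2, 50)]
max_err = max(abs(p - t) for p, t in zip(pred, x[2:]))
print(max_err)  # rounding-level error only
```

Two multiplications per sample is hard to beat when the linear assumption holds, which is why AR models remain the default for signals like this.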

Transformer Autoregressive Model

The transformer does the same thing conceptually — predict the next token from the past — but with key differences:

Property      | AR(p)                               | Transformer
------------- | ----------------------------------- | -----------
Prediction    | Linear combination of past p values | Nonlinear function of entire context
Context       | Fixed window of p samples           | Full sequence (up to context length)
Dependencies  | Short-range only                    | Long-range via attention
Signal type   | Continuous values                   | Quantized discrete tokens
Output        | Single predicted value              | Full distribution over vocabulary
Training data | One signal (fit coefficients)       | Many signals (learn general patterns)
Parameters    | p coefficients                      | Millions (embedding + attention + FFN)

The distribution matters. An AR(p) model outputs one number: the predicted amplitude. A transformer outputs a probability distribution over all possible next tokens. This is powerful because many signals are inherently multimodal: at a silence-to-speech boundary, the next samples could belong to any of several phonemes. The transformer can express this uncertainty; a single AR point prediction cannot.
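
A tiny numerical illustration of why a single number can mislead. The distribution below is invented for this example (B = 8 tokens, two equally likely modes); it is not output from any trained model.

```python
def mean_token(probs):
    """Expected token index under the distribution (a point estimate)."""
    return sum(t * p for t, p in enumerate(probs))


# Hypothetical next-token distribution: the signal will go low (token 1)
# or high (token 6), each with probability 0.45.
probs = [0.02, 0.45, 0.02, 0.01, 0.01, 0.02, 0.45, 0.02]

mean = mean_token(probs)
nearest = round(mean)                                   # token closest to the mean
mode = max(range(len(probs)), key=probs.__getitem__)    # a most likely token

print(mean, probs[nearest], probs[mode])
```

The mean lands between the two modes, on a token the distribution considers very unlikely. A model that outputs the full distribution can instead commit to one mode or the other when sampling.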

When to Use What

Use AR(p) when: the signal is approximately linear, you have limited data, you need real-time performance (p multiplications vs. a full transformer forward pass), or you need interpretable coefficients (e.g., formant frequencies from LPC coefficients).

Use transformers when: the signal has complex nonlinear structure, you have lots of training data, you want to model uncertainty (multi-modal predictions), or you're generating creative outputs (music, speech synthesis).

AR(p) vs Transformer Prediction

Compare a linear AR(4) model (purple) with a simulated transformer prediction (teal) on a nonlinear signal. The AR model captures the trend but misses the fine structure.

A key advantage of transformer-based signal prediction over AR(p) is:

Chapter 7: Signal Predictor Showcase

Time to put it all together. Below is an interactive signal predictor. Draw a signal on the left canvas (click and drag), and a simulated transformer will predict its continuation on the right. You can adjust the vocabulary size, temperature, and prediction length.

How this works under the hood: Your drawn signal is quantized to B levels using uniform quantization, giving one token per sample. A simplified autoregressive model (local pattern matching on the tokenized signal, a lightweight proxy for full transformer attention) predicts the next tokens one by one. Each predicted token is dequantized back to a signal value. The result is a plausible continuation that respects the patterns in your drawing.

Draw & Predict

Draw a signal by clicking and dragging on the canvas. Then click Predict to see the transformer-style continuation. Try smooth waves, sharp spikes, or random noise — see how the predictor responds to each.
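
The demo's exact predictor is not described in detail, so the following is only one plausible reading of "local pattern matching on the tokenized signal". The function `predict_by_pattern` and its `order` parameter are hypothetical: it looks for the most recent earlier occurrence of the last few tokens and copies whatever followed it.

```python
def predict_by_pattern(tokens, steps, order=3):
    """Hypothetical proxy predictor (an assumption, not the demo's actual code).

    At each step, find the latest earlier occurrence of the last `order`
    tokens and emit the token that followed it. If there is no match,
    hold the last value.
    """
    out = list(tokens)
    for _ in range(steps):
        suffix = out[-order:]
        nxt = out[-1]  # fallback: repeat the last token
        # Scan backwards over earlier windows that have a following token.
        for start in range(len(out) - order - 1, -1, -1):
            if out[start:start + order] == suffix:
                nxt = out[start + order]
                break
        out.append(nxt)  # feed the prediction back in, autoregressively
    return out[len(tokens):]


# A tokenized triangle wave with period 4: 0, 2, 4, 2, 0, 2, 4, 2, ...
context = [0, 2, 4, 2, 0, 2, 4, 2, 0, 2]
continuation = predict_by_pattern(context, steps=4)
print(continuation)  # [4, 2, 0, 2]
```

On periodic input this continues the pattern exactly. On input with no repeated structure it falls back to holding the last level, which is one way a predictor might respond to "random noise" drawings.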

Token Sequence View

The same signal shown as discrete tokens. Orange = drawn (context), teal = predicted. Notice the quantization staircase — the transformer operates on these discrete levels, not the smooth curve.

Prediction Confidence

At each predicted step, the model outputs a distribution over all B tokens. The bar chart shows this distribution for the selected step. Peaked = confident. Flat = uncertain.
