EE269 Lecture 26 — Transformers for Signal Prediction

Chapter 0: Why Predict Signals?

You are monitoring a patient's heart rate. The sensor gives you a waveform — a continuous signal over time. Can you predict what happens next? If you could, you'd catch arrhythmias before they happen. You could compress audio by transmitting only the parts the predictor gets wrong. You could denoise by favoring likely continuations over noisy ones.

Signal prediction is one of the oldest problems in engineering. Traditional approaches — linear predictive coding, Wiener filters, ARMA models — assume the signal is generated by a linear process. They work beautifully on speech and simple time series. But what about signals with complex, nonlinear, long-range dependencies?

Here's the twist: GPT predicts the next word in a sentence. A word is just a token — a discrete symbol from a vocabulary. What if we turned our continuous signal into discrete tokens? Then we could use the exact same transformer architecture that powers ChatGPT to predict signals.

The core idea: Quantize a continuous signal into discrete levels, treat each level as a "word," and train a GPT-style transformer to predict the next token. The transformer learns the statistical structure of the signal — not from hand-crafted equations, but from data.

In this lesson we'll build every piece: how to turn a signal into tokens, why causal masking matters, how positional encoding gives the model a sense of time, and how teacher forcing trains the whole thing efficiently. The payoff: you'll draw a signal and watch a transformer predict its continuation.

The Prediction Problem

A noisy signal (orange) with its clean continuation (teal). Can we learn to predict the future from the past? Click New Signal to see different patterns.

Why can't we directly apply GPT to a continuous analog signal?

GPT is too slow for real-time signals
GPT operates on discrete tokens from a finite vocabulary, but signals are continuous
Signals are 1D but GPT requires 2D inputs

Chapter 1: Signal Tokenization

In NLP, tokenization splits "the cat sat" into ["the", "cat", "sat"] — discrete symbols from a vocabulary of ~50,000 words. For signals, we need to do something analogous: convert a continuous amplitude into a discrete token ID from a finite vocabulary of B levels.

Uniform Quantization

The simplest approach is uniform quantization. Given a signal with minimum value x_min and maximum value x_max, we divide the range into B evenly spaced levels. The step size (distance between adjacent levels) is:

Δ = (x_max − x_min) / (B − 1)

To quantize a sample x, we find the nearest level:

token(x) = round( (x − x_min) / Δ )

This gives us an integer in {0, 1, ..., B−1} — exactly like a word index in a vocabulary of size B. To reconstruct the signal value from a token:

x̂ = x_min + token · Δ

Vocabulary size = quantization resolution. With B = 256 levels (8-bit), the discretization is fine enough for many signals, though still far coarser than the 16-bit (65,536-level) resolution of CD audio. With B = 16, the "words" are coarse and the reconstructed signal is noticeably choppy. The transformer's vocabulary size is literally the number of quantization levels. More levels = more faithful signal, but also a larger softmax over the vocabulary at each prediction step.

Worked Example

Signal samples: [0.2, 0.7, 0.4, 0.9, 0.1]. With B = 8 levels over [0, 1]:

Δ = (1 − 0) / (8 − 1) = 1/7 ≈ 0.143

Sample x	(x − x_min) / Δ	Token (round)	Reconstructed x̂	Error
0.200	1.40	1	0.143	−0.057
0.700	4.90	5	0.714	+0.014
0.400	2.80	3	0.429	+0.029
0.900	6.30	6	0.857	−0.043
0.100	0.70	1	0.143	+0.043

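Notice that every error is at most Δ/2 ≈ 0.071 in magnitude (the largest in this example is 0.057). This bound holds for any sample inside [x_min, x_max] and sets the quantization noise floor. More levels means smaller Δ, less noise, but a larger token vocabulary for the transformer to handle.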
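The quantize and reconstruct formulas above can be checked against the worked example with a few lines of plain Python. This is a minimal sketch: the function names `quantize` and `dequantize` are illustrative, and the clipping of out-of-range samples is an added safeguard that the formulas themselves do not mention.

```python
import math

def quantize(x, x_min, x_max, B):
    """Map a sample x to the index of the nearest of B uniform levels."""
    delta = (x_max - x_min) / (B - 1)
    # floor(v + 0.5) rounds halves up; Python's round() rounds halves to even
    token = math.floor((x - x_min) / delta + 0.5)
    # Assumption: clip samples that fall outside [x_min, x_max]
    return min(max(token, 0), B - 1)

def dequantize(token, x_min, x_max, B):
    """Reconstruct the signal value represented by a token."""
    delta = (x_max - x_min) / (B - 1)
    return x_min + token * delta

samples = [0.2, 0.7, 0.4, 0.9, 0.1]
B = 8
tokens = [quantize(x, 0.0, 1.0, B) for x in samples]
recon = [dequantize(t, 0.0, 1.0, B) for t in tokens]
errors = [r - x for r, x in zip(recon, samples)]

print(tokens)                            # [1, 5, 3, 6, 1]
print([round(r, 3) for r in recon])
print([round(e, 3) for e in errors])
```

The token list matches the table, and every error stays within Δ/2 = 1/14.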

Signal Tokenization Explorer

Adjust the vocabulary size B to see how quantization resolution affects the tokenized signal. The staircase is the reconstructed signal from tokens.

Think of it this way: a song is continuous air pressure. A CD stores it quantized to 65,536 levels (16-bit). We're doing the same thing, but with far fewer levels, and then treating the resulting integers as a "language" for the transformer to learn.

With B = 32 levels over the range [−1, 1], what is the quantization step size Δ?

0.032
2/31 ≈ 0.0645
2/32 = 0.0625

Chapter 2: Causal Masking

In the previous lecture, we learned that self-attention lets every token attend to every other token. But there's a problem for prediction: if we want to predict token 5, we can't let the model see token 5 during training. That would be cheating — predicting the future using the future.

The solution is a causal mask (also called an autoregressive mask). It's a simple rule: token at position i can only attend to tokens at positions ≤ i. Position 3 sees tokens 0, 1, 2, 3. It cannot see tokens 4, 5, 6, ...

How It Works

The attention score matrix is n × n, where entry (i, j) measures how much token i attends to token j. The causal mask sets all entries where j > i to −∞ before the softmax. Since e^−∞ = 0, those positions get zero attention weight.

Mask(i, j) = 0 if j ≤ i, −∞ if j > i

Attention = softmax( (QK^T + Mask) / √d_k ) · V

The result is a lower-triangular attention matrix. Each row sums to 1 (from softmax), but all weight is concentrated on current and past tokens.
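To make the masking rule concrete, here is a small single-head sketch in plain Python. The Q, K, V matrices are random placeholders rather than learned projections, and the helper names are illustrative. The mask is added after scaling by √d_k, which gives the same result as the formula above because the mask entries are only 0 or −∞.

```python
import math
import random

def causal_mask(n):
    """mask[i][j] = 0 if j <= i, -inf if j > i."""
    return [[0.0 if j <= i else -math.inf for j in range(n)] for i in range(n)]

def softmax(row):
    m = max(row)  # finite, because the diagonal entry is never masked
    exps = [math.exp(v - m) for v in row]  # math.exp(-inf) is exactly 0.0
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(Q, K, V):
    """Single-head attention where position i only sees positions <= i."""
    n, d_k = len(Q), len(Q[0])
    mask = causal_mask(n)
    weights = []
    for i in range(n):
        scores = [
            sum(q * k for q, k in zip(Q[i], K[j])) / math.sqrt(d_k) + mask[i][j]
            for j in range(n)
        ]
        weights.append(softmax(scores))
    out = [
        [sum(weights[i][j] * V[j][c] for j in range(n)) for c in range(len(V[0]))]
        for i in range(n)
    ]
    return out, weights

rng = random.Random(0)
n, d_k = 6, 4
Q = [[rng.uniform(-1, 1) for _ in range(d_k)] for _ in range(n)]
K = [[rng.uniform(-1, 1) for _ in range(d_k)] for _ in range(n)]
V = [[rng.uniform(-1, 1) for _ in range(d_k)] for _ in range(n)]

out, weights = causal_attention(Q, K, V)
nonzero = sum(1 for row in weights for w in row if w > 0.0)
print(nonzero)  # 21 = n(n+1)/2 for n = 6
```

Row 0 puts all of its weight on token 0, so its output is simply V[0]; each later row spreads its weight over one more token.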

Causal masking is what makes a transformer autoregressive. Without the mask, it's a bidirectional encoder (like BERT) — good for understanding, but can't generate. With the mask, it's a decoder (like GPT) — trained to predict what comes next, one token at a time. For signal prediction, we always use the causal mask.

Why Not Just Use Past Tokens as Input?

You might wonder: why not just feed the model tokens [0, ..., i−1] when predicting token i? The causal mask is clever because it lets us compute all predictions simultaneously during training. Position 0 predicts token 1. Position 1 predicts token 2. Position n−1 predicts token n. All of this happens in one forward pass, because every position is fed the ground-truth history rather than the model's own guesses. Feeding ground truth in this way is called teacher forcing (we'll explore it in Chapter 4).

Causal Mask Visualizer

Click a token position to see what it can attend to. Bright cells = allowed, dark cells = masked (−∞). Compare with the full (unmasked) attention.

In a causally masked transformer with 6 tokens, how many attention entries (out of 36) are non-zero?

18
15
21 (the lower triangle including diagonal: n(n+1)/2)

Chapter 3: Positional Encoding

Self-attention is permutation-equivariant: shuffle the input tokens, and the output shuffles the same way. The attention mechanism doesn't know that token 3 comes after token 2 — it treats the sequence like a set. For NLP, word order matters ("dog bites man" ≠ "man bites dog"). For signals, temporal order is everything.

The fix is to add a positional encoding to each token embedding. The model receives: (what this token is) + (where this token is in time).

Sinusoidal Encoding

The original Transformer paper (Vaswani et al., 2017) uses sine and cosine functions at different frequencies:

PE(pos, 2k) = sin( pos / 10000^(2k/d) )

PE(pos, 2k+1) = cos( pos / 10000^(2k/d) )

where pos is the position index, d is the embedding dimension, and k = 0, 1, ..., d/2 − 1 indexes the (sine, cosine) pairs of dimensions. Each pair oscillates at a different frequency. Low dimensions change rapidly (high frequency); high dimensions change slowly (low frequency). Together, they give each position a unique fingerprint.

Why sines and cosines? Two reasons. First, each position gets its own vector: over the range of positions used in practice, no two positions share the same encoding. Second, relative positions can be expressed as linear transformations of these encodings: for a fixed offset m, each (sine, cosine) pair of PE(pos + m) is a rotation of the same pair of PE(pos) by an angle that depends only on m and on that pair's frequency, not on pos. This means the model can learn to attend to "3 steps ago" regardless of absolute position.
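The rotation property can be verified numerically. The sketch below assumes an even dimension d and uses illustrative function names; `shift_encoding` applies the angle-addition identities to each (sine, cosine) pair and should reproduce the encoding of the shifted position.

```python
import math

def positional_encoding(pos, d, base=10000.0):
    """Sinusoidal encoding of one position; d is assumed to be even."""
    pe = [0.0] * d
    for k in range(d // 2):
        angle = pos / base ** (2 * k / d)
        pe[2 * k] = math.sin(angle)
        pe[2 * k + 1] = math.cos(angle)
    return pe

def shift_encoding(pe, offset, base=10000.0):
    """Rotate each (sin, cos) pair of PE(pos) to obtain PE(pos + offset)."""
    d = len(pe)
    out = [0.0] * d
    for k in range(d // 2):
        theta = offset / base ** (2 * k / d)  # depends on offset and k only
        s, c = pe[2 * k], pe[2 * k + 1]
        out[2 * k] = s * math.cos(theta) + c * math.sin(theta)      # sin(a + theta)
        out[2 * k + 1] = c * math.cos(theta) - s * math.sin(theta)  # cos(a + theta)
    return out

d = 8
direct = positional_encoding(8, d)
rotated = shift_encoding(positional_encoding(5, d), 3)
max_diff = max(abs(a - b) for a, b in zip(direct, rotated))
print(max_diff < 1e-9)  # True: PE(5) rotated by 3 steps equals PE(8)
```

Note that `shift_encoding` never looks at the absolute position, only at the offset, which is exactly why "3 steps ago" is expressible anywhere in the sequence.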

For Signals: Position = Time Step

In NLP, position 0 is the first word. In signal prediction, position 0 is the first sample (time t=0). Position 1 is the sample at t = T_s (one sampling period later). The positional encoding gives the transformer a clock — it knows not just what each token value is, but when in the sequence it occurred.

Signal sample x[n]

Continuous value at time n

↓ quantize

Token ID ∈ {0, ..., B−1}

Discrete vocabulary index

↓ embedding lookup

Token embedding ∈ R^d

Learned dense vector

↓ + PE(n)

Input to transformer ∈ R^d

What + Where
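The last two arrows of this pipeline can be sketched as a table lookup followed by an element-wise sum. In the sketch below the embedding table is filled with random numbers as a stand-in for learned weights, and the names `embedding_table` and `embed_sequence` are illustrative.

```python
import math
import random

def positional_encoding(pos, d, base=10000.0):
    """Sinusoidal encoding of one position; d is assumed to be even."""
    pe = [0.0] * d
    for k in range(d // 2):
        angle = pos / base ** (2 * k / d)
        pe[2 * k] = math.sin(angle)
        pe[2 * k + 1] = math.cos(angle)
    return pe

def embed_sequence(tokens, embedding_table):
    """Row n = embedding of tokens[n] ("what") + PE(n) ("where")."""
    d = len(embedding_table[0])
    rows = []
    for n, tok in enumerate(tokens):
        pe = positional_encoding(n, d)
        rows.append([e + p for e, p in zip(embedding_table[tok], pe)])
    return rows

rng = random.Random(0)
B, d = 8, 4
# Assumption: random values stand in for a learned B x d embedding matrix
embedding_table = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(B)]

tokens = [1, 5, 3, 6, 1]
X = embed_sequence(tokens, embedding_table)
print(len(X), len(X[0]))  # 5 4
print(X[0] != X[4])       # True: same token, different time step
```

Positions 0 and 4 hold the same token, yet their input vectors differ, which is the information self-attention would otherwise lack.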

Learned vs. Fixed Positional Encodings

Some models (GPT-2, GPT-3) learn the positional encoding as a trainable parameter matrix instead of using fixed sinusoids. Both work. For signals with a fixed maximum context length, learned encodings are common. For variable-length sequences or extrapolation to longer signals, sinusoidal encodings are the more natural choice: they are defined for every position, whereas a learned table has no entry beyond the longest position seen in training.

Sinusoidal Positional Encoding

Each row is a position (time step). Each column is a dimension. Color intensity = encoding value. Notice: low dimensions (left) oscillate fast, high dimensions (right) oscillate slowly.

Why does a signal-prediction transformer need positional encoding?

Self-attention is permutation-equivariant: without position info, the model can't distinguish "earlier" from "later"
Positional encoding reduces the number of parameters
It increases the vocabulary size for better resolution

Chapter 4: Teacher Forcing

We now have all the ingredients: tokenized signals, causal masking, and positional encoding. How do we actually train this model? The answer is teacher forcing — one of the most elegant tricks in sequence modeling.

The Training Setup

Given a sequence of N+1 tokens [t₀, t₁, ..., t_N], the training target is simple: predict the next token at every position.

Input (sees)	Target (predicts)
[t₀]	t₁
[t₀, t₁]	t₂
[t₀, t₁, t₂]	t₃
[t₀, ..., t_N−1]	t_N

Thanks to the causal mask, we get all N predictions in one forward pass. Position i's output depends only on positions 0..i, so the prediction at position i is a valid prediction of token i+1. No sequential loop is needed during training.

The Loss Function

At each position i, the model outputs a probability distribution over all B tokens (via softmax). The loss is the cross-entropy between the predicted distribution and the true next token, averaged over the N predictions:

L = − (1/N) ∑_{i=0}^{N−1} log P(t_{i+1} | t₀, ..., t_i)

This is identical to the language model loss in GPT. The model learns to assign high probability to the actual next signal token.

Why "teacher" forcing? During training, the model always sees the ground truth tokens as input — even if its own prediction at the previous step was wrong. The "teacher" (the real signal) forces the correct context. This prevents error accumulation during training. At inference time, we don't have a teacher — we feed the model's own predictions back in (autoregressive generation). This train-test mismatch is a known issue called exposure bias.
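The shift between inputs and targets is easy to get wrong, so here is a minimal sketch of the loss computation. No transformer is involved: the logits are hand-made placeholders, and `next_token_loss` is an illustrative name. The inputs are all tokens except the last, and the targets are all tokens except the first.

```python
import math

def log_softmax(logits):
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def next_token_loss(logits, tokens):
    """Average cross-entropy of the next-token predictions.

    logits[i] holds the B scores produced at position i, i.e. after the
    model has seen the ground-truth tokens[0..i] (teacher forcing).
    """
    targets = tokens[1:]  # position i is scored against token i+1
    total = 0.0
    for i, target in enumerate(targets):
        total -= log_softmax(logits[i])[target]
    return total / len(targets)

B = 8
tokens = [1, 5, 3, 6, 1]

# An untrained model that scores every token equally: loss = log(B)
uniform_logits = [[0.0] * B for _ in tokens]
loss_uniform = next_token_loss(uniform_logits, tokens)

# Hypothetical well-trained model: a large score on the true next token
confident_logits = [[0.0] * B for _ in tokens]
for i, target in enumerate(tokens[1:]):
    confident_logits[i][target] = 10.0
loss_confident = next_token_loss(confident_logits, tokens)

print(round(loss_uniform, 4), round(math.log(B), 4))
print(loss_confident < loss_uniform)  # True
```

The logits at the final position are never scored, because there is no next token to compare them with.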

Data Flow During Training

Signal [x₀, ..., x_N]

N+1 continuous samples

↓ quantize

Tokens [t₀, ..., t_N]

N+1 integers in {0..B−1}

↓ embed + positional encoding

Transformer input (N+1 × d)

Each row = token embed + PE

↓ causal masked self-attention × L layers

Logits (N+1 × B)

Score for each vocabulary token at each position

↓ cross-entropy loss: logits at positions 0..N−1 vs. targets [t₁, ..., t_N] (the last position has no target)

Loss scalar

Backpropagate and update weights

Teacher Forcing Simulator

Watch how the model sees ground truth inputs (orange) and makes predictions (teal) at each position. Slide through training steps to see predictions improve.

During teacher forcing, what does the model receive as input at position i?

The ground truth token t_i, regardless of what the model predicted at position i−1
The model's own prediction from position i−1
A mixture of ground truth and model prediction

Chapter 5: Autoregressive Generation

Training is done. The model has learned to predict the next signal token given all past tokens. Now we switch to inference mode: generating new signal samples by feeding predictions back into the model.

The Generation Loop

Given a context (a sequence of observed signal tokens), generation works one step at a time:

Context [t₀, ..., t_k]

Known past signal tokens

↓ forward pass through transformer

P(t_k+1 | t₀..t_k)

Distribution over B possible next tokens

↓ sample or argmax

t̂_k+1

Predicted next token

↓ append to context

[t₀, ..., t_k, t̂_k+1]

Context grows by 1

↻ repeat for next position
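The loop above can be sketched in a few lines. The `toy_distribution` function below is a hypothetical stand-in for the transformer's forward pass: it simply favors "previous token + 1", so the output is easy to check by eye. Greedy (argmax) selection is used here for determinism.

```python
def generate(next_distribution, context, steps):
    """Autoregressive loop: predict, pick a token, append, repeat."""
    seq = list(context)
    for _ in range(steps):
        probs = next_distribution(seq)  # P(t_{k+1} | t_0..t_k)
        next_token = max(range(len(probs)), key=lambda t: probs[t])  # argmax
        seq.append(next_token)          # the context grows by 1
    return seq

B = 8

def toy_distribution(seq):
    """Hypothetical model: puts 0.9 on (last token + 1) mod B."""
    probs = [0.1 / (B - 1)] * B
    probs[(seq[-1] + 1) % B] = 0.9
    return probs

generated = generate(toy_distribution, [0, 1, 2], steps=7)
print(generated)  # [0, 1, 2, 3, 4, 5, 6, 7, 0, 1]
```

Unlike training, this loop is inherently sequential: each new token requires another forward pass over the growing context.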

Sampling Strategies

The model outputs a probability distribution over B tokens. How do we pick the next token?

Strategy	How	Properties
Greedy (argmax)	Pick the highest-probability token	Deterministic, safe, but can be repetitive
Temperature sampling	Divide logits by T, then sample from softmax	T<1: more peaked (conservative). T>1: flatter (creative)
Top-k	Keep only the k highest-probability tokens, renormalize, sample	Prevents sampling very unlikely tokens
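The three strategies in the table differ only in how logits are turned into a choice. The sketch below uses made-up logits and illustrative function names; in the top-k branch, tokens tied with the k-th largest logit are all kept, which is one possible tie-breaking convention.

```python
import math
import random

def token_probs(logits, temperature=1.0, top_k=None):
    """Softmax over logits / T, optionally restricted to the top-k tokens."""
    scaled = [v / temperature for v in logits]
    if top_k is not None:
        cutoff = sorted(scaled, reverse=True)[top_k - 1]
        # Masked tokens get -inf, so their probability is exactly 0
        scaled = [v if v >= cutoff else -math.inf for v in scaled]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_token(logits):
    return max(range(len(logits)), key=lambda i: logits[i])

def sample_token(logits, temperature=1.0, top_k=None, rng=random):
    probs = token_probs(logits, temperature, top_k)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary

p_cold = token_probs(logits, temperature=0.5)  # more peaked
p_base = token_probs(logits, temperature=1.0)
p_hot = token_probs(logits, temperature=2.0)   # flatter
print(p_cold[0] > p_base[0] > p_hot[0])        # True

p_top2 = token_probs(logits, top_k=2)
print([round(p, 3) for p in p_top2])           # only two nonzero entries

rng = random.Random(0)
draws = [sample_token(logits, temperature=0.8, top_k=2, rng=rng) for _ in range(20)]
print(set(draws) <= {0, 1})                    # True: tokens 2 and 3 never appear
```

As T approaches 0 the distribution collapses onto the argmax, so greedy decoding can be seen as the limiting case of temperature sampling.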

Temperature for signals. For speech synthesis, low temperature (0.7–0.9) preserves the signal's natural dynamics. High temperature (1.2+) adds variety but can create unrealistic jumps. For music generation, higher temperature creates more surprising compositions. The right temperature depends on your application.

The Exposure Bias Problem

During training, the model always saw ground truth context. During generation, it sees its own predictions — including mistakes. One wrong prediction shifts the context, leading to more errors. This is exposure bias. For short predictions (5–20 steps), it's usually fine. For long generations (hundreds of steps), the signal can drift into unrealistic territory.

Mitigation strategies include scheduled sampling (gradually replacing teacher tokens with model predictions during training) and, at inference time, beam search (maintaining multiple candidate sequences so that one early mistake does not commit the whole continuation).

Autoregressive Generation

Watch a model generate signal tokens one at a time. The orange is the known context; teal is generated. Adjust temperature to see how it affects the continuation.

What does lowering the sampling temperature (T < 1) do to the generated signal?

Adds more noise and randomness
Makes the output more deterministic, picking higher-probability tokens more often
Increases the vocabulary size

Chapter 6: AR Models vs Transformers

Autoregressive signal prediction isn't new. Classical AR(p) models have been used since the 1920s. So what does the transformer bring to the table?

Classical AR(p) Model

An AR model of order p predicts the next sample as a linear combination of the past p samples:

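x̂[n] = ∑_{k=1}^{p} a_k · x[n−k]

The underlying signal model is x[n] = x̂[n] + ε[n]: the actual sample equals the prediction plus an unpredictable innovation ε[n],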

where a₁, ..., a_p are fixed coefficients and ε is white noise. This is elegant and efficient, but it assumes the signal is generated by a linear process. It works for speech (which is well-modeled as a linear filter driven by noise), but struggles with complex, nonlinear patterns.
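As a concrete instance of linear prediction, a pure sinusoid is predicted exactly by an AR(2) model with a₁ = 2cos(ω) and a₂ = −1, which follows from the identity sin(ωn) = 2cos(ω)·sin(ω(n−1)) − sin(ω(n−2)). The sketch below sets these coefficients by hand rather than fitting them from data, and `ar_predict` is an illustrative name.

```python
import math

def ar_predict(history, coeffs):
    """One-step AR prediction: sum over k of a_k * x[n-k].

    history[-1] is x[n-1], history[-2] is x[n-2], and so on.
    """
    return sum(a * history[-k] for k, a in enumerate(coeffs, start=1))

omega = 0.3
x = [math.sin(omega * n) for n in range(50)]

# Hand-set AR(2) coefficients for a noiseless sinusoid (not fitted)
coeffs = [2.0 * math.cos(omega), -1.0]

predictions = [ar_predict(x[:n], coeffs) for n in range(2, len(x))]
max_err = max(abs(p - x[n]) for n, p in zip(range(2, len(x)), predictions))
print(max_err < 1e-9)  # True: two coefficients capture the whole signal
```

Each prediction costs p multiplications, which is why AR models remain attractive when the linear assumption holds.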

Transformer Autoregressive Model

The transformer does the same thing conceptually — predict the next token from the past — but with key differences:

Property	AR(p)	Transformer
Prediction	Linear combination of past p values	Nonlinear function of entire context
Context	Fixed window of p samples	Full sequence (up to context length)
Dependencies	Short-range only	Long-range via attention
Signal type	Continuous values	Quantized discrete tokens
Output	Single predicted value	Full distribution over vocabulary
Training data	One signal (fit coefficients)	Many signals (learn general patterns)
Parameters	p coefficients	Millions (embedding + attention + FFN)

The distribution matters. An AR(p) model outputs one number: the predicted amplitude. A transformer outputs a probability distribution over all possible next tokens. This is powerful because many signals are inherently multimodal: at a silence-to-speech boundary, the upcoming samples could be the onset of any of several phonemes, each implying a different next value. The transformer can put probability mass on each of these alternatives; the AR model's single point estimate cannot.
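A tiny numerical illustration of why a single number falls short: the distribution below is invented for the example, with two equally likely next tokens. Its mean, which is the kind of value a point predictor trained on squared error tends toward, lands between the two peaks where the distribution assigns no probability at all.

```python
B = 16

# Made-up bimodal next-token distribution: token 3 or token 12, equally likely
probs = [0.0] * B
probs[3] = 0.5
probs[12] = 0.5

mean_token = sum(t * p for t, p in enumerate(probs))
peak = max(probs)
modes = [t for t, p in enumerate(probs) if p == peak]

print(mean_token)                       # 7.5
print(modes)                            # [3, 12]
print(probs[int(mean_token)])           # 0.0: the "average" answer never occurs
```

Sampling from the distribution would return 3 or 12, both plausible continuations, while the averaged value corresponds to neither.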

When to Use What

Use AR(p) when: the signal is approximately linear, you have limited data, you need real-time performance (p multiplications vs. a full transformer forward pass), or you need interpretable coefficients (e.g., formant frequencies from LPC coefficients).

Use transformers when: the signal has complex nonlinear structure, you have lots of training data, you want to model uncertainty (multi-modal predictions), or you're generating creative outputs (music, speech synthesis).

AR(p) vs Transformer Prediction

Compare a linear AR(4) model (purple) with a simulated transformer prediction (teal) on a nonlinear signal. The AR model captures the trend but misses the fine structure.

A key advantage of transformer-based signal prediction over AR(p) is:

Transformers use fewer parameters
Transformers always run faster at inference
Transformers output a full probability distribution and capture nonlinear, long-range dependencies

Chapter 7: Signal Predictor Showcase

Time to put it all together. Below is an interactive signal predictor. Draw a signal on the left canvas (click and drag), and a simulated transformer will predict its continuation on the right. You can adjust the vocabulary size, temperature, and prediction length.

How this works under the hood: Your drawn signal is quantized into B tokens using uniform quantization. A simplified autoregressive model (using local pattern matching on the tokenized signal — a lightweight proxy for full transformer attention) predicts the next tokens one by one. Each predicted token is dequantized back to a signal value. The result is a plausible continuation that respects the patterns in your drawing.
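The demo's actual code is not shown here, so the following is only a guess at what "local pattern matching on the tokenized signal" could look like. The function name, the match length `m`, and the fallback rule are all assumptions made for illustration: the predictor finds the most recent earlier occurrence of the last m tokens and copies whatever token followed it.

```python
def predict_by_pattern(tokens, steps, m=2):
    """Continue a non-empty token sequence by copying what followed the
    most recent earlier occurrence of its last m tokens (a hypothetical
    stand-in for a learned predictor)."""
    seq = list(tokens)
    for _ in range(steps):
        suffix = seq[-m:]
        next_token = seq[-1]  # fallback: hold the last value
        # Search backwards, skipping the suffix itself
        for start in range(len(seq) - m - 1, -1, -1):
            if seq[start:start + m] == suffix:
                next_token = seq[start + m]
                break
        seq.append(next_token)  # autoregressive: predictions feed back in
    return seq[len(tokens):]

context = [0, 1, 2, 3, 0, 1, 2, 3, 0, 1]
print(predict_by_pattern(context, steps=6))  # [2, 3, 0, 1, 2, 3]
```

On a periodic token sequence this continues the period; on a sequence with no repeated pattern it falls back to holding the last value, which a real transformer would be expected to improve on.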

Draw & Predict

Draw a signal by clicking and dragging on the canvas. Then click Predict to see the transformer-style continuation. Try smooth waves, sharp spikes, or random noise — see how the predictor responds to each.

Token Sequence View

The same signal shown as discrete tokens. Orange = drawn (context), teal = predicted. Notice the quantization staircase — the transformer operates on these discrete levels, not the smooth curve.

Prediction Confidence

At each predicted step, the model outputs a distribution over all B tokens. The bar chart shows this distribution for the selected step. Peaked = confident. Flat = uncertain.
