Turn a continuous waveform into tokens, then let GPT-style autoregression predict the future.
You are monitoring a patient's heart rate. The sensor gives you a waveform — a continuous signal over time. Can you predict what happens next? If you could, you'd catch arrhythmias before they happen. You could compress audio by transmitting only the parts the predictor gets wrong. You could denoise by favoring likely continuations over noisy ones.
Signal prediction is one of the oldest problems in engineering. Traditional approaches — linear predictive coding, Wiener filters, ARMA models — assume the signal is generated by a linear process. They work beautifully on speech and simple time series. But what about signals with complex, nonlinear, long-range dependencies?
Here's the twist: GPT predicts the next word in a sentence. A word is just a token — a discrete symbol from a vocabulary. What if we turned our continuous signal into discrete tokens? Then we could use the exact same transformer architecture that powers ChatGPT to predict signals.
In this lesson we'll build every piece: how to turn a signal into tokens, why causal masking matters, how positional encoding gives the model a sense of time, and how teacher forcing trains the whole thing efficiently. The payoff: you'll draw a signal and watch a transformer predict its continuation.
A noisy signal (orange) with its clean continuation (teal). Can we learn to predict the future from the past? Click New Signal to see different patterns.
In NLP, tokenization splits "the cat sat" into ["the", "cat", "sat"]: discrete symbols from a vocabulary of roughly 50,000 tokens. For signals, we need something analogous: convert each continuous amplitude into a discrete token ID from a finite vocabulary of B levels.
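The simplest approach is uniform quantization. Given a signal with minimum value xmin and maximum value xmax, we divide the range into B evenly spaced levels, with the first level at xmin and the last at xmax. The step size (distance between adjacent levels) is:
$$\Delta = \frac{x_{\max} - x_{\min}}{B - 1}$$
To quantize a sample x, we find the nearest level:
$$\text{token} = \operatorname{round}\!\left(\frac{x - x_{\min}}{\Delta}\right)$$
This gives us an integer in {0, 1, ..., B−1}, exactly like a word index in a vocabulary of size B. To reconstruct the signal value from a token:
$$\hat{x} = x_{\min} + \text{token} \cdot \Delta$$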
Signal samples: [0.2, 0.7, 0.4, 0.9, 0.1]. With B = 8 levels over [0, 1]:
| Sample x | (x − xmin) / Δ | Token (round) | Reconstructed x̂ | Error |
|---|---|---|---|---|
| 0.200 | 1.40 | 1 | 0.143 | −0.057 |
| 0.700 | 4.90 | 5 | 0.714 | +0.014 |
| 0.400 | 2.80 | 3 | 0.429 | +0.029 |
| 0.900 | 6.30 | 6 | 0.857 | −0.043 |
| 0.100 | 0.70 | 1 | 0.143 | +0.043 |
Notice that the error never exceeds Δ/2 ≈ 0.071 (the largest error in this table is 0.057). This bound is the quantization noise floor. More levels mean a smaller Δ and less noise, but a larger token vocabulary for the transformer to handle.
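To make the round trip concrete, here is a minimal sketch of uniform quantization in plain Python. The function names `quantize` and `dequantize` are illustrative, not from any library, and the step size assumes B levels with the first at xmin and the last at xmax, which is what the worked table implies (Δ = 1/7 for B = 8 over [0, 1]).

```python
import math

def quantize(x, x_min, x_max, num_levels):
    """Map a continuous sample to a token ID in {0, ..., num_levels - 1}."""
    step = (x_max - x_min) / (num_levels - 1)   # step size (Delta)
    # floor(v + 0.5) rounds to the nearest level; Python's round() would
    # send exact halves to the nearest even integer instead.
    token = math.floor((x - x_min) / step + 0.5)
    # Clamp so out-of-range samples still map to a valid token.
    return max(0, min(num_levels - 1, token))

def dequantize(token, x_min, x_max, num_levels):
    """Map a token ID back to the amplitude of its level."""
    step = (x_max - x_min) / (num_levels - 1)
    return x_min + token * step

samples = [0.2, 0.7, 0.4, 0.9, 0.1]
tokens = [quantize(x, 0.0, 1.0, 8) for x in samples]
recon = [dequantize(t, 0.0, 1.0, 8) for t in tokens]
errors = [r - x for r, x in zip(recon, samples)]

print(tokens)                          # [1, 5, 3, 6, 1]
print([round(r, 3) for r in recon])    # [0.143, 0.714, 0.429, 0.857, 0.143]
```

The clamp is a design choice for samples that fall outside the range seen when xmin and xmax were chosen; without it, such samples would produce token IDs outside the vocabulary.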
Adjust the vocabulary size B to see how quantization resolution affects the tokenized signal. The staircase is the reconstructed signal from tokens.
In the previous lecture, we learned that self-attention lets every token attend to every other token. But there's a problem for prediction: if we want to predict token 5, we can't let the model see token 5 during training. That would be cheating — predicting the future using the future.
The solution is a causal mask (also called an autoregressive mask). It's a simple rule: token at position i can only attend to tokens at positions ≤ i. Position 3 sees tokens 0, 1, 2, 3. It cannot see tokens 4, 5, 6, ...
The attention score matrix is n × n, where entry (i, j) measures how much token i attends to token j. The causal mask sets all entries where j > i to −∞ before the softmax. Since $e^{-\infty} = 0$, those positions get zero attention weight.
The result is a lower-triangular attention matrix. Each row sums to 1 (from softmax), but all weight is concentrated on current and past tokens.
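A small sketch of the masking step, assuming the scores have already been computed. The helper `causal_attention_weights` and the score values are made up for illustration; in a real model the scores would come from the query-key products.

```python
import math

def causal_attention_weights(scores):
    """Apply a causal mask to an n x n score matrix, then softmax each row."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Entries with j > i (future positions) are set to -inf.
        row = [scores[i][j] if j <= i else -math.inf for j in range(n)]
        row_max = max(row)                            # subtract max for stability
        exps = [math.exp(s - row_max) for s in row]   # exp(-inf) == 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Hypothetical raw attention scores for a 4-token sequence.
scores = [[0.5, 1.0, -0.3, 0.2],
          [0.1, 0.4, 0.9, -1.0],
          [1.2, -0.5, 0.3, 0.8],
          [0.0, 0.7, -0.2, 0.6]]

for row in causal_attention_weights(scores):
    print([round(w, 3) for w in row])
```

Row 0 comes out as all weight on position 0, since that is the only position it may see; every later row spreads its weight over one more position.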
You might wonder: why not just feed the model tokens [0, ..., i−1] when predicting token i? The causal mask is clever because it lets us compute all predictions simultaneously during training. Position 0 predicts token 1. Position 1 predicts token 2. Position n−1 predicts token n. All in one forward pass. This is called teacher forcing (we'll explore it in Chapter 4).
Click a token position to see what it can attend to. Bright cells = allowed, dark cells = masked (−∞). Compare with the full (unmasked) attention.
Self-attention is permutation-equivariant: shuffle the input tokens, and the output shuffles the same way. The attention mechanism doesn't know that token 3 comes after token 2 — it treats the sequence like a set. For NLP, word order matters ("dog bites man" ≠ "man bites dog"). For signals, temporal order is everything.
The fix is to add a positional encoding to each token embedding. The model receives: (what this token is) + (where this token is in time).
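The original Transformer paper (Vaswani et al., 2017) uses sine and cosine functions at different frequencies:
$$PE(pos, 2k) = \sin\!\left(\frac{pos}{10000^{2k/d}}\right), \qquad PE(pos, 2k+1) = \cos\!\left(\frac{pos}{10000^{2k/d}}\right)$$
where pos is the position index, k is the dimension index (each k covers one sine/cosine pair), and d is the embedding dimension. Each dimension oscillates at a different frequency. Low dimensions change rapidly (high frequency); high dimensions change slowly (low frequency). Together, they give each position a unique fingerprint.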
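Here is a minimal sketch of that encoding table, assuming the standard form from Vaswani et al.: dimension 2k holds sin(pos / 10000^(2k/d)) and dimension 2k+1 holds the matching cosine. The function name `sinusoidal_encoding` is illustrative, and the sketch assumes an even embedding dimension.

```python
import math

def sinusoidal_encoding(num_positions, d_model):
    """Return a num_positions x d_model table of sinusoidal encodings."""
    table = []
    for pos in range(num_positions):
        row = [0.0] * d_model
        for k in range(d_model // 2):
            angle = pos / (10000 ** (2 * k / d_model))
            row[2 * k] = math.sin(angle)        # even dimension: sine
            row[2 * k + 1] = math.cos(angle)    # odd dimension: cosine
        table.append(row)
    return table

pe = sinusoidal_encoding(num_positions=16, d_model=8)
print([round(v, 3) for v in pe[0]])   # position 0: every sine is 0, every cosine is 1
print([round(v, 3) for v in pe[1]])
```

In a model, row `pe[pos]` would be added element-wise to the embedding of the token at position `pos`.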
In NLP, position 0 is the first word. In signal prediction, position 0 is the first sample (time t=0). Position 1 is the sample at t = Ts (one sampling period later). The positional encoding gives the transformer a clock — it knows not just what each token value is, but when in the sequence it occurred.
Some models (GPT-2, GPT-3) learn the positional encoding as a trainable parameter matrix instead of using fixed sinusoids. Both work. For signals with a fixed maximum context length, learned encodings are common. For variable-length sequences, sinusoidal encodings have the advantage of being defined at every position, including positions beyond those seen in training, whereas a learned table has no entry past its maximum length.
Each row is a position (time step). Each column is a dimension. Color intensity = encoding value. Notice: low dimensions (left) oscillate fast, high dimensions (right) oscillate slowly.
We now have all the ingredients: tokenized signals, causal masking, and positional encoding. How do we actually train this model? The answer is teacher forcing — one of the most elegant tricks in sequence modeling.
Given a sequence of tokens [t0, t1, ..., tN−1], the training target is simple: predict the next token at every position.
| Input (sees) | Target (predicts) |
|---|---|
| [t0] | t1 |
| [t0, t1] | t2 |
| [t0, t1, t2] | t3 |
| ... | ... |
| [t0, ..., tN−2] | tN−1 |
Thanks to the causal mask, we get all N−1 predictions in one forward pass. Position i's output only depends on positions 0..i, so the prediction at position i is a valid prediction of token i+1. No sequential loop needed during training.
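In code, building these input/target pairs is just a one-position shift. The sketch below uses a made-up five-token sequence and plain lists; the loop is only there to print what each position is allowed to see, since training itself needs no such loop.

```python
tokens = [1, 5, 3, 6, 1]           # example token sequence t0..t4

inputs = tokens[:-1]               # t0 .. t(N-2): what the model is fed
targets = tokens[1:]               # t1 .. t(N-1): what it must predict

for i, target in enumerate(targets):
    visible = inputs[: i + 1]      # causal mask: position i sees 0..i only
    print(f"position {i}: sees {visible} -> target {target}")
```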
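At each position i, the model outputs a probability distribution over all B tokens (via softmax). The loss is the cross-entropy between the predicted distribution and the true next token, averaged over the N−1 positions:
$$\mathcal{L} = -\frac{1}{N-1} \sum_{i=0}^{N-2} \log p\left(t_{i+1} \mid t_0, \ldots, t_i\right)$$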
This is identical to the language model loss in GPT. The model learns to assign high probability to the actual next signal token.
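As a rough sketch, the loss can be computed directly from the model's raw scores (logits). The function `next_token_loss` and the logit values are hypothetical, and averaging over positions is an assumption here (summing is also used in practice).

```python
import math

def next_token_loss(logits, targets):
    """Mean cross-entropy over positions; logits[i] has one score per token."""
    total = 0.0
    for row, target in zip(logits, targets):
        row_max = max(row)
        # log of the softmax normalizer, computed stably
        log_norm = row_max + math.log(sum(math.exp(s - row_max) for s in row))
        total += log_norm - row[target]   # equals -log softmax(row)[target]
    return total / len(targets)

# Hypothetical logits for 3 positions over a vocabulary of B = 4 tokens.
logits = [[2.0, 0.5, 0.1, -1.0],
          [0.0, 0.0, 0.0, 0.0],
          [-1.0, 0.2, 3.0, 0.4]]
targets = [0, 3, 2]
print(round(next_token_loss(logits, targets), 4))
```

A useful sanity check: a model that outputs identical logits for every token is guessing uniformly, and its loss is log(B).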
Watch how the model sees ground truth inputs (orange) and makes predictions (teal) at each position. Slide through training steps to see predictions improve.
Training is done. The model has learned to predict the next signal token given all past tokens. Now we switch to inference mode: generating new signal samples by feeding predictions back into the model.
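Given a context (a sequence of observed signal tokens), generation works one step at a time: run the model on the current context, read the predicted distribution at the last position, pick a token from it, append that token to the context, and repeat until enough samples have been generated.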
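A minimal sketch of that step-by-step loop follows. The `toy_model` function is a made-up stand-in for a trained transformer (it simply favors a straight-line extrapolation of the last two tokens), and the loop picks the highest-scoring token at each step.

```python
def toy_model(context, vocab_size):
    """Hypothetical stand-in for a trained model: returns one logit per
    token, peaked where a straight-line extrapolation would land."""
    last = context[-1]
    prev = context[-2] if len(context) > 1 else last
    guess = max(0, min(vocab_size - 1, last + (last - prev)))
    return [-float((b - guess) ** 2) for b in range(vocab_size)]

def generate(context, num_steps, vocab_size):
    """Autoregressive loop: predict, pick, append, repeat."""
    sequence = list(context)
    for _ in range(num_steps):
        logits = toy_model(sequence, vocab_size)
        next_token = max(range(vocab_size), key=lambda b: logits[b])  # argmax
        sequence.append(next_token)   # the prediction becomes context
    return sequence

print(generate([1, 2, 3], num_steps=4, vocab_size=8))   # [1, 2, 3, 4, 5, 6, 7]
```

The key line is the `append`: unlike training, each new prediction is fed back in as if it were an observed sample.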
The model outputs a probability distribution over B tokens. How do we pick the next token?
| Strategy | How | Properties |
|---|---|---|
| Greedy (argmax) | Pick the highest-probability token | Deterministic, safe, but can be repetitive |
| Temperature sampling | Divide logits by T, then sample from softmax | T<1: more peaked (conservative). T>1: flatter (creative) |
| Top-k | Keep only the k highest-probability tokens, renormalize, sample | Prevents sampling very unlikely tokens |
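The three strategies in the table can be sketched in a few lines of standard-library Python. The function names and the example logits are illustrative only.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature, computed stably."""
    scaled = [s / temperature for s in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(logits):
    """Pick the highest-probability token."""
    return max(range(len(logits)), key=lambda b: logits[b])

def sample_temperature(logits, temperature, rng):
    """Divide logits by T, then sample from the softmax."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

def sample_top_k(logits, k, rng):
    """Keep the k highest-logit tokens, renormalize over them, then sample."""
    top = sorted(range(len(logits)), key=lambda b: logits[b], reverse=True)[:k]
    probs = softmax([logits[b] for b in top])
    return rng.choices(top, weights=probs, k=1)[0]

rng = random.Random(0)                       # seeded for repeatability
logits = [0.1, 2.0, 1.5, -1.0, 0.3]          # hypothetical model output, B = 5

print(greedy(logits))                        # 1
print([round(p, 3) for p in softmax(logits, temperature=0.5)])   # more peaked
print([round(p, 3) for p in softmax(logits, temperature=2.0)])   # flatter
print(sample_temperature(logits, 1.0, rng))
print(sample_top_k(logits, k=2, rng=rng))
```

With k = 2 the top-k sampler can only ever return token 1 or token 2 for these logits, which is exactly the point: the unlikely tokens are removed before sampling.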
During training, the model always saw ground truth context. During generation, it sees its own predictions — including mistakes. One wrong prediction shifts the context, leading to more errors. This is exposure bias. For short predictions (5–20 steps), it's usually fine. For long generations (hundreds of steps), the signal can drift into unrealistic territory.
Mitigation strategies include scheduled sampling (gradually replacing teacher tokens with model predictions during training) and beam search (maintaining multiple candidate sequences).
Watch a model generate signal tokens one at a time. The orange is the known context; teal is generated. Adjust temperature to see how it affects the continuation.
Autoregressive signal prediction isn't new. Classical AR(p) models have been used since the 1920s. So what does the transformer bring to the table?
An AR model of order p predicts the next sample as a linear combination of the past p samples:
$$x[n] = \sum_{k=1}^{p} a_k \, x[n-k] + \varepsilon[n]$$
where a1, ..., ap are fixed coefficients and ε is white noise. This is elegant and efficient, but it assumes the signal is generated by a linear process. It works for speech (which is well-modeled as a linear filter driven by noise), but struggles with complex, nonlinear patterns.
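For comparison with the transformer, here is a minimal sketch of AR prediction, assuming the form x[n] = a1·x[n−1] + ... + ap·x[n−p] with the noise term left out (so the output is the model's best guess). The helper `ar_predict` is illustrative. The example relies on the fact that a pure sinusoid satisfies x[n] = 2cos(ω)·x[n−1] − x[n−2] exactly, so an AR(2) model with those two coefficients predicts it without error.

```python
import math

def ar_predict(history, coeffs):
    """Predict the next sample as sum over k of coeffs[k-1] * x[n-k]."""
    p = len(coeffs)
    recent = history[-p:][::-1]          # x[n-1], x[n-2], ..., x[n-p]
    return sum(a * x for a, x in zip(coeffs, recent))

w = 0.3                                   # angular frequency (radians/sample)
signal = [math.sin(w * n) for n in range(50)]
coeffs = [2 * math.cos(w), -1.0]          # a1, a2 for a pure sinusoid

prediction = ar_predict(signal[:49], coeffs)
print(prediction, signal[49])             # the two values agree closely
```

Two multiplications per prediction is the whole cost, which is why AR models remain attractive when the signal really is close to linear.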
The transformer does the same thing conceptually — predict the next token from the past — but with key differences:
| Property | AR(p) | Transformer |
|---|---|---|
| Prediction | Linear combination of past p values | Nonlinear function of entire context |
| Context | Fixed window of p samples | Full sequence (up to context length) |
| Dependencies | Short-range only | Long-range via attention |
| Signal type | Continuous values | Quantized discrete tokens |
| Output | Single predicted value | Full distribution over vocabulary |
| Training data | One signal (fit coefficients) | Many signals (learn general patterns) |
| Parameters | p coefficients | Millions (embedding + attention + FFN) |
Use AR(p) when: the signal is approximately linear, you have limited data, you need real-time performance (p multiplications vs. a full transformer forward pass), or you need interpretable coefficients (e.g., formant frequencies from LPC coefficients).
Use transformers when: the signal has complex nonlinear structure, you have lots of training data, you want to model uncertainty (multi-modal predictions), or you're generating creative outputs (music, speech synthesis).
Compare a linear AR(4) model (purple) with a simulated transformer prediction (teal) on a nonlinear signal. The AR model captures the trend but misses the fine structure.
Time to put it all together. Below is an interactive signal predictor. Draw a signal on the left canvas (click and drag), and a simulated transformer will predict its continuation on the right. You can adjust the vocabulary size, temperature, and prediction length.
Draw a signal by clicking and dragging on the canvas. Then click Predict to see the transformer-style continuation. Try smooth waves, sharp spikes, or random noise — see how the predictor responds to each.
The same signal shown as discrete tokens. Orange = drawn (context), teal = predicted. Notice the quantization staircase — the transformer operates on these discrete levels, not the smooth curve.
At each predicted step, the model outputs a distribution over all B tokens. The bar chart shows this distribution for the selected step. Peaked = confident. Flat = uncertain.