Demand, energy, weather, prices — predicting what comes next, with its uncertainty. From decomposition and probabilistic forecasts to N-BEATS, the Temporal Fusion Transformer, PatchTST, and zero-shot foundation models for time series.
Every plan depends on the future. How much inventory to stock, how much power to generate, how many staff to schedule, when to buy or sell — all hinge on what will happen. Time-series forecasting is predicting future values of a sequence from its past. And the decisions that ride on it need two things, not one: a prediction of what will happen, and an honest measure of how sure we are.
Classical methods (ARIMA, exponential smoothing) fit a separate statistical model to each individual series — effective but limited: they can’t share knowledge across related series, struggle with complex patterns and external factors, and scale poorly to millions of series. The deep-learning era brought global models trained across many series at once, probabilistic forecasts that quantify uncertainty, and now foundation models that forecast new series zero-shot. This lesson builds the modern toolkit, from decomposition to transformers.
History (solid), then the forecast (dashed) with a prediction interval (shaded fan) that widens into the future. Drag the horizon: the further ahead, the wider the uncertainty — the honest shape of a forecast.
The basic framing: you have a lookback window — the recent history the model sees — and you predict a forecast horizon — some number of steps into the future. A model maps the lookback to the horizon. Longer lookback gives more context; longer horizon is harder (uncertainty compounds). The lengths are key design choices.
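The lookback/horizon framing above amounts to slicing a series into input/target pairs. The sketch below is a minimal illustration using NumPy; the helper name `make_windows` and the toy series are assumptions made for the example, not part of any particular library.

```python
import numpy as np

def make_windows(series, lookback, horizon):
    """Slice a 1-D series into (lookback, horizon) training pairs."""
    series = np.asarray(series, dtype=float)
    n = len(series) - lookback - horizon + 1
    if n <= 0:
        raise ValueError("series too short for this lookback + horizon")
    # Row i of X is the history the model sees; row i of Y is what it must predict.
    X = np.stack([series[i : i + lookback] for i in range(n)])
    Y = np.stack([series[i + lookback : i + lookback + horizon] for i in range(n)])
    return X, Y

series = np.arange(10.0)  # toy series 0, 1, ..., 9
X, Y = make_windows(series, lookback=4, horizon=2)
print(X.shape, Y.shape)
print(X[0], "->", Y[0])
```

A longer lookback or horizon leaves fewer training pairs from the same history, which is one reason the two lengths are design choices rather than free parameters.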
There are two ways to produce the horizon. Autoregressive: predict one step, feed it back in, predict the next, and so on (like a language model) — flexible, but errors compound over long horizons as the model feeds on its own mistakes. Direct multi-horizon: predict the entire horizon at once in a single shot — avoids compounding, and is what many modern models (N-BEATS, PatchTST) do. The output can be a point (one value per step) or, better, a distribution (a range per step). Those choices — lookback length, horizon length, autoregressive vs direct, point vs probabilistic — define the forecasting problem.
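To make the two strategies concrete, here is a small sketch that uses plain linear least-squares models as stand-ins for a neural network. Everything in it (the synthetic sine series, the helper names, the lookback of 48 and horizon of 12) is an assumption chosen for illustration; only the contrast between feeding predictions back in and predicting the whole horizon at once is the point.

```python
import numpy as np

def windows(series, lookback, horizon):
    n = len(series) - lookback - horizon + 1
    X = np.stack([series[i : i + lookback] for i in range(n)])
    Y = np.stack([series[i + lookback : i + lookback + horizon] for i in range(n)])
    return X, Y

rng = np.random.default_rng(0)
t = np.arange(300)
series = np.sin(2 * np.pi * t / 24) + 0.1 * rng.standard_normal(300)
lookback, horizon = 48, 12
train, actual = series[:-horizon], series[-horizon:]

# Autoregressive: fit a one-step model, then feed each prediction back in.
X1, y1 = windows(train, lookback, 1)
w_one, *_ = np.linalg.lstsq(X1, y1[:, 0], rcond=None)

def forecast_autoregressive(history, steps):
    window = list(history[-lookback:])
    out = []
    for _ in range(steps):
        nxt = float(np.dot(w_one, window[-lookback:]))
        out.append(nxt)
        window.append(nxt)  # the model now conditions on its own output
    return np.array(out)

# Direct multi-horizon: one model maps the lookback to all steps at once.
XH, YH = windows(train, lookback, horizon)
W_direct, *_ = np.linalg.lstsq(XH, YH, rcond=None)  # shape (lookback, horizon)

def forecast_direct(history):
    return history[-lookback:] @ W_direct

ar = forecast_autoregressive(train, horizon)
direct = forecast_direct(train)
print("autoregressive MAE:", np.abs(ar - actual).mean())
print("direct MAE:        ", np.abs(direct - actual).mean())
```

Which of the two wins depends on the data and the model; the sketch only shows the mechanics, not a general ranking.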
The model sees the lookback window (teal) and predicts the horizon (orange). Drag the split: more lookback gives more context; a longer horizon is a harder ask.
Most time series are built from a few interpretable components. Trend: the slow long-term direction (sales growing year over year). Seasonality: repeating cycles (daily traffic peaks, weekly patterns, yearly holidays). Residual / noise: the irregular leftover. Classically, a series is modeled as the sum (or product) of these: trend + seasonality + residual.
This decomposition is the conceptual backbone of forecasting. If you can identify the trend and the seasonal pattern, you’ve captured most of what’s predictable — extrapolate the trend, repeat the seasonality, and the residual is the genuinely uncertain part. Many models bake this in explicitly: N-BEATS has dedicated trend and seasonality blocks; classical methods fit them separately. Even when a deep model learns it implicitly, thinking in these terms tells you what is forecastable: structure (trend, seasonality) is learnable; pure noise is not.
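The "extrapolate the trend, repeat the seasonality" recipe can be sketched in a few lines. This is a deliberately simple version that assumes an additive series, a linear trend, and a known period; the function names `decompose_additive` and `extrapolate` and the synthetic data are illustrative assumptions, and real decomposition routines are more careful.

```python
import numpy as np

def decompose_additive(y, period):
    """Split y into linear trend + repeating seasonal profile + residual."""
    y = np.asarray(y, dtype=float)
    t = np.arange(len(y))
    # Averaging over one full period cancels the seasonal cycle, leaving the trend.
    smooth = np.convolve(y, np.ones(period) / period, mode="valid")
    centers = np.arange(len(smooth)) + (period - 1) / 2
    slope, intercept = np.polyfit(centers, smooth, deg=1)
    trend = intercept + slope * t
    # Seasonal profile: average detrended value at each position in the cycle.
    detrended = y - trend
    profile = np.array([detrended[t % period == p].mean() for p in range(period)])
    profile -= profile.mean()
    seasonal = profile[t % period]
    residual = y - trend - seasonal
    return trend, seasonal, residual, (slope, intercept, profile)

def extrapolate(n_history, horizon, slope, intercept, profile):
    """Forecast = extended trend line + repeated seasonal profile."""
    t_future = np.arange(n_history, n_history + horizon)
    return intercept + slope * t_future + profile[t_future % len(profile)]

rng = np.random.default_rng(0)
t = np.arange(120)
y = 0.5 * t + 10 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 1, t.size)
trend, seasonal, residual, params = decompose_additive(y, period=12)
forecast = extrapolate(len(y), 24, *params)
print("residual std:", residual.std())
```

The residual that remains after removing trend and seasonality is the part the forecast cannot reproduce, which is why it shows up as uncertainty rather than as a predicted value.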
A series broken into its parts. Toggle the components on and off and watch them sum into the full signal. Forecasting extrapolates trend + seasonality; the noise is the irreducible uncertainty.
A good forecaster outputs a distribution, not a point. There are two common ways. Quantile forecasting: predict several quantiles directly — e.g. the 10th, 50th, and 90th percentiles — trained with the pinball (quantile) loss, which asymmetrically penalizes over- and under-prediction so each output learns its target percentile. The 10th–90th range is then an 80% prediction interval. Parametric: predict the parameters of a distribution (e.g. a Gaussian’s mean and variance) and train by maximizing likelihood — DeepAR does this, then samples to get intervals.
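The pinball loss mentioned above is short enough to write out. The sketch below implements it with NumPy and then checks, on synthetic data, that the constant prediction minimizing the loss at q = 0.9 sits near the empirical 90th percentile. The function name, the exponential sample, and the search grid are assumptions for illustration.

```python
import numpy as np

def pinball_loss(y_true, y_pred, q):
    """Quantile (pinball) loss for target quantile q in (0, 1)."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    # Under-prediction (diff > 0) costs q per unit; over-prediction costs (1 - q).
    return float(np.mean(np.maximum(q * diff, (q - 1) * diff)))

rng = np.random.default_rng(0)
y = rng.exponential(scale=1.0, size=2000)

# Search over constant predictions: the minimizer should be the 0.9 quantile.
candidates = np.linspace(0.0, 10.0, 1001)
losses = [pinball_loss(y, c, q=0.9) for c in candidates]
best = candidates[int(np.argmin(losses))]
print("loss-minimizing constant:", best)
print("empirical 90th percentile:", np.quantile(y, 0.9))
```

Training one output per quantile (for example 0.1, 0.5, 0.9) with this loss is what lets each output settle on its own percentile.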
Why it matters so much: decisions are asymmetric. The cost of understocking (lost sales) often differs from the cost of overstocking (waste). With a full predictive distribution you can choose the quantile that matches your costs: stock to the 90th percentile if stockouts are expensive, the 50th if the two costs are balanced. A point forecast throws this away. Uncertainty also grows with horizon (further out means less certain), which the distribution captures as widening intervals. Predicting the distribution, not just the mean, is arguably the biggest practical upgrade in forecasting.
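One way to turn costs into a quantile is the classical newsvendor critical ratio: if each unit short costs `cost_under` and each unit left over costs `cost_over`, the cost-minimizing stock level is the quantile q = cost_under / (cost_under + cost_over). The sketch below assumes those linear per-unit costs; the cost figures and the gamma-distributed demand samples are made up for illustration and stand in for samples from a probabilistic forecast.

```python
import numpy as np

def cost_optimal_quantile(cost_under, cost_over):
    """Newsvendor critical ratio under linear per-unit costs."""
    return cost_under / (cost_under + cost_over)

def expected_cost(stock, demand_samples, cost_under, cost_over):
    short = np.maximum(demand_samples - stock, 0.0)   # unmet demand
    excess = np.maximum(stock - demand_samples, 0.0)  # unsold units
    return float(np.mean(cost_under * short + cost_over * excess))

rng = np.random.default_rng(0)
demand = rng.gamma(shape=5.0, scale=20.0, size=5000)  # hypothetical forecast samples

q = cost_optimal_quantile(cost_under=9.0, cost_over=1.0)  # stockouts 9x as costly
stock_q = np.quantile(demand, q)
stock_median = np.quantile(demand, 0.5)
print("target quantile:", q)
print("cost at that quantile:", expected_cost(stock_q, demand, 9.0, 1.0))
print("cost at the median:   ", expected_cost(stock_median, demand, 9.0, 1.0))
```

With equal costs the ratio is 0.5 and the median is the right target, which matches the balanced case described above.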
The model predicts several quantiles (the median line plus shaded 50% and 80% intervals). Drag the “noise level”: more uncertainty widens the intervals. Pick the quantile that matches whether under- or over-prediction costs you more.
The deep-learning era’s biggest shift: train one model across many series, not one model per series. A retailer with a million products doesn’t fit a million ARIMA models — it trains a single global network on all of them. The model learns patterns shared across series (weekly seasonality, holiday spikes, promotion effects) and transfers them.
The payoffs are huge. Data efficiency: a new or short series (a just-launched product) borrows strength from similar series — the cold-start problem softens. Shared structure: the model amortizes learning of common patterns once, instead of relearning per series. Scale: one model to train, deploy, and maintain for millions of series. DeepAR (Amazon, 2017) pioneered this — a global autoregressive RNN producing probabilistic forecasts — and nearly every modern method is global. The catch: series must be made comparable (normalized for scale), since one product sells 5 units and another 50,000.
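Making series comparable usually means dividing each one by a scale computed from its own history and multiplying the forecast back afterwards. The sketch below uses the mean absolute value as the scale, which is one common choice among several; the helper names and the two toy series are assumptions for illustration.

```python
import numpy as np

def scale_series(history, eps=1e-8):
    """Divide a series by its mean absolute value; return the scale for later."""
    history = np.asarray(history, dtype=float)
    scale = np.mean(np.abs(history)) + eps  # eps guards against all-zero series
    return history / scale, scale

def unscale(forecast_scaled, scale):
    """Map a forecast made in scaled units back to the original units."""
    return np.asarray(forecast_scaled, dtype=float) * scale

small = [4, 5, 6, 5]                  # a product selling about 5 units
large = [40000, 50000, 60000, 50000]  # a product selling about 50,000

small_scaled, small_scale = scale_series(small)
large_scaled, large_scale = scale_series(large)
print(small_scaled)
print(large_scaled)
```

After scaling, the two series look the same to the model, so a pattern learned on one can transfer to the other.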
Many related series (faint) share one global model, which learns common patterns. A short/new series (highlighted) borrows strength from the others — far better than fitting it alone. Drag to add more series and watch the shared model sharpen.
N-BEATS (2019) showed a pure MLP architecture — no recurrence, no attention — could beat statistical methods. Its design is a stack of blocks, each of which does something clever: it produces both a backcast (its best reconstruction of the lookback) and a forecast (its contribution to the future). The backcast is subtracted from the input, so the next block works only on the residual — what previous blocks couldn’t explain. The final forecast is the sum of all blocks’ forecasts.
This residual stacking is the same idea as boosting or ResNets: each block refines what the last left behind. And N-BEATS has an interpretable variant where blocks are constrained to specific basis functions — one stack outputs only smooth trend shapes (polynomials), another only periodic seasonality shapes (sinusoids). The decomposition from Chapter 2 becomes explicit: you can read off the learned trend and seasonality separately. N-BEATS proved deep forecasting doesn’t need fancy architectures — well-structured MLPs with residual basis decomposition are remarkably strong.
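The residual wiring can be illustrated without a neural network. In the sketch below each block fits basis coefficients to the current residual by least squares, whereas in N-BEATS a trained MLP predicts those coefficients; so this is a toy that shows only the backcast/forecast bookkeeping, not the actual model. The polynomial and sinusoidal bases, the function names, and the synthetic series are all assumptions for illustration.

```python
import numpy as np

def trend_basis(t, degree=1):
    return np.stack([t ** d for d in range(degree + 1)], axis=1)

def seasonal_basis(t, period, harmonics=1):
    cols = []
    for k in range(1, harmonics + 1):
        cols += [np.sin(2 * np.pi * k * t / period), np.cos(2 * np.pi * k * t / period)]
    return np.stack(cols, axis=1)

def run_block(residual, basis_back, basis_fore):
    # Toy stand-in for the block's MLP: least-squares coefficients on the residual.
    theta, *_ = np.linalg.lstsq(basis_back, residual, rcond=None)
    return basis_back @ theta, basis_fore @ theta  # (backcast, forecast)

def stacked_forecast(lookback, horizon, period, n_rounds=4):
    lookback = np.asarray(lookback, dtype=float)
    L = len(lookback)
    t_back = np.arange(L, dtype=float)
    t_fore = np.arange(L, L + horizon, dtype=float)
    blocks = [
        (trend_basis(t_back / L), trend_basis(t_fore / L)),
        (seasonal_basis(t_back, period), seasonal_basis(t_fore, period)),
    ] * n_rounds
    residual = lookback.copy()
    forecast = np.zeros(horizon)
    for basis_back, basis_fore in blocks:
        backcast, block_forecast = run_block(residual, basis_back, basis_fore)
        residual = residual - backcast        # next block sees only what is left
        forecast = forecast + block_forecast  # final forecast is the sum
    return forecast, residual

t = np.arange(108)
y = 0.05 * t + 2 * np.sin(2 * np.pi * t / 12)
forecast, leftover = stacked_forecast(y[:96], horizon=12, period=12)
print("max forecast error:", np.abs(forecast - y[96:]).max())
```

Because each block's forecast is kept separately before summing, the trend and seasonal contributions can be inspected on their own, which is the interpretability the constrained variant offers.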
A trend block fits the smooth shape; the residual passes to a seasonality block that fits the cycles; their forecasts sum to the prediction. Drag the number of blocks — each refines the leftover, and the forecast (orange) closes on the target (teal).
Transformers came to time series, with a twist. Naively, attention over every timestep is expensive and — surprisingly — not obviously better than simple baselines. The breakthrough was PatchTST: chop the series into patches (like a Vision Transformer chops an image), and run the transformer over patches, not raw timesteps. Patching shortens the sequence (cheaper attention), and lets each token capture a local sub-pattern. PatchTST is also channel-independent (each variable forecast separately with shared weights), which proved a strong, simple recipe.
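Patching itself is a simple reshaping step. The sketch below cuts a lookback window into fixed-length patches with a chosen stride using NumPy; each row would then be embedded as one token. The helper name `make_patches` is an assumption, and details such as end-padding that a full implementation may apply are left out.

```python
import numpy as np

def make_patches(series, patch_len, stride):
    """Cut a 1-D series into patches of length patch_len, stepping by stride."""
    series = np.asarray(series, dtype=float)
    all_windows = np.lib.stride_tricks.sliding_window_view(series, patch_len)
    return all_windows[::stride].copy()

lookback = np.arange(16.0)

non_overlapping = make_patches(lookback, patch_len=4, stride=4)
overlapping = make_patches(lookback, patch_len=4, stride=2)
print(non_overlapping.shape)  # tokens x patch_len
print(overlapping.shape)
```

With lookback length L, patch length P, and stride S, this yields floor((L - P) / S) + 1 tokens, so attention runs over far fewer positions than the L raw timesteps.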
The Temporal Fusion Transformer (TFT) takes a different tack: attention plus gating, designed to handle covariates (known-future inputs like holidays and promotions) and output quantiles, with interpretable attention over time. The frontier is foundation models for time series (TimesFM, Chronos, Moirai), pretrained on enormous, diverse time-series corpora so they can forecast a brand-new series zero-shot, with no training on that series, much as an LLM continues text it has never seen. This is the same trajectory as the rest of AI: from bespoke per-task models to large pretrained models that generalize.
PatchTST chops the series into patches (segments), each becoming one transformer token — shorter sequence, local sub-patterns. Drag the patch size: bigger patches = fewer, coarser tokens; smaller = more, finer.
Build a series from trend, seasonality, and noise, then forecast it with a prediction interval. Adjust the components and the horizon, and watch how the forecast extrapolates the structure while the interval widens with both noise and distance. This is forecasting in one picture: predict the predictable, quantify the rest.
Set trend strength, seasonality, and noise. The model fits the history (teal) and forecasts the horizon (orange) with an 80% interval (shaded). More noise → wider band; longer horizon → the band fans out. The structure is extrapolated; the noise becomes uncertainty.
Notice: when seasonality dominates, the forecast confidently repeats the cycle. When noise dominates, the interval balloons — the model honestly admits it can’t predict randomness. That honesty is the whole game; a forecaster that’s confidently wrong is worse than one that’s usefully unsure.
Real forecasting uses covariates: extra information beyond the target’s past, such as known-future inputs like holidays and promotions.
The hard challenges: non-stationarity (the pattern itself changes — a pandemic, a new competitor — breaking models trained on the old regime); varying scale (series span orders of magnitude, so normalization is essential); distribution shift; long horizons (compounding error); and irregular sampling. As for what to use: classical methods (ARIMA, ETS) are still great for few, well-behaved series; N-BEATS / PatchTST for many series with rich patterns; TFT when covariates and interpretability matter; foundation models (TimesFM) for zero-shot or cold-start. Match the tool to the data, the horizon, and whether you need uncertainty and explanations.
A holiday (marker) causes a spike. Without the covariate the model misses it (orange); with the known-future covariate it anticipates the spike (teal). Drag to move the holiday and watch the informed forecast track it.
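The holiday effect described above can be reproduced with a small regression. The sketch below fits the same linear model twice, once with and once without a holiday indicator that is known in advance from the calendar. The synthetic demand series, the holiday dates, and the size of the spike are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_hist, horizon = 200, 14
t = np.arange(n_hist + horizon)

# Known-future covariate: holiday dates come from the calendar, so the
# indicator is available for the forecast horizon as well as the history.
holiday = np.zeros(n_hist + horizon)
holiday[[30, 90, 150, 205]] = 1.0  # day 205 lies inside the horizon
demand = 50 + 5 * np.sin(2 * np.pi * t / 7) + 30 * holiday + rng.normal(0, 1, t.size)

def design(t, holiday=None):
    cols = [np.ones_like(t, dtype=float),
            np.sin(2 * np.pi * t / 7),
            np.cos(2 * np.pi * t / 7)]
    if holiday is not None:
        cols.append(holiday)
    return np.stack(cols, axis=1)

def fit_and_forecast(use_holiday):
    X = design(t, holiday if use_holiday else None)
    coef, *_ = np.linalg.lstsq(X[:n_hist], demand[:n_hist], rcond=None)
    return X[n_hist:] @ coef  # forecast for the horizon rows

blind = fit_and_forecast(use_holiday=False)
informed = fit_and_forecast(use_holiday=True)

day = 205 - n_hist  # position of the future holiday within the horizon
print("actual:  ", demand[205])
print("blind:   ", blind[day])
print("informed:", informed[day])
```

The model without the covariate has no way to know the spike is coming, while the informed model has seen past holidays and is told when the next one falls.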
| Method | Type | Strength |
|---|---|---|
| ARIMA / ETS | classical, per-series | few well-behaved series |
| DeepAR | global RNN, probabilistic | many series + uncertainty |
| N-BEATS | residual MLP stacks | strong, interpretable (trend/seasonality) |
| TFT | attention + gating | covariates, quantiles, interpretable |
| PatchTST | patch transformer | strong, simple, long horizons |
| TimesFM / Chronos | foundation model | zero-shot, cold-start |
→ Embedding / Patch Layers — the patching PatchTST uses
→ Attention Variants — the attention in TFT/PatchTST
→ Gaussian Processes — another principled uncertainty tool
→ Kalman Filter — classical recursive state forecasting