Time-Series Forecasting

Demand, energy, weather, prices — predicting what comes next, with its uncertainty. From decomposition and probabilistic forecasts to N-BEATS, the Temporal Fusion Transformer, PatchTST, and zero-shot foundation models for time series.

Prerequisites: a time series is a sequence of values over time, and a model maps inputs to outputs. That’s it.

Chapter 0: Why Forecast

Every plan depends on the future. How much inventory to stock, how much power to generate, how many staff to schedule, when to buy or sell — all hinge on what will happen. Time-series forecasting is predicting future values of a sequence from its past. And the decisions that ride on it need two things, not one: a prediction of what will happen, and an honest measure of how sure we are.

Classical methods (ARIMA, exponential smoothing) fit a separate statistical model to each individual series — effective but limited: they can’t share knowledge across related series, struggle with complex patterns and external factors, and scale poorly to millions of series. The deep-learning era brought global models trained across many series at once, probabilistic forecasts that quantify uncertainty, and now foundation models that forecast new series zero-shot. This lesson builds the modern toolkit, from decomposition to transformers.

The trap: “a forecast is a single number.” A point forecast with no uncertainty is dangerous — “sell 100 units” is useless without knowing whether the real number is 90–110 or 10–500. Good forecasting predicts a distribution: a range with probabilities, so you can plan for the likely and hedge the risky. Uncertainty isn’t a footnote; it’s half the answer.

A forecast with its uncertainty

History (solid), then the forecast (dashed) with a prediction interval (shaded fan) that widens into the future. Drag the horizon: the further ahead, the wider the uncertainty — the honest shape of a forecast.

forecast horizon: 0.50

Why is a single-number (point) forecast often insufficient?

It uses too much memory
Decisions need the uncertainty too: a range with probabilities lets you plan for the likely and hedge the risky
Point forecasts can’t be computed

Chapter 1: The Setup

The basic framing: you have a lookback window — the recent history the model sees — and you predict a forecast horizon — some number of steps into the future. A model maps the lookback to the horizon. Longer lookback gives more context; longer horizon is harder (uncertainty compounds). The lengths are key design choices.
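
The lookback/horizon framing can be sketched as a sliding-window split. This is a minimal illustration under assumptions: the function name `make_windows` and the toy series are hypothetical, and real pipelines usually add batching and normalization on top.

```python
from typing import List, Tuple

Window = Tuple[List[float], List[float]]


def make_windows(series: List[float], lookback: int, horizon: int) -> List[Window]:
    """Slide over a series and return (lookback, horizon) training pairs."""
    if lookback <= 0 or horizon <= 0:
        raise ValueError("lookback and horizon must be positive")
    pairs: List[Window] = []
    # The last valid start still leaves room for a full lookback plus a full horizon.
    for start in range(len(series) - lookback - horizon + 1):
        split = start + lookback
        pairs.append((series[start:split], series[split:split + horizon]))
    return pairs


# Toy series 0..9 (illustrative data, not from the lesson).
series = [float(v) for v in range(10)]
pairs = make_windows(series, lookback=4, horizon=2)
print(len(pairs), pairs[0])
```

Each pair is one training example: the model sees the first list and is scored on the second. A series shorter than lookback + horizon simply yields no pairs.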

There are two ways to produce the horizon. Autoregressive: predict one step, feed it back in, predict the next, and so on (like a language model) — flexible, but errors compound over long horizons as the model feeds on its own mistakes. Direct multi-horizon: predict the entire horizon at once in a single shot — avoids compounding, and is what many modern models (N-BEATS, PatchTST) do. The output can be a point (one value per step) or, better, a distribution (a range per step). Those choices — lookback length, horizon length, autoregressive vs direct, point vs probabilistic — define the forecasting problem.
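
To make the two strategies concrete, here is a small sketch. Everything in it is an assumption chosen for illustration: the true process decays by a factor of 0.9 per step, the one-step model has a slightly wrong coefficient (0.92), and the direct model is given the same 0.02 coefficient error at every step so that the only difference between the two is the feedback loop.

```python
from typing import Callable, List


def forecast_autoregressive(one_step: Callable[[List[float]], float],
                            lookback: List[float], horizon: int) -> List[float]:
    """Predict one step, append it to the window, and repeat."""
    window = list(lookback)
    out: List[float] = []
    for _ in range(horizon):
        nxt = one_step(window)
        out.append(nxt)
        window = window[1:] + [nxt]  # the model now feeds on its own prediction
    return out


def forecast_direct(multi_step: Callable[[List[float]], List[float]],
                    lookback: List[float]) -> List[float]:
    """Predict the whole horizon in a single call."""
    return list(multi_step(lookback))


TRUE_PHI = 0.9    # assumed true dynamics: x[t+1] = 0.9 * x[t]
MODEL_PHI = 0.92  # hypothetical, slightly wrong one-step model
HORIZON = 5
lookback = [1.0]


def one_step_model(window: List[float]) -> float:
    return MODEL_PHI * window[-1]


def direct_model(window: List[float]) -> List[float]:
    # Hypothetical direct model: one coefficient per step, each off by 0.02.
    return [(TRUE_PHI ** h + 0.02) * window[-1] for h in range(1, HORIZON + 1)]


truth = [TRUE_PHI ** h * lookback[-1] for h in range(1, HORIZON + 1)]
ar = forecast_autoregressive(one_step_model, lookback, HORIZON)
direct = forecast_direct(direct_model, lookback)
ar_err = [abs(p - t) for p, t in zip(ar, truth)]
direct_err = [abs(p - t) for p, t in zip(direct, truth)]
print("autoregressive errors:", [round(e, 4) for e in ar_err])
print("direct errors:        ", [round(e, 4) for e in direct_err])
```

In this toy setup the autoregressive error grows at every step while the direct error stays flat. Real direct models have their own trade-offs (for example, a fixed horizon length), so treat this as a picture of the compounding mechanism only.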

Lookback → horizon

The model sees the lookback window (teal) and predicts the horizon (orange). Drag the split: more lookback gives more context; a longer horizon is a harder ask.

lookback / horizon split: 0.65

What is a downside of autoregressive (one-step-at-a-time) forecasting over a long horizon?

It can only predict one series
Errors compound: the model feeds on its own mistakes, so they accumulate over the horizon
It cannot use a lookback window

Chapter 2: Decomposition — the structure of time

Most time series are built from a few interpretable components. Trend: the slow long-term direction (sales growing year over year). Seasonality: repeating cycles (daily traffic peaks, weekly patterns, yearly holidays). Residual / noise: the irregular leftover. Classically, a series is modeled as the sum (or product) of these: trend + seasonality + residual.

This decomposition is the conceptual backbone of forecasting. If you can identify the trend and the seasonal pattern, you’ve captured most of what’s predictable — extrapolate the trend, repeat the seasonality, and the residual is the genuinely uncertain part. Many models bake this in explicitly: N-BEATS has dedicated trend and seasonality blocks; classical methods fit them separately. Even when a deep model learns it implicitly, thinking in these terms tells you what is forecastable: structure (trend, seasonality) is learnable; pure noise is not.
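
A classical additive decomposition can be sketched in a few lines. This version is a simplification under stated assumptions: the seasonal period is known and odd, the trend is a centered moving average over one full cycle, and seasonality is the average detrended value at each phase. The function name and the toy series are hypothetical.

```python
from typing import Dict, List


def decompose_additive(series: List[float], period: int) -> Dict[str, list]:
    """Split a series into trend + seasonal + residual (interior points only).

    The centered moving average is undefined within period // 2 steps of
    either end, so those edge points are left out of the result.
    """
    if period < 3 or period % 2 == 0:
        raise ValueError("this sketch assumes an odd period >= 3")
    half = period // 2
    idx = list(range(half, len(series) - half))
    if len(idx) < period:
        raise ValueError("series too short for this period")
    # Trend: average over exactly one full cycle centered on each point.
    trend = [sum(series[t - half:t + half + 1]) / period for t in idx]
    detrended = [series[t] - tr for t, tr in zip(idx, trend)]
    # Seasonality: mean detrended value at each phase of the cycle.
    by_phase: List[List[float]] = [[] for _ in range(period)]
    for t, d in zip(idx, detrended):
        by_phase[t % period].append(d)
    phase_mean = [sum(v) / len(v) for v in by_phase]
    offset = sum(phase_mean) / period  # center so a full cycle sums to zero
    phase_mean = [m - offset for m in phase_mean]
    seasonal = [phase_mean[t % period] for t in idx]
    residual = [series[t] - tr - s for t, tr, s in zip(idx, trend, seasonal)]
    return {"index": idx, "trend": trend, "seasonal": seasonal, "residual": residual}


# Toy series: linear trend plus a zero-sum 5-step cycle, no noise (illustrative).
pattern = [2.0, -1.0, 0.0, 1.0, -2.0]
series = [0.5 * t + pattern[t % 5] for t in range(30)]
parts = decompose_additive(series, period=5)
print(round(parts["trend"][0], 3), [round(s, 3) for s in parts["seasonal"][:5]])
```

On this noise-free toy series the residual comes out at essentially zero, which is the point of the chapter: once trend and seasonality are removed, whatever is left is the part that cannot be extrapolated.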

Signal = trend + seasonality + noise

A series broken into its parts. Toggle the components on and off and watch them sum into the full signal. Forecasting extrapolates trend + seasonality; the noise is the irreducible uncertainty.

What are the classic components a time series decomposes into?

Mean, median, mode
Trend (long-term direction) + seasonality (repeating cycles) + residual (noise)
Input, hidden, output

Chapter 3: Probabilistic Forecasting

A good forecaster outputs a distribution, not a point. There are two common ways. Quantile forecasting: predict several quantiles directly — e.g. the 10th, 50th, and 90th percentiles — trained with the pinball (quantile) loss, which asymmetrically penalizes over- and under-prediction so each output learns its target percentile. The 10th–90th range is then an 80% prediction interval. Parametric: predict the parameters of a distribution (e.g. a Gaussian’s mean and variance) and train by maximizing likelihood — DeepAR does this, then samples to get intervals.
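
The pinball loss is short enough to write out. The sketch below uses the standard definition; the brute-force search over constant predictions is only a demonstration device (an assumption of this example, not how a network is trained) to show that minimizing the loss at level q lands on the q-th percentile of the data.

```python
from typing import List


def pinball_loss(y_true: float, y_pred: float, q: float) -> float:
    """Quantile loss: under-prediction costs q per unit, over-prediction 1 - q."""
    if not 0.0 < q < 1.0:
        raise ValueError("q must be strictly between 0 and 1")
    diff = y_true - y_pred
    return q * diff if diff >= 0 else (q - 1.0) * diff


def mean_pinball(sample: List[float], y_pred: float, q: float) -> float:
    return sum(pinball_loss(y, y_pred, q) for y in sample) / len(sample)


def best_constant(sample: List[float], q: float) -> float:
    """Constant prediction (chosen from the sample) with the lowest mean loss."""
    return min(sample, key=lambda c: mean_pinball(sample, c, q))


# Toy outcomes 1..99 (illustrative data).
sample = [float(v) for v in range(1, 100)]
q10, q50, q90 = (best_constant(sample, q) for q in (0.1, 0.5, 0.9))
print(q10, q50, q90)
```

With q = 0.9, predicting 10 units too low costs nine times as much as predicting 10 units too high, so the minimizer is pushed up until roughly 90% of outcomes fall below it.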

Why it matters so much: decisions are asymmetric. The cost of understocking (lost sales) often differs from the cost of overstocking (waste). With a full predictive distribution you can choose the quantile that matches your costs: stock to the 90th percentile if stockouts are expensive, or the 50th if the two costs are balanced. A point forecast throws this away. Uncertainty also grows with horizon (further out = less certain), which the distribution captures as widening intervals. Predicting the distribution, not just the mean, is often the single biggest practical upgrade in forecasting.
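
One way to turn cost asymmetry into a quantile is the classic newsvendor rule: target the quantile equal to the under-prediction cost divided by the sum of both costs. The sketch below assumes per-unit costs are constant and that the forecast is available as a list of demand samples; the function names and numbers are hypothetical.

```python
from typing import List


def cost_optimal_quantile(cost_under: float, cost_over: float) -> float:
    """Newsvendor critical ratio: the quantile that minimizes expected cost."""
    if cost_under <= 0 or cost_over <= 0:
        raise ValueError("costs must be positive")
    return cost_under / (cost_under + cost_over)


def empirical_quantile(samples: List[float], q: float) -> float:
    """Smallest sample value with at least a fraction q of samples at or below it."""
    ordered = sorted(samples)
    n = len(ordered)
    for i, value in enumerate(ordered, start=1):
        if i / n >= q - 1e-12:  # small tolerance for floating-point rounding
            return value
    return ordered[-1]


# Hypothetical demand samples drawn from a forecast distribution.
demand_samples = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 100.0]

# A stockout costs 9 per unit, leftover stock costs 1 per unit: stock high.
q_expensive_stockout = cost_optimal_quantile(cost_under=9.0, cost_over=1.0)
stock_high = empirical_quantile(demand_samples, q_expensive_stockout)

# Balanced costs: stock to the median.
q_balanced = cost_optimal_quantile(cost_under=1.0, cost_over=1.0)
stock_median = empirical_quantile(demand_samples, q_balanced)
print(q_expensive_stockout, stock_high, q_balanced, stock_median)
```

The same forecast distribution yields two different stocking decisions, which is exactly the information a single point forecast cannot provide.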

Quantile forecast

The model predicts several quantiles (the median line plus shaded 50% and 80% intervals). Drag the “uncertainty level”: more uncertainty widens the intervals. Pick the quantile that matches whether under- or over-prediction costs you more.

uncertainty level: 0.40

What does quantile forecasting (with the pinball loss) give you?

A single best-guess number
Direct predictions of percentiles (e.g. 10th/50th/90th) forming prediction intervals, so you can pick the quantile matching your cost asymmetry
The exact future value with certainty

Chapter 4: Global Models

The deep-learning era’s biggest shift: train one model across many series, not one model per series. A retailer with a million products doesn’t fit a million ARIMA models — it trains a single global network on all of them. The model learns patterns shared across series (weekly seasonality, holiday spikes, promotion effects) and transfers them.

The payoffs are large. Data efficiency: a new or short series (a just-launched product) borrows strength from similar series, so the cold-start problem softens. Shared structure: the model learns common patterns once, instead of relearning them per series. Scale: one model to train, deploy, and maintain for millions of series. DeepAR (Amazon, 2017) popularized this approach with a global autoregressive RNN producing probabilistic forecasts, and most modern deep forecasting methods are global. The catch: series must be made comparable (normalized for scale), since one product sells 5 units and another 50,000.
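
Making series comparable usually means dividing each one by a per-series scale before it reaches the shared model, then multiplying the forecast back afterwards. The sketch below uses the mean absolute value as the scale, which is one common choice and an assumption here, not something this lesson prescribes; the helper names are hypothetical.

```python
from typing import List


def fit_scale(history: List[float]) -> float:
    """Per-series scale: mean absolute value, falling back to 1.0 for all-zero series."""
    scale = sum(abs(v) for v in history) / len(history)
    return scale if scale > 0 else 1.0


def to_model_space(values: List[float], scale: float) -> List[float]:
    return [v / scale for v in values]


def from_model_space(values: List[float], scale: float) -> List[float]:
    return [v * scale for v in values]


# Two hypothetical products with the same shape at very different volumes.
slow_seller = [4.0, 5.0, 6.0]
best_seller = [40000.0, 50000.0, 60000.0]

slow_scaled = to_model_space(slow_seller, fit_scale(slow_seller))
best_scaled = to_model_space(best_seller, fit_scale(best_seller))
print(slow_scaled, best_scaled)
```

After scaling, both products look identical to the model, so a pattern learned on one transfers to the other. The scale is computed from the history only, so no information from the forecast window leaks in.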

One model, many series

Many related series (faint) share one global model, which learns common patterns. A short/new series (highlighted) borrows strength from the others — far better than fitting it alone. Drag to add more series and watch the shared model sharpen.

number of series: 8

What is a “global” forecasting model?

A separate model fit to each series
One model trained across many series, sharing learned patterns, which helps short/new series and scales to millions
A model that only works on global (worldwide) data

Chapter 5: N-BEATS — interpretable deep stacks

N-BEATS (2019) showed that a pure MLP architecture, with no recurrence and no attention, could beat strong statistical baselines on standard benchmarks such as M4. Its design is a stack of blocks, each of which produces two outputs: a backcast (its best reconstruction of the lookback) and a forecast (its contribution to the future). The backcast is subtracted from the input, so the next block works only on the residual, that is, what previous blocks couldn’t explain. The final forecast is the sum of all blocks’ forecasts.

This residual stacking is the same idea as boosting or ResNets: each block refines what the last left behind. And N-BEATS has an interpretable variant where blocks are constrained to specific basis functions — one stack outputs only smooth trend shapes (polynomials), another only periodic seasonality shapes (sinusoids). The decomposition from Chapter 2 becomes explicit: you can read off the learned trend and seasonality separately. N-BEATS proved deep forecasting doesn’t need fancy architectures — well-structured MLPs with residual basis decomposition are remarkably strong.
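
The residual stacking loop itself is simple, and a toy version makes the data flow visible. Note the substitution: real N-BEATS blocks are fully connected networks trained end to end that output basis coefficients, whereas the blocks below are closed-form fits (a straight line for trend, per-phase means for seasonality) so the sketch runs without any training. All names and the toy series are hypothetical.

```python
from typing import Callable, List, Tuple

Block = Callable[[List[float], int], Tuple[List[float], List[float]]]


def trend_block(window: List[float], horizon: int) -> Tuple[List[float], List[float]]:
    """Fit a straight line to the window; backcast on it, forecast past it."""
    n = len(window)
    t_mean = (n - 1) / 2
    y_mean = sum(window) / n
    var = sum((t - t_mean) ** 2 for t in range(n))
    slope = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(window)) / var
    intercept = y_mean - slope * t_mean
    backcast = [intercept + slope * t for t in range(n)]
    forecast = [intercept + slope * t for t in range(n, n + horizon)]
    return backcast, forecast


def make_seasonal_block(period: int) -> Block:
    """Block that models the window as a repeating cycle of per-phase means."""
    def block(window: List[float], horizon: int) -> Tuple[List[float], List[float]]:
        n = len(window)
        means = [sum(window[p::period]) / len(window[p::period]) for p in range(period)]
        backcast = [means[t % period] for t in range(n)]
        forecast = [means[t % period] for t in range(n, n + horizon)]
        return backcast, forecast
    return block


def stack_forecast(lookback: List[float], horizon: int,
                   blocks: List[Block]) -> Tuple[List[float], List[float]]:
    """Each block explains part of the residual; partial forecasts are summed."""
    residual = list(lookback)
    forecast = [0.0] * horizon
    for block in blocks:
        backcast, partial = block(residual, horizon)
        residual = [r - b for r, b in zip(residual, backcast)]  # pass on the leftover
        forecast = [f + p for f, p in zip(forecast, partial)]
    return forecast, residual


# Toy series: linear trend plus a 4-step cycle (illustrative data).
pattern = [1.0, 0.0, -1.0, 0.0]
full = [0.5 * t + pattern[t % 4] for t in range(32)]
lookback, target = full[:24], full[24:]

trend_only, _ = stack_forecast(lookback, 8, [trend_block])
stacked, leftover = stack_forecast(lookback, 8, [trend_block, make_seasonal_block(4)])
trend_err = max(abs(f - y) for f, y in zip(trend_only, target))
stacked_err = max(abs(f - y) for f, y in zip(stacked, target))
print(round(trend_err, 3), round(stacked_err, 3))
```

Adding the seasonality block cuts the worst-case error sharply because it only has to explain what the trend block left behind. The remaining error comes from the line fit being slightly biased by the cycle, which is the kind of leftover a further block could pick up.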

Residual basis stacks

A trend block fits the smooth shape; the residual passes to a seasonality block that fits the cycles; their forecasts sum to the prediction. Drag the number of blocks — each refines the leftover, and the forecast (orange) closes on the target (teal).

blocks: 2

How does N-BEATS’ residual stacking work?

Each block predicts the whole series independently and they vote
Each block backcasts part of the input, subtracts it, and the next block works on the residual; forecasts sum (interpretable trend/seasonality blocks optional)
It uses one giant recurrent network

Chapter 6: Transformers & Foundation Models

Transformers came to time series, with a twist. Naively, attention over every timestep is expensive and — surprisingly — not obviously better than simple baselines. The breakthrough was PatchTST: chop the series into patches (like a Vision Transformer chops an image), and run the transformer over patches, not raw timesteps. Patching shortens the sequence (cheaper attention), and lets each token capture a local sub-pattern. PatchTST is also channel-independent (each variable forecast separately with shared weights), which proved a strong, simple recipe.
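
Patching is easy to picture in code. The sketch below only shows the chopping step, with a hypothetical `make_patches` helper; it drops any incomplete trailing patch, whereas padding the tail is an equally reasonable design choice. In a real model each patch would then be linearly projected to an embedding before entering the transformer.

```python
from typing import List


def make_patches(series: List[float], patch_len: int, stride: int) -> List[List[float]]:
    """Cut a series into fixed-length patches; each patch becomes one token."""
    if patch_len <= 0 or stride <= 0:
        raise ValueError("patch_len and stride must be positive")
    last_start = len(series) - patch_len
    return [series[s:s + patch_len] for s in range(0, last_start + 1, stride)]


# Toy lookback of 12 steps (illustrative data).
series = [float(v) for v in range(12)]

non_overlapping = make_patches(series, patch_len=4, stride=4)
overlapping = make_patches(series, patch_len=4, stride=2)
print(len(series), "timesteps ->", len(non_overlapping), "or", len(overlapping), "tokens")
```

Twelve timesteps become three tokens with non-overlapping patches, or five with a stride of 2. Since attention cost grows with the square of the sequence length, fewer tokens translate directly into cheaper attention.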

The Temporal Fusion Transformer (TFT) takes a different tack: attention plus gating, designed to handle covariates (including known-future inputs like holidays and promotions) and output quantiles, with interpretable attention over time. And the frontier is foundation models for time series (TimesFM, Chronos, Moirai), pretrained on enormous, diverse time-series corpora so they can forecast a brand-new series zero-shot, with no training on that series, much as an LLM continues text it has never seen. This is the same trajectory as the rest of AI: from bespoke per-task models to large pretrained models that generalize.

Patching a time series

PatchTST chops the series into patches (segments), each becoming one transformer token — shorter sequence, local sub-patterns. Drag the patch size: bigger patches = fewer, coarser tokens; smaller = more, finer.

patch size: 3

What is a time-series “foundation model” (e.g. TimesFM)?

A model that only works on financial data
A model pretrained on huge diverse time-series corpora that forecasts a brand-new series zero-shot, like an LLM
A model with no parameters

Chapter 7: Forecasting, Live (showcase)

Build a series from trend, seasonality, and noise, then forecast it with a prediction interval. Adjust the components and the horizon, and watch how the forecast extrapolates the structure while the interval widens with both noise and distance. This is forecasting in one picture: predict the predictable, quantify the rest.

Interactive forecast with uncertainty

Set trend strength, seasonality, and noise. The model fits the history (teal) and forecasts the horizon (orange) with an 80% interval (shaded). More noise → wider band; longer horizon → the band fans out. The structure is extrapolated; the noise becomes uncertainty.

trend: 0.40

seasonality: 0.60

noise: 0.30

Notice: when seasonality dominates, the forecast confidently repeats the cycle. When noise dominates, the interval balloons — the model honestly admits it can’t predict randomness. That honesty is the whole game; a forecaster that’s confidently wrong is worse than one that’s usefully unsure.
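
The fanning-out band can be reproduced with a few lines of arithmetic. This sketch rests on a strong, clearly labeled assumption: forecast errors are Gaussian and accumulate like a random walk, so the standard deviation at step h is sigma times the square root of h. Other error models give different shapes; the function name and numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist
from typing import List, Tuple


def prediction_interval(point_forecast: List[float], sigma: float,
                        coverage: float = 0.80) -> List[Tuple[float, float]]:
    """Central interval per step, assuming random-walk Gaussian errors."""
    if not 0.0 < coverage < 1.0:
        raise ValueError("coverage must be strictly between 0 and 1")
    # For 80% coverage this is the 90th percentile of a standard normal.
    z = NormalDist().inv_cdf(0.5 + coverage / 2)
    bands = []
    for h, mu in enumerate(point_forecast, start=1):
        half_width = z * sigma * sqrt(h)  # assumed growth with horizon
        bands.append((mu - half_width, mu + half_width))
    return bands


# Hypothetical flat point forecast over four steps.
point = [100.0, 100.0, 100.0, 100.0]
quiet = prediction_interval(point, sigma=5.0)
noisy = prediction_interval(point, sigma=10.0)
quiet_widths = [hi - lo for lo, hi in quiet]
noisy_widths = [hi - lo for lo, hi in noisy]
print([round(w, 2) for w in quiet_widths])
```

Both behaviors from the showcase appear: the band widens with the horizon, and doubling the noise level doubles the width at every step.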

Chapter 8: Covariates, Challenges & What to Use

Real forecasting uses covariates — extra information beyond the target’s past:

Known-future: things you know in advance (holidays, planned promotions, day-of-week). Hugely helpful — the model can anticipate a holiday spike. TFT specializes in these.
Past-observed: related signals you only know up to now (a correlated product’s sales).
Static: per-series metadata (product category, store location).
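
A known-future covariate can be illustrated with the simplest possible model: estimate a baseline and a holiday uplift from history, then apply the uplift wherever the calendar says a holiday is coming. This is a deliberately minimal sketch with hypothetical names and made-up numbers; a model such as TFT learns far richer interactions.

```python
from typing import List, Tuple


def fit_holiday_uplift(history: List[float], was_holiday: List[bool]) -> Tuple[float, float]:
    """Return (baseline, uplift): mean on normal days and the extra on holidays."""
    normal = [y for y, h in zip(history, was_holiday) if not h]
    special = [y for y, h in zip(history, was_holiday) if h]
    if not normal or not special:
        raise ValueError("need both holiday and non-holiday observations")
    baseline = sum(normal) / len(normal)
    uplift = sum(special) / len(special) - baseline
    return baseline, uplift


def forecast_with_calendar(baseline: float, uplift: float,
                           will_be_holiday: List[bool]) -> List[float]:
    """Known-future flags let the forecast place the spike in advance."""
    return [baseline + (uplift if h else 0.0) for h in will_be_holiday]


# Made-up daily sales with two past holidays.
history = [100.0, 102.0, 98.0, 180.0, 100.0, 101.0, 99.0, 178.0]
was_holiday = [False, False, False, True, False, False, False, True]
will_be_holiday = [False, False, True, False]  # known from the calendar

baseline, uplift = fit_holiday_uplift(history, was_holiday)
informed = forecast_with_calendar(baseline, uplift, will_be_holiday)
uninformed = [sum(history) / len(history)] * len(will_be_holiday)
print(informed, uninformed)
```

The informed forecast puts the spike on the right day, while the uninformed one smears the holiday effect across every day and misses the spike entirely.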

The hard challenges: non-stationarity (the pattern itself changes — a pandemic, a new competitor — breaking models trained on the old regime); varying scale (series span orders of magnitude, so normalization is essential); distribution shift; long horizons (compounding error); and irregular sampling. As for what to use: classical methods (ARIMA, ETS) are still great for few, well-behaved series; N-BEATS / PatchTST for many series with rich patterns; TFT when covariates and interpretability matter; foundation models (TimesFM) for zero-shot or cold-start. Match the tool to the data, the horizon, and whether you need uncertainty and explanations.

Known-future covariates help

A holiday (marker) causes a spike. Without the covariate the model misses it (orange); with the known-future covariate it anticipates the spike (teal). Drag to move the holiday and watch the informed forecast track it.

holiday position: 0.75

What is a “known-future” covariate and why is it valuable?

A secret model parameter
Information known in advance (holidays, promotions): the model can anticipate its effect (e.g. a holiday spike) instead of being surprised
The true future value, given as input

Chapter 9: Cheat Sheet & Connections

lookback window

recent history (+ covariates: known-future, past, static)

↓ model (global, across many series)

decompose / learn

trend + seasonality + residual; N-BEATS stacks, PatchTST patches, TFT attention

↓ output a distribution

probabilistic horizon

quantiles / parametric → prediction intervals that widen with horizon

Method	Type	Strength
ARIMA / ETS	classical, per-series	few well-behaved series
DeepAR	global RNN, probabilistic	many series + uncertainty
N-BEATS	residual MLP stacks	strong, interpretable (trend/seasonality)
TFT	attention + gating	covariates, quantiles, interpretable
PatchTST	patch transformer	strong, simple, long horizons
TimesFM / Chronos	foundation model	zero-shot, cold-start

Keep exploring

→ Embedding / Patch Layers — the patching PatchTST uses
→ Attention Variants — the attention in TFT/PatchTST
→ Gaussian Processes — another principled uncertainty tool
→ Kalman Filter — classical recursive state forecasting

“What I cannot create, I do not understand.” You just rebuilt modern forecasting: frame lookback→horizon, decompose into trend and seasonality, predict a distribution not a point, train one global model across many series, and choose your architecture — residual MLP stacks, patch transformers, or a zero-shot foundation model. Predict the predictable; quantify the rest.