Embedded Real-Time Systems — From Absolute Zero to Mastery

Chapter 0: What Is Real-Time?

An airbag must deploy in 10 milliseconds. Not "usually 10ms" — always 10ms. If it takes 11ms, someone dies. Your desktop computer doesn't care if a web page loads in 200ms or 250ms. But an airbag controller? A robotic arm? A pacemaker? For these systems, the time your code finishes is part of its correctness.

This is the fundamental difference. On your laptop, a program is "correct" if it computes the right answer. On an embedded real-time system, a program is correct only if it computes the right answer before its deadline. A perfect answer that arrives late is a wrong answer.

The definition: A real-time system is one where the correctness of a computation depends not only on the logical result, but also on the time at which the result is produced. Missing a deadline is a system failure.

There are three categories of real-time constraints:

Type	Deadline Miss Consequence	Example
Hard real-time	Catastrophic failure (death, destruction)	Airbag, pacemaker, fly-by-wire
Firm real-time	Result is worthless but no catastrophe	Video frame decode (dropped frame)
Soft real-time	Degraded quality but still usable	Audio streaming, UI responsiveness

The key metric for hard real-time is WCET — Worst-Case Execution Time. Not the average. Not the typical case. The absolute worst case, considering every possible branch, every cache miss, every interrupt. If your WCET exceeds your deadline, your system is broken by design, even if it "usually" works.

Why "usually works" is terrifying: An airbag controller that meets its 10ms deadline 99.99% of the time will still fail once every 10,000 deployments. With millions of cars on the road, that's hundreds of deaths. Hard real-time demands 100.000% — not statistical guarantees, but mathematical proofs.

Consider a simple control loop: read sensor, compute output, write actuator. If this loop must run at 1kHz (every 1ms), then ALL processing — sensor read, computation, actuator write — must complete within 1ms. Every. Single. Time. No garbage collection pauses. No page faults. No "just wait a moment while I resize this hash table."
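
The budget for such a loop can be sketched as a simple check. This is a minimal illustration, not a WCET analysis: the struct, the function names, and any microsecond figures you feed it are hypothetical, and real worst-case numbers have to come from measurement or static analysis.

```c
#include <stdbool.h>
#include <stdint.h>

/* Worst-case execution time of each stage, in microseconds.
   The field names are illustrative assumptions. */
typedef struct {
    uint32_t sensor_read_us;
    uint32_t compute_us;
    uint32_t actuator_write_us;
} loop_wcet_t;

/* Total worst-case time for one pass through the loop. */
uint32_t loop_wcet_total_us(const loop_wcet_t *w) {
    return w->sensor_read_us + w->compute_us + w->actuator_write_us;
}

/* The loop is acceptable only if the worst case fits in the period. */
bool loop_meets_deadline(const loop_wcet_t *w, uint32_t period_us) {
    return loop_wcet_total_us(w) <= period_us;
}
```

For a 1kHz loop you would call `loop_meets_deadline(&w, 1000)`. The point of the sketch is that the comparison uses worst-case figures, never averages.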

This is why hard real-time systems typically run on bare-metal microcontrollers (or under a small RTOS) rather than on general-purpose operating systems like Linux. A microcontroller gives you:

Deterministic timing — no virtual memory, no scheduler preemption surprises
Direct hardware access — you write to registers, not through driver layers
Known WCET — you count cycles, not hope for the best

Interactive figure: Real-Time Deadline Visualization. Three tasks with deadlines, where the duration of Task B is adjustable (shown at 5ms). Green marks a met deadline; red marks a missed one.

A motor controller must update PWM output every 100µs. The computation occasionally takes 120µs due to a floating-point edge case. Is this system correct?

Yes: it only happens occasionally
No: the WCET exceeds the deadline, so the system is broken by design
It depends on how often it happens

Chapter 1: STM32L475 Architecture

Now that you understand WHY real-time matters, let's meet the hardware that makes it possible. The STM32L475 is an ARM Cortex-M4 microcontroller made by STMicroelectronics. It's the heart of the B-L475E-IOT01A discovery board — a popular development platform for IoT and embedded applications.

Why this chip specifically? Because it sits at the sweet spot: powerful enough for real signal processing (floating-point unit, DSP instructions, 80MHz clock), yet efficient enough to run on a coin cell battery (1.1µA in STOP2 mode). It's what you'd choose for a battery-powered sensor node that occasionally needs to crunch numbers fast.

Think of it this way: Your laptop CPU has billions of transistors, gigabytes of RAM, and consumes 30+ watts. The STM32L475 has ~2 million gates, 128KB of RAM, and draws roughly 100 microamps per MHz while running (about 8mA at 80MHz). If your laptop is ARM-based, the two even share an instruction set family. Vastly different design goals.

Core Specifications

Feature	STM32L475	Why It Matters
CPU	ARM Cortex-M4F @ 80MHz	Single-cycle multiply, hardware FPU, DSP extensions
Flash	1 MB	Your program lives here (non-volatile)
SRAM	128 KB	Variables, stack, heap (volatile)
FPU	Single-precision IEEE 754	Hardware float in 1 cycle vs 20+ in software
Low-power	STOP2: 1.1µA	Years on a coin cell with periodic wake-up
Timers	16 timers (2×32-bit, 14×16-bit)	PWM, input capture, periodic interrupts
ADC	3×12-bit, 5 Msps	Read analog sensors (temperature, voltage, current)
Comms	3×SPI, 3×I2C, 6×USART, USB	Talk to sensors, displays, radios, PCs

Memory Map

The ARM Cortex-M4 uses a memory-mapped I/O architecture. This means peripherals (timers, GPIO, UART) appear at specific addresses in the same address space as RAM and Flash. Writing to address 0x48000014 doesn't write to RAM — it sets the output pins on GPIO port A. This is how you control hardware: by writing to magic addresses.

Address Range	What Lives Here	Size
`0x0800_0000`	Flash (your program)	1 MB
`0x2000_0000`	SRAM (your variables)	128 KB
`0x4000_0000`	APB1 peripherals (TIM2-7, USART2-5, SPI2-3, I2C1-3)	-
`0x4001_0000`	APB2 peripherals (TIM1/8/15-17, USART1, SPI1, SYSCFG, EXTI)	-
`0x4002_0000`	AHB1 peripherals (DMA, RCC, Flash control)	-
`0x4800_0000`	AHB2 peripherals (GPIO A-H, ADC, RNG)	-
`0xE000_0000`	Cortex-M4 internals (NVIC, SysTick, debug)	-

Key insight: Everything is an address. There is no "open file," no "call driver." You enable a peripheral by writing a '1' to a specific bit at a specific address. You read a sensor by reading from a specific address. The entire system is just load/store operations to memory-mapped registers.
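
The "base plus offset" arithmetic behind every register access can be sketched as below. The RCC and GPIOB base addresses and the two offsets are the ones used later in this guide; the helper name `reg_addr` and the `REG32` macro are illustrative, not part of any vendor header.

```c
#include <stdint.h>

/* Peripheral base addresses (as used in the blink example later on). */
#define RCC_BASE            0x40021000u
#define GPIOB_BASE          0x48000400u

/* Register offsets within each peripheral block. */
#define RCC_AHB2ENR_OFFSET  0x4Cu
#define GPIO_ODR_OFFSET     0x14u

/* Absolute address of a register = peripheral base + register offset. */
uint32_t reg_addr(uint32_t base, uint32_t offset) {
    return base + offset;
}

/* On the target, the address is cast to a volatile pointer and dereferenced.
   Do not use this on a PC: these addresses only exist on the chip. */
#define REG32(addr) (*(volatile uint32_t *)(uintptr_t)(addr))
```

On the target, `REG32(reg_addr(GPIOB_BASE, GPIO_ODR_OFFSET))` names the same register as the literal address 0x48000414.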

The Clock Tree

The STM32L475 has a complex clock system. The main system clock (SYSCLK) can come from multiple sources: an internal 4MHz MSI oscillator, an internal 16MHz HSI, or an external crystal (HSE). A PLL (Phase-Locked Loop) multiplies these up. For maximum performance: HSI16 → PLL → 80MHz SYSCLK.

Each peripheral bus has its own clock divider:

AHB bus: up to 80MHz (GPIO, DMA)
APB1 bus: up to 80MHz (TIM2-7, USART2-5, I2C, SPI2-3)
APB2 bus: up to 80MHz (TIM1/8/15-17, USART1, SPI1, SYSCFG)

Before you can USE any peripheral, you must ENABLE its clock. The RCC (Reset and Clock Control) block controls which peripherals get a clock signal. A peripheral with its clock disabled is inert: register accesses have no effect, and it draws essentially no dynamic power.

Interactive figure: STM32L475 Block Diagram. Each peripheral block shows its base address and key features; the orange paths show the clock distribution.

Before you can toggle a GPIO pin on the STM32L475, what must you do first?

Install a device driver
Call GPIO_Init() from the HAL library
Enable its clock in the RCC register
Configure the PLL to 80MHz

Chapter 2: Register-Level Programming

Forget HAL. Forget Arduino. Forget every abstraction layer you've ever used. We're going bare metal. On a microcontroller, controlling hardware means writing specific values to specific memory addresses. These addresses are called registers — 32-bit locations that directly control hardware behavior.

Why bare metal? Because in real-time systems, you need to know exactly what your code does and exactly how long it takes. HAL functions hide complexity, add overhead, and make timing harder to predict. A single HAL_GPIO_WritePin() call can cost on the order of 8-15 cycles depending on inlining and debug checks. A direct register write is a single store instruction.

The philosophy: Every peripheral is controlled by a small set of 32-bit registers at known addresses. Each bit in each register has a specific meaning defined in the reference manual. You read the manual, you write the bits, hardware responds. No magic.

Worked Example: Blink LED on PB14

The B-L475E-IOT01A board has an LED connected to pin PB14 (Port B, pin 14). To blink it, we need three steps: (1) enable GPIOB's clock, (2) configure pin 14 as output, (3) toggle the pin.

Step 1: Enable GPIOB clock (RCC_AHB2ENR)

The RCC AHB2 peripheral clock enable register lives at address 0x4002_104C. Bit 1 controls GPIOB's clock.

c
// RCC base: 0x40021000
// AHB2ENR offset: 0x4C
// Bit 1: GPIOBEN
*(volatile uint32_t*)0x4002104C |= (1 << 1);
// Equivalent: RCC->AHB2ENR |= RCC_AHB2ENR_GPIOBEN;

Step 2: Configure PB14 as general-purpose output (GPIOB_MODER)

The MODER register controls pin mode. Each pin uses 2 bits: 00=input, 01=output, 10=alternate function, 11=analog. Pin 14 occupies bits [29:28].

c
// GPIOB base: 0x48000400
// MODER offset: 0x00
// Bits [29:28] for pin 14: set to 01 (output)
volatile uint32_t* GPIOB_MODER = (volatile uint32_t*)0x48000400;
*GPIOB_MODER &= ~(3 << 28);  // Clear bits 29:28
*GPIOB_MODER |=  (1 << 28);  // Set to 01 (output)
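
The clear-then-set idiom generalizes to any pin and mode. The helper below is a sketch under that assumption; `moder_with_mode` and the enum names are hypothetical, and the function works on a plain value so it can be checked off-target.

```c
#include <stdint.h>

/* Pin modes as encoded in MODER (2 bits per pin). */
enum {
    GPIO_MODE_INPUT  = 0,
    GPIO_MODE_OUTPUT = 1,
    GPIO_MODE_ALT    = 2,
    GPIO_MODE_ANALOG = 3
};

/* Return a copy of 'moder' with the 2-bit field for 'pin' (0..15)
   replaced by 'mode'. All other pins are left untouched. */
uint32_t moder_with_mode(uint32_t moder, unsigned pin, uint32_t mode) {
    unsigned shift = pin * 2u;
    moder &= ~(3u << shift);          /* clear the pin's two bits */
    moder |= (mode & 3u) << shift;    /* write the new mode */
    return moder;
}
```

On the target this would be used as `*GPIOB_MODER = moder_with_mode(*GPIOB_MODER, 14, GPIO_MODE_OUTPUT);`. Clearing first matters because a pin whose field already reads 11 (analog) would stay at 11 if you only OR in 01.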

Step 3: Toggle PB14 (GPIOB_ODR)

The Output Data Register (ODR) at offset 0x14 directly controls pin state. Bit 14 = pin 14.

c
// GPIOB_ODR at 0x48000414
volatile uint32_t* GPIOB_ODR = (volatile uint32_t*)0x48000414;
*GPIOB_ODR ^= (1 << 14);  // XOR toggles the bit

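Timing guarantee: That toggle compiles to a short fixed sequence: load ODR, XOR bit 14, store the result back (LDR, EOR, STR). Nothing in it depends on data or branches, so unless an interrupt intervenes it takes the same handful of clock cycles every time, a few tens of nanoseconds at 80MHz. The pin changes when the final STR reaches the GPIO port.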

The Complete Blink Program

c
// Bare-metal LED blink — STM32L475, PB14
// No HAL, no libraries, no RTOS

#include <stdint.h>

#define RCC_AHB2ENR   (*(volatile uint32_t*)0x4002104C)
#define GPIOB_MODER   (*(volatile uint32_t*)0x48000400)
#define GPIOB_ODR     (*(volatile uint32_t*)0x48000414)

void delay(volatile uint32_t count) {
    while(count--);  // Busy-wait; cycles per iteration depend on compiler and flags
}

int main(void) {
    // 1. Enable GPIOB clock
    RCC_AHB2ENR |= (1 << 1);

    // 2. Configure PB14 as output
    GPIOB_MODER &= ~(3 << 28);  // Clear
    GPIOB_MODER |=  (1 << 28);  // Output mode

    // 3. Blink forever
    while(1) {
        GPIOB_ODR ^= (1 << 14);  // Toggle LED
        delay(800000);            // Crude delay: this program never sets up the PLL, so the core runs from the 4MHz MSI
    }
}

That's 10 lines of actual logic. No initialization framework, no HAL_Init(), no SystemClock_Config() abstraction. You understand every single byte that flows to the hardware.

The BSRR Register: Atomic Set/Reset

There's a subtle problem with ODR ^= (1 << 14). It's a read-modify-write operation: read ODR, XOR with mask, write back. If an interrupt fires between the read and write, and that ISR also modifies ODR, you get a race condition. The solution is the BSRR (Bit Set/Reset Register):

c
// GPIOB_BSRR at 0x48000418
// Bits [15:0]  — write 1 to SET corresponding pin
// Bits [31:16] — write 1 to RESET corresponding pin
#define GPIOB_BSRR  (*(volatile uint32_t*)0x48000418)

GPIOB_BSRR = (1 << 14);       // SET pin 14 (atomic, single write)
GPIOB_BSRR = (1 << (14+16)); // RESET pin 14 (atomic, single write)

Why BSRR exists: A write to BSRR is a single STR instruction — no read-modify-write. It's inherently atomic. No interrupt can corrupt it. This is a hardware-level solution to a concurrency problem. Real-time engineers think about this constantly.
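
To make the BSRR semantics concrete, here is a small software model. It is only a sketch for reasoning about the register off-target; the function names are hypothetical, and the real hardware applies the write in a single bus access.

```c
#include <stdint.h>

/* Word to write to BSRR to drive 'pin' (0..15) high. */
uint32_t bsrr_set(unsigned pin)   { return 1u << pin; }

/* Word to write to BSRR to drive 'pin' (0..15) low. */
uint32_t bsrr_reset(unsigned pin) { return 1u << (pin + 16u); }

/* Model of what a BSRR write does to ODR: reset bits clear pins,
   set bits set pins, and set takes priority if both are given. */
uint32_t odr_after_bsrr(uint32_t odr, uint32_t bsrr) {
    uint32_t set_mask   = bsrr & 0xFFFFu;
    uint32_t reset_mask = (bsrr >> 16) & 0xFFFFu;
    return ((odr & ~reset_mask) | set_mask) & 0xFFFFu;
}
```

Note that bits written as 0 leave their pins alone, which is why no read of the current state is needed.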

Interactive figure: 32-Bit Register Viewer. The register shown is GPIOB_MODER, where each pair of bits configures one pin's mode; setting or clearing individual bits updates the hex value.

Why do we use BSRR instead of ODR for setting GPIO pins in interrupt-heavy code?

BSRR is a single atomic write (no read-modify-write race condition)
BSRR is faster
BSRR uses less power

Chapter 3: ARM Cortex-M4 Assembly

Sometimes C isn't enough. When you need cycle-precise timing, when you're writing the first instructions that run at boot (the reset handler), or when you need to understand exactly what the compiler generated — you need assembly. The Cortex-M4 uses the Thumb-2 instruction set: a mix of 16-bit and 32-bit instructions that balances code density with performance.

Don't panic. ARM assembly is remarkably readable compared to x86. Most instructions do exactly one thing: load, store, add, compare, branch. No cryptic prefixes, no segment registers, no stack machine weirdness.

When you'll actually use assembly: (1) Startup code — the very first instructions after reset. (2) Critical ISRs where you need exact cycle counts. (3) DSP inner loops (MAC operations). (4) Context switching in an RTOS. (5) Reading compiler output to verify optimization.

The Register File

The Cortex-M4 has 16 core 32-bit registers: 13 general-purpose registers (R0–R12) plus three with dedicated roles (SP, LR, PC):

Register	Name	Purpose
R0–R3	Arguments / scratch	Function arguments, return value (R0), caller-saved
R4–R11	Callee-saved	Preserved across function calls, must be saved/restored
R12	IP (Intra-Procedure)	Scratch register, used by linker veneers
R13	SP (Stack Pointer)	Points to top of stack (two banks: MSP and PSP)
R14	LR (Link Register)	Return address for function calls (BL stores PC here)
R15	PC (Program Counter)	Address of next instruction to execute

Essential Instructions

arm
@ Data movement
MOV  R0, #42          @ R0 = 42 (immediate value)
MOV  R1, R0           @ R1 = R0 (register to register)
LDR  R0, [R1]         @ R0 = memory[R1] (load from address)
STR  R0, [R1]         @ memory[R1] = R0 (store to address)
LDR  R0, =0x48000418  @ R0 = 0x48000418 (load constant)

@ Arithmetic
ADD  R0, R1, R2       @ R0 = R1 + R2
SUB  R0, R1, #1      @ R0 = R1 - 1
MUL  R0, R1, R2       @ R0 = R1 * R2 (single cycle on M4!)

@ Bitwise
ORR  R0, R0, #(1<<14) @ Set bit 14
BIC  R0, R0, #(1<<14) @ Clear bit 14 (Bit Clear)
EOR  R0, R0, #(1<<14) @ Toggle bit 14 (XOR)

@ Compare and branch
CMP  R0, #0          @ Compare R0 with 0 (sets flags)
BEQ  label            @ Branch if equal (Z flag set)
BNE  label            @ Branch if not equal
BL   function         @ Branch with Link (call: saves return address to LR)
BX   LR               @ Branch to LR (return from function)

Worked Example: GPIO Toggle in Assembly

Let's write the PB14 toggle entirely in assembly. BSRR can only set or reset a pin, not toggle it, so this version does a read-modify-write on ODR:

arm
@ Toggle PB14 via ODR: 4 instructions plus the return
toggle_led:
    LDR  R0, =0x48000414  @ R0 = address of GPIOB_ODR
    LDR  R1, [R0]         @ R1 = current ODR value
    EOR  R1, R1, #(1<<14) @ Toggle bit 14 in our copy
    STR  R1, [R0]         @ Write back to ODR
    BX   LR               @ Return

Now compare to what the C compiler generates from GPIOB_ODR ^= (1 << 14); at -O2 optimization:

arm
@ GCC -O2 output for GPIOB_ODR ^= (1 << 14)
    LDR  R3, =0x48000414  @ Load ODR address
    LDR  R2, [R3]          @ Read current ODR value
    EOR  R2, R2, #16384   @ XOR with (1<<14) = 16384
    STR  R2, [R3]          @ Write back

Good news: At -O2, GCC generates essentially optimal code — 4 instructions, same as our hand-written version. The compiler is your friend when you give it optimization flags. Assembly is for the rare cases where you need guarantees the compiler can't provide (interrupt-disabled sections, exact cycle timing, specific instruction ordering).

The Calling Convention (AAPCS)

When calling a function from C or from another assembly routine, ARM follows strict rules:

Arguments: R0, R1, R2, R3 (first 4 args). Additional args go on stack.

Return value: R0 (32-bit) or R0+R1 (64-bit)

Callee must save: R4–R11, LR (if calling another function)

Caller must save: R0–R3, R12 (if you need them after the call)

Interactive figure: ARM Register File, stepping through assembly. Registers change as each instruction executes; the one just modified is highlighted in orange.

After executing BL myFunction, what does the LR (R14) register contain?

The address of myFunction
The address of the instruction AFTER the BL (the return address)
Zero

Chapter 4: Timer Configuration (Deep Dive)

Timers are the heartbeat of real-time systems. They generate periodic interrupts ("wake me up every 1ms"), measure external signal timing (input capture), and produce precise output waveforms (PWM). The STM32L475 has 16 timers. We'll focus on TIM2 — a 32-bit general-purpose timer clocked at up to 80MHz.

A timer is surprisingly simple at its core: it's just a counter that increments every clock tick. When the counter reaches a programmed value, it resets to zero and optionally fires an interrupt. That's it. The complexity comes from the many ways you can configure the clock source, counting direction, and output behavior.

The mental model: Think of a timer as a metronome. You set how fast it ticks (prescaler) and how many ticks until it "dings" (auto-reload value). When it dings, it can wake up your code, toggle a pin, or trigger another peripheral (like an ADC).

TIM2 Registers

TIM2 base address: 0x4000_0000. The key registers:

Register	Offset	Purpose
CR1	0x00	Control register 1 — enable timer, set counting mode
DIER	0x0C	DMA/Interrupt enable — which events generate interrupts
SR	0x10	Status register — which events have occurred (clear by writing 0)
CNT	0x24	Counter value — the actual 32-bit count
PSC	0x28	Prescaler — divides input clock by (PSC+1)
ARR	0x2C	Auto-reload — counter resets when it reaches this value

Worked Example: 1ms Periodic Interrupt

Goal: TIM2 fires an interrupt every 1ms (1kHz). The timer clock is 80MHz.

f_interrupt = f_clock / ((PSC + 1) × (ARR + 1))

We want f_interrupt = 1000 Hz. So:

1000 = 80,000,000 / ((PSC + 1) × (ARR + 1))

(PSC + 1) × (ARR + 1) = 80,000

Choose PSC = 79, ARR = 999:

(79 + 1) × (999 + 1) = 80 × 1000 = 80,000 ✔

Verify: 80MHz / 80,000 = 1000 Hz = 1ms period. Perfect.

c
// Configure TIM2 for 1ms interrupt at 80MHz

// 1. Enable TIM2 clock (RCC APB1ENR1, bit 0)
*(volatile uint32_t*)0x40021058 |= (1 << 0);  // RCC_APB1ENR1 |= TIM2EN

// 2. Set prescaler: divide 80MHz by 80 → 1MHz tick
*(volatile uint32_t*)0x40000028 = 79;  // TIM2_PSC = 79

// 3. Set auto-reload: count 1000 ticks → 1ms
*(volatile uint32_t*)0x4000002C = 999;  // TIM2_ARR = 999

// 4. Force an update so the new PSC takes effect now (EGR bit 0 = UG),
//    then clear the UIF flag that this forced update sets
*(volatile uint32_t*)0x40000014 = (1 << 0);  // TIM2_EGR = UG
*(volatile uint32_t*)0x40000010 = 0;         // TIM2_SR = 0

// 5. Enable update interrupt (DIER bit 0 = UIE)
*(volatile uint32_t*)0x4000000C |= (1 << 0);  // TIM2_DIER |= UIE

// 6. Enable timer (CR1 bit 0 = CEN)
*(volatile uint32_t*)0x40000000 |= (1 << 0);  // TIM2_CR1 |= CEN

// 7. Enable TIM2 interrupt in NVIC (IRQ #28)
*(volatile uint32_t*)0xE000E100 = (1 << 28);  // NVIC_ISER0: writing 1 enables, 0 has no effect

Why PSC=79 and ARR=999? We could also choose PSC=7999, ARR=9. Or PSC=0, ARR=79999. The interrupt rate is identical. But PSC=79 gives a nice 1MHz internal tick rate — each CNT increment = 1 microsecond. This makes debugging easier: read CNT and you directly know elapsed microseconds.
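
The formula can be wrapped in two small helpers for checking a PSC/ARR pair before programming it. This is a sketch using integer arithmetic; the function names are hypothetical, and it assumes the divisions come out exact, as they do for the values in this chapter.

```c
#include <stdint.h>

/* Update-event frequency in Hz for a given timer clock, PSC and ARR.
   Integer division: exact only when the clock divides evenly. */
uint32_t timer_update_hz(uint32_t f_clk_hz, uint32_t psc, uint32_t arr) {
    uint64_t divisor = ((uint64_t)psc + 1u) * ((uint64_t)arr + 1u);
    return (uint32_t)(f_clk_hz / divisor);
}

/* ARR needed to reach 'target_hz' with a fixed PSC.
   Assumes target_hz > 0 and that the tick rate is a multiple of it. */
uint32_t timer_arr_for(uint32_t f_clk_hz, uint32_t psc, uint32_t target_hz) {
    uint32_t tick_hz = f_clk_hz / (psc + 1u);   /* counter tick rate */
    return tick_hz / target_hz - 1u;
}
```

All three PSC/ARR pairs mentioned above give `timer_update_hz(80000000, psc, arr) == 1000`. Remember that for the 16-bit timers both values must also fit in 16 bits.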

How the Counter Works

After enabling (CEN=1), the hardware does this in an infinite loop:

1. Clock tick

80MHz input divided by (PSC+1) = 1MHz

2. CNT++

Counter increments: 0, 1, 2, ... 999

3. CNT == ARR?

Counter reached 999? If not, wait for the next tick.

4. Update event (when CNT == ARR)

Set UIF flag in SR, fire interrupt if UIE=1, reset CNT to 0

Then repeat forever.
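
The loop above can be modeled in software to check your intuition about when update events occur. This is a simplified up-counting model written for illustration; the struct and function names are hypothetical, and it ignores features such as preload buffering and one-pulse mode.

```c
#include <stdint.h>

typedef struct {
    uint32_t psc, arr;      /* programmed prescaler and auto-reload */
    uint32_t psc_cnt;       /* internal prescaler counter */
    uint32_t cnt;           /* the CNT register */
    uint32_t updates;       /* number of update events so far */
} timer_model_t;

/* Advance the model by one input clock cycle. */
void timer_clock(timer_model_t *t) {
    if (t->psc_cnt < t->psc) {      /* prescaler has not expired yet */
        t->psc_cnt++;
        return;
    }
    t->psc_cnt = 0;                 /* one counter tick every PSC+1 clocks */
    if (t->cnt >= t->arr) {         /* CNT was at ARR: wrap and flag update */
        t->cnt = 0;
        t->updates++;
    } else {
        t->cnt++;
    }
}

/* Run the model for 'clocks' input cycles; returns total update events. */
uint32_t timer_run(timer_model_t *t, uint32_t clocks) {
    for (uint32_t i = 0; i < clocks; i++) {
        timer_clock(t);
    }
    return t->updates;
}
```

With PSC = 79 and ARR = 999, the first update event lands on input clock number 80,000, which is 1ms at 80MHz.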

Interactive figure: Timer Counter Animation. The counter increments, hits ARR, resets, and fires an interrupt. With PSC = 79 and ARR = 999 the update frequency is 1000.0 Hz (1.000ms period).

You need a 50Hz interrupt (20ms period) from TIM2 at 80MHz. With PSC=7999, what should ARR be?

49
100
199
999

Chapter 5: Interrupts & the NVIC

Your timer is counting. When it hits ARR, it needs to tell the CPU "hey, time's up!" It can't just wait for the CPU to check — that's polling, and polling wastes cycles. Instead, the timer sends an interrupt: an asynchronous hardware signal that forces the CPU to immediately stop what it's doing and jump to a handler function.

The NVIC (Nested Vectored Interrupt Controller) is the traffic cop. It receives interrupt requests from all 82 possible sources on the STM32L475 (timers, GPIO, UART, DMA, ADC...) and decides which one the CPU handles first, based on priority.

The key word is "nested": A higher-priority interrupt can interrupt a lower-priority interrupt handler that's already running. This is called preemption. If your 1kHz motor control ISR is running and a fault interrupt fires (higher priority), the motor ISR is suspended at an instruction boundary, the fault handler runs to completion, then the motor ISR resumes. This is how you guarantee critical interrupts are served within a bounded latency.

How Interrupts Work (Step by Step)

1. Peripheral raises IRQ

Timer UIF flag set → interrupt request to NVIC

2. NVIC checks priority

Is this IRQ higher priority than current execution? If yes → preempt.

3. Hardware stacks context

CPU pushes R0-R3, R12, LR, PC, xPSR to stack (8 registers, 12 cycles)

4. PC loads handler address

From vector table at (IRQ# + 16) × 4 bytes from table base

5. ISR executes

Your handler code runs. MUST clear the interrupt flag!

6. Return from ISR

BX LR with special EXC_RETURN value → hardware unstacks context
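
The vector lookup in step 4 is plain arithmetic, sketched below. The IRQ numbers are the ones used in this guide; the helper names are hypothetical, and the table base you pass in is an assumption (it is wherever your vector table actually lives, commonly the start of Flash).

```c
#include <stdint.h>

#define TIM2_IRQ   28u   /* IRQ numbers used in this guide */
#define EXTI0_IRQ   6u

/* Byte offset of an IRQ's handler address within the vector table.
   The first 16 entries hold the initial SP and the system exceptions. */
uint32_t vector_offset(uint32_t irq) {
    return (irq + 16u) * 4u;
}

/* Absolute address of the vector, given the table base. */
uint32_t vector_address(uint32_t table_base, uint32_t irq) {
    return table_base + vector_offset(irq);
}
```

For TIM2 the offset works out to (28 + 16) × 4 = 176 bytes, or 0xB0.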

Interrupt Latency

On the Cortex-M4, the time from interrupt assertion to first ISR instruction is 12 cycles (150ns at 80MHz), assuming zero-wait-state memory. This includes stacking 8 registers. The NVIC also supports tail-chaining: if another interrupt is pending when an ISR returns, the CPU skips the unstack/restack sequence and jumps directly to the next handler in just 6 cycles.

Event	Cycles	Time @ 80MHz
Interrupt entry (stacking + fetch)	12	150 ns
Interrupt return (unstacking)	12	150 ns
Tail-chain (back-to-back ISRs)	6	75 ns
Late-arriving (higher priority during stack)	0 extra	Redirects immediately

Priority Configuration

The STM32L475 uses 4 bits for priority (values 0–15, where 0 is highest priority). These 4 bits are split into preemption priority and sub-priority using a configurable group setting. With the default grouping (4 bits preempt, 0 sub):

c
// Set TIM2 interrupt (IRQ #28) to priority 2
// NVIC_IPR registers at 0xE000E400, one byte per IRQ
// Priority in top 4 bits of the byte
*(volatile uint8_t*)(0xE000E400 + 28) = (2 << 4);

// Set EXTI0 (IRQ #6) to priority 1 (higher than TIM2)
*(volatile uint8_t*)(0xE000E400 + 6) = (1 << 4);
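
The address and encoding used in those two writes can be expressed as helpers, together with the preemption rule. This is a sketch that assumes the grouping described above (all 4 bits used for preemption); the function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define NVIC_IPR_BASE  0xE000E400u
#define PRIO_BITS      4u            /* implemented priority bits */

/* Address of the priority byte for a given IRQ number. */
uint32_t nvic_ipr_addr(uint32_t irq) {
    return NVIC_IPR_BASE + irq;
}

/* Encode a priority level 0..15 into the top 4 bits of the byte. */
uint8_t nvic_prio_byte(uint8_t level) {
    return (uint8_t)((level & 0x0Fu) << (8u - PRIO_BITS));
}

/* With all 4 bits used for preemption, a pending IRQ preempts the running
   handler only if its level is numerically lower (more urgent). */
bool nvic_preempts(uint8_t pending_level, uint8_t running_level) {
    return pending_level < running_level;
}
```

Equal levels never preempt each other: an interrupt at the same priority as the running handler waits until that handler returns.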

Worked Example: External Interrupt on PA0 (Falling Edge)

c
// Configure EXTI0 for falling edge on PA0 (for example, a button wired to PA0;
// the board's own user button is on PC13 and would use EXTI13 instead)

// 1. Enable GPIOA clock
*(volatile uint32_t*)0x4002104C |= (1 << 0);  // RCC_AHB2ENR bit 0

// 2. Put PA0 in input mode (MODER bits [1:0] = 00; the reset default is analog)
*(volatile uint32_t*)0x48000000 &= ~(3 << 0);  // GPIOA_MODER

// 3. Enable SYSCFG clock (needed for EXTI mux)
*(volatile uint32_t*)0x40021060 |= (1 << 0);  // RCC_APB2ENR bit 0

// 4. Map EXTI0 to PA0 (SYSCFG_EXTICR1, bits [3:0] = 0000 = Port A)
*(volatile uint32_t*)0x40010008 &= ~0xF;  // SYSCFG_EXTICR1 bits[3:0] = PA

// 5. Configure falling edge trigger (EXTI base 0x40010400, FTSR1 at offset 0x0C)
*(volatile uint32_t*)0x4001040C |= (1 << 0);  // EXTI_FTSR1

// 6. Unmask EXTI0 (EXTI_IMR1 at offset 0x00, bit 0)
*(volatile uint32_t*)0x40010400 |= (1 << 0);  // EXTI_IMR1

// 7. Enable EXTI0 in NVIC (IRQ #6)
*(volatile uint32_t*)0xE000E100 = (1 << 6);  // NVIC_ISER0: writing 1 enables, 0 has no effect

// 8. ISR handler (name must match vector table)
void EXTI0_IRQHandler(void) {
    // Clear pending bit (write 1 to clear!)
    *(volatile uint32_t*)0x40010414 = (1 << 0);  // EXTI_PR1 (offset 0x14)
    // Do something (toggle LED, set flag, etc.)
    GPIOB_ODR ^= (1 << 14);
}

Critical detail: You MUST clear the interrupt flag (EXTI_PR1 for external interrupts, TIM2_SR for timers) inside the ISR. If you forget, the interrupt fires again immediately after returning, creating an infinite loop that locks up your system.

Interactive figure: NVIC Priority & Preemption. Multiple interrupts fire at different times, showing how the NVIC handles preemption and tail-chaining. Lower number = higher priority.

TIM2 ISR (priority 3) is running. EXTI0 fires with priority 1. What happens?

EXTI0 waits until TIM2 ISR finishes
TIM2 ISR is suspended, EXTI0 handler runs immediately (preemption)
Both run simultaneously on different cores

Chapter 6: ISR Design Patterns

Here's the golden rule of interrupt service routines: get in, do the minimum, get out. Every cycle you spend inside an ISR is a cycle where lower-priority interrupts are blocked. A long ISR doesn't just slow down your system — it can cause other interrupts to miss their deadlines.

What's "the minimum"? Set a flag. Copy one byte to a buffer. Start a DMA transfer. That's it. Never: allocate memory, call printf, do floating-point math, or loop over arrays inside an ISR.

The rule of thumb: Your ISR should take less than 10% of the interrupt period. For a 1kHz interrupt (1ms period), the ISR should complete in under 100µs. For a 10kHz interrupt (100µs period), under 10µs. Violate this and you're headed for missed deadlines.

Pattern 1: Flag-Based (Simplest)

The ISR sets a volatile flag. The main loop checks and clears it. The word volatile tells the compiler "this variable can change at any time outside normal program flow — never optimize away reads of it."

c
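volatile uint8_t timer_flag = 0;

void TIM2_IRQHandler(void) {
    TIM2_SR = ~(1 << 0);    // Clear UIF only: SR bits clear on write-0, writing 1 leaves a flag alone
    timer_flag = 1;         // Set flag for the main loop
}  // Body is a few instructions; hardware entry and exit add about 24 cycles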

int main(void) {
    // ... setup ...
    while(1) {
        if(timer_flag) {
            timer_flag = 0;
            // Do the heavy processing here (main context)
            process_sensor_data();
            update_display();
        }
    }
}

Pattern 2: Ring Buffer (For Streaming Data)

When data arrives byte-by-byte (UART, SPI), you need a buffer. A ring buffer (circular buffer) lets the ISR write and the main loop read without blocking each other, as long as the buffer doesn't overflow.

c
#define BUF_SIZE 64  // Must be power of 2 for fast modulo
volatile uint8_t buf[BUF_SIZE];
volatile uint8_t head = 0;  // ISR writes here
volatile uint8_t tail = 0;  // Main reads here

void USART1_IRQHandler(void) {
    uint8_t byte = USART1_RDR;           // Read received byte
    buf[head] = byte;                     // Store in buffer
    head = (head + 1) & (BUF_SIZE - 1);  // Advance head (wraps)
}

// Main loop reads when data available
while(tail != head) {
    uint8_t data = buf[tail];
    tail = (tail + 1) & (BUF_SIZE - 1);
    process(data);
}

Why power-of-2 buffer size? Wrapping an index to an arbitrary size needs a division (head % BUF_SIZE) or a compare-and-branch. When BUF_SIZE is a power of 2, head & (BUF_SIZE - 1) does the same job with a single AND instruction. In an ISR, every cycle counts.
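
The pattern above silently overwrites unread data if the producer laps the consumer. One common way to handle that is sketched below: keep one slot empty so that "full" and "empty" are distinguishable, and drop the byte when full. The type and function names are hypothetical, and dropping on overflow is only one possible policy.

```c
#include <stdbool.h>
#include <stdint.h>

#define RB_SIZE 64u                      /* must be a power of 2 */
#define RB_MASK (RB_SIZE - 1u)

typedef struct {
    volatile uint8_t data[RB_SIZE];
    volatile uint8_t head;               /* written only by the producer (ISR) */
    volatile uint8_t tail;               /* written only by the consumer (main) */
} ring_t;

/* Producer side: returns false (byte dropped) when the buffer is full.
   One slot is kept empty so that head == tail always means "empty". */
bool rb_put(ring_t *rb, uint8_t byte) {
    uint8_t next = (uint8_t)((rb->head + 1u) & RB_MASK);
    if (next == rb->tail) {
        return false;                    /* full */
    }
    rb->data[rb->head] = byte;
    rb->head = next;
    return true;
}

/* Consumer side: returns false when there is nothing to read. */
bool rb_get(ring_t *rb, uint8_t *out) {
    if (rb->tail == rb->head) {
        return false;                    /* empty */
    }
    *out = rb->data[rb->tail];
    rb->tail = (uint8_t)((rb->tail + 1u) & RB_MASK);
    return true;
}
```

Because each index has exactly one writer, the ISR and the main loop never modify the same variable, which is what lets this work without disabling interrupts. The usable capacity is RB_SIZE - 1 bytes.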

Pattern 3: Double Buffer (For Block Processing)

When you're sampling at high rates (e.g., audio at 48kHz), you can't process each sample individually. Instead: fill one buffer while processing the other, then swap. The ISR fills Buffer A, triggers a "buffer full" flag, main processes A while ISR fills B, repeat.

c
#define BLOCK_SIZE 256
volatile int16_t bufA[BLOCK_SIZE];
volatile int16_t bufB[BLOCK_SIZE];
volatile int16_t* fill_buf = bufA;   // ISR writes here
volatile int16_t* proc_buf = bufB;   // Main reads here
volatile uint16_t fill_idx = 0;
volatile uint8_t buffer_ready = 0;

void ADC1_2_IRQHandler(void) {        // ADC1 and ADC2 share one vector in ST's startup file
    fill_buf[fill_idx++] = ADC1_DR;   // Read sample
    if(fill_idx >= BLOCK_SIZE) {
        fill_idx = 0;
        // Swap buffers
        volatile int16_t* tmp = fill_buf;
        fill_buf = proc_buf;
        proc_buf = tmp;
        buffer_ready = 1;
    }
}
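
The listing above shows only the producer. A sketch of both sides follows, written so it can run off-target: `on_sample` stands in for the ISR body, and `poll_block_average` is a hypothetical main-loop consumer that averages a finished block.

```c
#include <stdint.h>

#define BLOCK_SIZE 256

volatile int16_t bufA[BLOCK_SIZE];
volatile int16_t bufB[BLOCK_SIZE];
volatile int16_t *fill_buf = bufA;   /* producer writes here */
volatile int16_t *proc_buf = bufB;   /* consumer reads here */
volatile uint16_t fill_idx = 0;
volatile uint8_t buffer_ready = 0;

/* Stand-in for the ADC ISR body: store one sample, swap when full. */
void on_sample(int16_t sample) {
    fill_buf[fill_idx++] = sample;
    if (fill_idx >= BLOCK_SIZE) {
        volatile int16_t *tmp = fill_buf;
        fill_buf = proc_buf;
        proc_buf = tmp;
        fill_idx = 0;
        buffer_ready = 1;
    }
}

/* Main-loop side: if a block is ready, store its average in *avg,
   clear the flag, and return 1. Returns 0 when no block is waiting. */
int poll_block_average(int32_t *avg) {
    if (!buffer_ready) {
        return 0;
    }
    buffer_ready = 0;
    int32_t sum = 0;
    for (int i = 0; i < BLOCK_SIZE; i++) {
        sum += proc_buf[i];
    }
    *avg = sum / BLOCK_SIZE;
    return 1;
}
```

The pattern carries its own deadline: the consumer must finish with a block before the producer fills the other one. At 48kHz and 256 samples per block, that is about 5.3ms.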

What NEVER To Do in an ISR

Anti-pattern	Why It's Fatal	Correct Alternative
`printf()`	Calls malloc, UART waits, 1000+ cycles	Set flag, print in main
`malloc()/free()`	Non-deterministic time, can fragment	Pre-allocate all buffers
`for(i=0; i<1000;...)`	Blocks all lower-priority interrupts	Process one element per ISR call
Float math	FPU context save adds 17 cycles entry	Use fixed-point in ISR, float in main
Forget to clear flag	ISR re-enters immediately = system hang	ALWAYS clear flag first thing

Interactive figure: ISR Timing, Short vs Long. A well-designed ISR (flag + deferred processing) compared with a bloated ISR (all processing inline); the long ISR blocks subsequent interrupts.

Your ADC ISR currently takes 80µs. It fires every 100µs (10kHz). What percentage of CPU time does it consume?

80%: only 20µs left for main loop and lower-priority interrupts
8%: plenty of headroom
50%: it's fine

Chapter 7: Timer + Interrupt System (Showcase)

Time to put it all together. We'll build a complete real-time data acquisition system: three timers running at different rates, an ADC sampling sensor data, an interrupt handler collecting the results, and a main loop that processes data when buffers are full. This is the basic shape of a real embedded sensor node.

The system: TIM2 @ 1kHz triggers ADC1 to sample a sensor. The ADC conversion-complete ISR stores each result into a 256-sample buffer (in a production design, DMA1 could do this copy without CPU involvement). When the buffer fills, main loop processes it (FFT, averaging, whatever). Meanwhile, TIM3 @ 10Hz updates a display, and TIM7 @ 1Hz blinks a heartbeat LED. All running concurrently via interrupts.

System Architecture

Timer	Rate	PSC	ARR	Purpose	Priority
TIM2	1 kHz	79	999	ADC trigger (sampling)	1 (highest)
TIM3	10 Hz	7999	999	Display update	3
TIM7	1 Hz	7999	9999	Heartbeat LED	7 (lowest)

Verification: TIM3 = 80MHz / (8000 × 1000) = 10Hz. TIM7 = 80MHz / (8000 × 10000) = 1Hz. Correct.

Priority Assignment Rationale

The ADC sampling MUST happen exactly on time (hard real-time) — a jittered sample corrupts the frequency content of the signal. Display update can slip a few ms without anyone noticing (soft real-time). The heartbeat LED is purely cosmetic. Priority reflects criticality.

Complete System Code

c
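// Complete real-time data acquisition system
// STM32L475 @ 80MHz, bare metal
// Register macros (TIM2_SR, ADC1_CR, ADC1_DR, ...), the setup_*() functions,
// and the processing functions are assumed to be defined elsewhere.

#include <stdint.h>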

#define BLOCK_SIZE 256
volatile uint16_t adc_bufA[BLOCK_SIZE];
volatile uint16_t adc_bufB[BLOCK_SIZE];
volatile uint16_t* adc_fill = adc_bufA;
volatile uint16_t* adc_proc = adc_bufB;
volatile uint16_t adc_idx = 0;
volatile uint8_t data_ready = 0;
volatile uint8_t display_flag = 0;
volatile uint32_t sample_count = 0;

// TIM2 ISR: 1kHz ADC sampling (priority 1)
void TIM2_IRQHandler(void) {
    TIM2_SR = 0;  // Clear ALL flags (fast: single write)
    ADC1_CR |= (1 << 2);  // Start ADC conversion (ADSTART)
}

// ADC ISR: conversion complete
// (ADC1 and ADC2 share one vector, named ADC1_2_IRQHandler in ST's startup file)
void ADC1_2_IRQHandler(void) {
    adc_fill[adc_idx] = ADC1_DR;  // Read clears EOC flag
    adc_idx++;
    sample_count++;
    if(adc_idx >= BLOCK_SIZE) {
        adc_idx = 0;
        volatile uint16_t* tmp = adc_fill;
        adc_fill = adc_proc;
        adc_proc = tmp;
        data_ready = 1;
    }
}

// TIM3 ISR: 10Hz display update (priority 3)
void TIM3_IRQHandler(void) {
    TIM3_SR = 0;
    display_flag = 1;
}

// TIM7 ISR: 1Hz heartbeat LED (priority 7)
void TIM7_IRQHandler(void) {
    TIM7_SR = 0;
    GPIOB_ODR ^= (1 << 14);  // Toggle LED
}

int main(void) {
    setup_clocks();   // 80MHz from PLL
    setup_gpio();     // PB14 output
    setup_adc();      // ADC1 channel, 12-bit
    setup_tim2();     // 1kHz
    setup_tim3();     // 10Hz
    setup_tim7();     // 1Hz

    while(1) {
        if(data_ready) {
            data_ready = 0;
            process_block(adc_proc, BLOCK_SIZE);
        }
        if(display_flag) {
            display_flag = 0;
            update_display(sample_count);
        }
        __WFI();  // Sleep until next interrupt
    }
}

Timing Budget

ISR	Cycles	Time	% of Period
TIM2 (1kHz)	~10	125ns	0.013%
ADC1 (1kHz)	~20	250ns	0.025%
TIM3 (10Hz)	~8	100ns	<0.001%
TIM7 (1Hz)	~10	125ns	<0.001%
Total overhead			<0.05%

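The cycle counts above cover the handler bodies only. Adding roughly 24 cycles of hardware entry and exit per interrupt brings the total to about 0.1% of the CPU. The remaining 99.9% is spent either sleeping (WFI) or in the main loop processing data. The interrupt overhead is negligible because we followed the "short ISR" pattern.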
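
A budget like this is easy to recompute whenever a handler changes. The sketch below totals the load in parts per million; the struct and function names are hypothetical, and the body cycle counts you pass in are estimates that should be replaced with measured values.

```c
#include <stdint.h>

#define ISR_ENTRY_EXIT_CYCLES 24u   /* 12 in + 12 out, per the latency table */

typedef struct {
    uint32_t body_cycles;   /* estimated cost of the handler body */
    uint32_t rate_hz;       /* how often the interrupt fires */
} isr_cost_t;

/* Cycles per second consumed by one ISR, including entry/exit overhead. */
uint64_t isr_cycles_per_sec(isr_cost_t isr) {
    return (uint64_t)(isr.body_cycles + ISR_ENTRY_EXIT_CYCLES) * isr.rate_hz;
}

/* Total interrupt load in parts per million of the CPU clock. */
uint32_t isr_load_ppm(const isr_cost_t *isrs, unsigned n, uint32_t f_cpu_hz) {
    uint64_t total = 0;
    for (unsigned i = 0; i < n; i++) {
        total += isr_cycles_per_sec(isrs[i]);
    }
    return (uint32_t)(total * 1000000u / f_cpu_hz);
}
```

Feeding in the four handlers from the table at 80MHz gives a load just under 1000 ppm, dominated by the two 1kHz handlers.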

Interactive figure: Real-Time Data Acquisition System. Full system simulation: three timers fire at different rates, ADC samples fill a buffer, and the main loop processes each block when it is full. Settings shown: ADC rate 1000 Hz, processing time 50 ms, block size 256. A red flash marks a deadline miss.

Chapter 8: Low-Power Real-Time

Here's a paradox: real-time systems must respond instantly, but many of them run on batteries. A sensor node that wakes up every second to read temperature, then sleeps for 999ms, can run for years on a coin cell. The trick is the WFI (Wait For Interrupt) instruction: it halts the CPU until the next interrupt fires. Power consumption drops sharply while waiting (down to microamps in the deeper sleep modes), and wake-up takes microseconds or less.

The STM32L475 is designed for exactly this use case. Its STOP2 mode draws only 1.1 microamps while keeping SRAM and registers alive. Wake-up time from STOP2 is about 3.3µs — fast enough for most applications.

The insight: Real-time doesn't mean "always running." It means "responding within the deadline WHEN something happens." Between events, the CPU should be asleep. A well-designed embedded system is asleep 99%+ of the time.

Power Modes on STM32L475

Mode	Current	Wake-up Time	What's Preserved	Wake Sources
Run	100 µA/MHz	-	Everything	-
Sleep	~25 µA/MHz	~1 µs	All, CPU halted	Any interrupt
STOP2	1.1 µA	3.3 µs	SRAM, regs, RTC	EXTI, RTC, LPTIM
Standby	0.3 µA	50 µs	Backup regs only	WKUP pins, RTC
Shutdown	0.03 µA	~ms	Nothing	WKUP pins

WFI: The Simplest Power Savings

c
// Simple sleep-between-interrupts pattern
// CPU runs at 80MHz only during ISR + main processing
// Sleeps at ~2mA the rest of the time

int main(void) {
    setup_all_peripherals();

    while(1) {
        if(data_ready) {
            data_ready = 0;
            process_data();      // Runs at 80MHz, takes ~5ms
        }
        __WFI();  // ARM instruction: Wait For Interrupt
        // CPU sleeps here until ANY enabled interrupt fires
        // Wake-up from Sleep mode adds about 1 cycle to normal interrupt entry
    }
}

Tickless Idle: Maximum Power Savings

The standard approach uses SysTick (a periodic 1ms interrupt) for timekeeping. But SysTick wakes the CPU 1000 times per second even when there's nothing to do. Tickless idle stops SysTick entirely and programs the RTC wake-up timer for the next scheduled event:

c
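// Tickless idle: sleep until next event, not next tick

uint32_t next_event_ms = get_next_scheduled_time();
uint32_t sleep_duration = next_event_ms - current_time_ms;

// Disable SysTick (clear ENABLE, bit 0)
SYSTICK_CTRL &= ~(1u << 0);

// Program RTC wake-up timer.
// Assumptions: RTC write protection is already unlocked, the wake-up
// interrupt is routed through EXTI, and the wake-up clock is RTC/16
// from a 32.768kHz LSE (2048 ticks per second, so WUTR is NOT in ms).
// WUTR is 16 bits wide: the longest sleep in this setup is about 32 s.
// RTC_ISR follows the same naming as the other register macros here.
RTC_CR &= ~(1u << 10);               // WUTE = 0 so WUTR can be written
while (!(RTC_ISR & (1u << 2))) { }   // Wait for WUTWF (write allowed)
uint32_t ticks = (sleep_duration * 2048u) / 1000u;  // ms -> RTC ticks
if (ticks > 65536u) ticks = 65536u;                 // Clamp to 16-bit range
RTC_WUTR = (ticks > 0u) ? (ticks - 1u) : 0u;        // Fires after WUTR+1 ticks
RTC_CR |= (1u << 14) | (1u << 10);   // WUTIE = 1, WUTE = 1

// Enter STOP2 mode
PWR_CR1 = (PWR_CR1 & ~0x7u) | 0x2u;  // LPMS = 010 (STOP2); 001 would be STOP1
SCB_SCR |= (1u << 2);                // SLEEPDEEP = 1
__WFI();

// --- wake up here ---
SCB_SCR &= ~(1u << 2);               // SLEEPDEEP = 0 for later plain WFI sleeps
// Restore clocks (after STOP2 the system clock is MSI, 4MHz by default)
restore_80mhz_clock();
// Update time accounting
current_time_ms += actual_sleep_duration();
// Re-enable SysTick if needed
SYSTICK_CTRL |= (1u << 0);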

Power Budget Calculation

A sensor node that samples at 1Hz, processes for 5ms, then sleeps in STOP2:

I_avg = (I_active × t_active + I_sleep × t_sleep) / T_period

I_avg = (8mA × 5ms + 1.1µA × 995ms) / 1000ms

I_avg = (40µA·s + 1.095µA·s) / 1s = 41.1 µA

With a CR2032 coin cell (225 mAh):

Battery life = 225mAh / 0.0411mA = 5,474 hours = 228 days

Compare to always-on at 80MHz (8mA): 225mAh / 8mA = 28 hours. Sleeping in STOP2 between samples gives roughly a 195x improvement.
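The same budget can be written as two small functions, which makes it easy to try other wake rates and active times. This is a minimal sketch using integer nanoamps to avoid floating point; the function names and parameter units are illustrative choices, not something the lesson defines.

```c
#include <stdint.h>

/* Average current in nanoamps over one wake/sleep period.
 * active_ua: current while running, in microamps
 * sleep_na:  current while asleep, in nanoamps
 * active_ms: time awake per period; period_ms: total period */
static uint64_t avg_current_na(uint32_t active_ua, uint32_t sleep_na,
                               uint32_t active_ms, uint32_t period_ms)
{
    uint64_t active_na = (uint64_t)active_ua * 1000u;
    uint32_t sleep_ms = period_ms - active_ms;
    return (active_na * active_ms + (uint64_t)sleep_na * sleep_ms) / period_ms;
}

/* Battery life in whole hours for a capacity given in mAh. */
static uint32_t battery_life_hours(uint32_t capacity_mah, uint64_t avg_na)
{
    /* mAh -> nAh is a factor of 1,000,000 */
    return (uint32_t)(((uint64_t)capacity_mah * 1000000u) / avg_na);
}
```

For the 1Hz node above, `avg_current_na(8000, 1100, 5, 1000)` gives 41094 nA (about 41.1 µA), and `battery_life_hours(225, 41094) / 24` gives 228 days. Passing 8,000,000 nA for the always-on case gives 28 hours.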

Power Timeline Visualization

[Interactive visualization: active bursts (high current) separated by sleep periods. Default settings: wake rate 1.0 Hz, active time 5 ms, giving an average of 41.1 µA and 228 days on a CR2032.]

Your sensor node wakes every 100ms (10Hz) for 2ms of processing. What's the approximate average current if active current is 8mA and STOP2 is 1.1µA?

4 mA
~161 µA: (8mA×2ms + 1.1µA×98ms) / 100ms
1.1 µA

Chapter 9: Mastery & Connections

You now understand the full stack: from real-time deadlines down to individual register bits. Let's consolidate with reference material, then look at where this knowledge leads.

Timer Configuration Cheat Sheet

Desired Rate	PSC (80MHz clock)	ARR	Timer Tick
1 Hz	7999	9999	100µs
10 Hz	7999	999	100µs
100 Hz	799	999	10µs
1 kHz	79	999	1µs
10 kHz	7	999	100ns
100 kHz	0	799	12.5ns
1 MHz (PWM)	0	79	12.5ns
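Every row in the table follows from rate = f_clk / ((PSC + 1) × (ARR + 1)). The sketch below checks a PSC/ARR pair against that formula and searches for a pair that hits a target rate exactly. It assumes both registers are 16 bits wide; the function names are hypothetical.

```c
#include <stdint.h>

/* Update rate of a timer: f_clk / ((PSC + 1) * (ARR + 1)). */
static uint32_t timer_rate_hz(uint32_t clk_hz, uint16_t psc, uint16_t arr)
{
    uint64_t divider = ((uint64_t)psc + 1u) * ((uint64_t)arr + 1u);
    return (uint32_t)(clk_hz / divider);
}

/* Find an exact PSC/ARR pair for rate_hz, preferring the smallest
 * prescaler (finest tick) whose ARR still fits in 16 bits.
 * Returns 0 on success, -1 if no exact pair exists. */
static int timer_find_config(uint32_t clk_hz, uint32_t rate_hz,
                             uint16_t *psc, uint16_t *arr)
{
    if (rate_hz == 0u || clk_hz % rate_hz != 0u) {
        return -1;
    }
    uint32_t total = clk_hz / rate_hz;      /* (PSC + 1) * (ARR + 1) */
    for (uint32_t p = 1u; p <= 65536u; p++) {
        if (total % p != 0u) {
            continue;
        }
        uint32_t a = total / p;
        if (a <= 65536u) {
            *psc = (uint16_t)(p - 1u);
            *arr = (uint16_t)(a - 1u);
            return 0;
        }
    }
    return -1;
}
```

For 1 kHz at 80MHz the search returns PSC=1, ARR=39999 rather than the table's PSC=79, ARR=999. Both produce 1 kHz; the table's choices favor round tick periods (1µs here), which are easier to reason about when setting compare values.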

Interrupt Latency Reference

Scenario	Cycles	Time @ 80MHz
Normal entry	12	150 ns
Tail-chain	6	75 ns
Late-arriving (during stacking)	0 extra	Redirects
Return from ISR	12	150 ns
Wake from Sleep mode	12 + 1	~162 ns
Wake from STOP2	~264 cycles	3.3 µs

NVIC Priority Assignment Strategy

Rule of thumb for priority assignment:
Priority 0-1: Safety-critical (fault handlers, watchdog, emergency stop)
Priority 2-3: Hard real-time (motor control, ADC sampling, communication timeouts)
Priority 4-7: Firm real-time (display update, data logging, LED feedback)
Priority 8-15: Soft real-time (background housekeeping, statistics, debug output)
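Two details are easy to get wrong when applying these levels: a lower number means a more urgent interrupt, and the level is stored in the upper bits of an 8-bit priority field. The host-side sketch below models both, assuming 4 implemented priority bits (levels 0-15) and no sub-priority grouping. On the target you would normally call the CMSIS function `NVIC_SetPriority` instead; `PRIO_BITS`, `nvic_encode_priority`, and `preempts` are illustrative names.

```c
#include <stdint.h>
#include <stdbool.h>

#define PRIO_BITS 4u  /* assumed number of implemented priority bits */

/* Value stored in the 8-bit priority field: the level sits in the top bits. */
static uint8_t nvic_encode_priority(uint8_t level)
{
    return (uint8_t)((level << (8u - PRIO_BITS)) & 0xFFu);
}

/* True if an interrupt at level `incoming` can preempt a handler
 * running at level `running`. Equal levels never preempt each other. */
static bool preempts(uint8_t incoming, uint8_t running)
{
    return incoming < running;
}
```

Under this model a priority-1 emergency stop preempts a priority-5 logger (`preempts(1, 5)` is true), while two handlers at the same level simply run one after the other.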

ARM Assembly Quick Reference

Instruction	Cycles	Effect
MOV Rd, #imm	1	Rd = immediate
LDR Rd, [Rn]	2	Rd = mem[Rn]
STR Rd, [Rn]	2	mem[Rn] = Rd
ADD/SUB	1	Arithmetic
MUL	1	32-bit multiply (Cortex-M4)
B/BEQ/BNE	1-3	Branch (pipeline flush)
BL	1+	Call (saves return addr in LR)
PUSH/POP	1+N	Stack N registers
WFI	1+	Sleep until interrupt

Design Challenge: Real-Time Motor Controller

Design a system with these requirements:

PID loop at 10kHz — read encoder position, compute error, output PWM correction every 100µs
Encoder input capture — TIM1 in encoder mode, counting quadrature pulses
PWM output at 20kHz — TIM2 in PWM mode, duty cycle = PID output
Safety timeout — if no control update in 500µs, emergency stop (set PWM to 0)
Telemetry at 100Hz — send position/velocity/current over UART

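Key decisions: TIM4 @ 10kHz triggers PID ISR (priority 1). TIM1 counts encoder pulses continuously (no ISR needed, just read CNT). TIM2 generates PWM (no ISR needed, just update CCR1). A watchdog timer fires at 2kHz (every 500µs); if the PID ISR hasn't cleared its flag since the previous check, kill the motor. Because the check is periodic, a stalled loop is caught between 500µs and 1ms after its last update, so a strict 500µs bound needs a faster check. UART telemetry uses DMA with a ring buffer at priority 5.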

PID update: PSC=7, ARR=999 → 80MHz / 8 / 1000 = 10kHz ✔

PWM frequency: PSC=0, ARR=3999 → 80MHz / 4000 = 20kHz ✔

PWM duty cycle: CCR1 = PID_output × 3999 / max_output
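The body of the PID ISR could look roughly like the sketch below: one PID update followed by the CCR1 scaling from the formula above. It assumes a unidirectional drive (output clamped to 0..out_max), uses single-precision floats, and omits anti-windup; the `pid_ctrl_t` type, its fields, and both function names are hypothetical, and the gains are placeholders to be tuned for a real motor.

```c
#include <stdint.h>

#define PWM_ARR 3999u  /* from the 20kHz PWM configuration: PSC=0, ARR=3999 */

typedef struct {
    float kp, ki, kd;   /* gains (placeholders) */
    float integral;     /* accumulated error * dt */
    float prev_error;   /* error from the previous update */
    float out_max;      /* output is clamped to [0, out_max] */
} pid_ctrl_t;

/* One PID update; dt is the loop period in seconds (100e-6 at 10kHz). */
static float pid_step(pid_ctrl_t *pid, float setpoint, float measured, float dt)
{
    float error = setpoint - measured;
    pid->integral += error * dt;
    float derivative = (error - pid->prev_error) / dt;
    pid->prev_error = error;

    float out = pid->kp * error
              + pid->ki * pid->integral
              + pid->kd * derivative;
    if (out < 0.0f) {
        out = 0.0f;
    }
    if (out > pid->out_max) {
        out = pid->out_max;
    }
    return out;
}

/* CCR1 = PID_output * 3999 / max_output, truncated to an integer. */
static uint32_t pid_to_ccr(float out, float out_max)
{
    return (uint32_t)(out * (float)PWM_ARR / out_max);
}
```

In the ISR this would be used as `TIM2->CCR1 = pid_to_ccr(pid_step(&pid, target, TIM1->CNT, 100e-6f), pid.out_max);` (register access shown only for orientation). Clamping before scaling guarantees the compare value never exceeds ARR.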

Comparison: Bare Metal vs RTOS

Aspect	Bare Metal (this lesson)	RTOS (FreeRTOS, Zephyr)
Timing control	Exact cycle counts	Tick-based (typically 1ms granularity)
Code size	Minimal (your code only)	+10-50KB kernel
Complexity	Simple for <5 tasks	Better for 10+ tasks
Worst-case latency	Predictable (you control everything)	Kernel overhead adds ~5-20µs
Concurrency model	Interrupts + main loop	Threads + semaphores + queues
When to use	Simple systems, maximum performance	Complex systems, many tasks

Where to Go Next

RTOS concepts — When bare metal isn't enough: task scheduling, mutexes, message queues
DMA deep-dive — Memory-to-peripheral and peripheral-to-memory transfers without CPU
Communication protocols — SPI, I2C, CAN bus at the register level
DSP on Cortex-M4 — CMSIS-DSP library, fixed-point arithmetic, SIMD instructions
Safety-critical systems — IEC 61508, MISRA C, watchdogs, ECC memory

"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%." (Donald Knuth) In embedded real-time systems, that 3% is the ISRs. Get them right, and the whole system works. Get them wrong, and nothing else matters.

You're designing a system with a 10kHz PID loop, 1kHz data logging, and 10Hz display update. Which timer should have the highest NVIC priority?

The 10kHz PID timer: it has the tightest deadline (100µs)
The 1kHz data logger: data loss is unacceptable
The 10Hz display: user experience matters most
