The programming discipline where correctness depends on WHEN your code runs, not just what it computes.
An airbag must deploy in 10 milliseconds. Not "usually 10ms" — always 10ms. If it takes 11ms, someone dies. Your desktop computer doesn't care if a web page loads in 200ms or 250ms. But an airbag controller? A robotic arm? A pacemaker? For these systems, the time your code finishes is part of its correctness.
This is the fundamental difference. On your laptop, a program is "correct" if it computes the right answer. On an embedded real-time system, a program is correct only if it computes the right answer before its deadline. A perfect answer that arrives late is a wrong answer.
There are three categories of real-time constraints:
| Type | Deadline Miss Consequence | Example |
|---|---|---|
| Hard real-time | Catastrophic failure (death, destruction) | Airbag, pacemaker, fly-by-wire |
| Firm real-time | Result is worthless but no catastrophe | Video frame decode (dropped frame) |
| Soft real-time | Degraded quality but still usable | Audio streaming, UI responsiveness |
The key metric for hard real-time is WCET — Worst-Case Execution Time. Not the average. Not the typical case. The absolute worst case, considering every possible branch, every cache miss, every interrupt. If your WCET exceeds your deadline, your system is broken by design, even if it "usually" works.
Consider a simple control loop: read sensor, compute output, write actuator. If this loop must run at 1kHz (every 1ms), then ALL processing — sensor read, computation, actuator write — must complete within 1ms. Every. Single. Time. No garbage collection pauses. No page faults. No "just wait a moment while I resize this hash table."
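The "every single time" requirement can be sketched as a budget check: add up the worst-case time of each stage and compare the sum with the loop period. The function names and the stage timings used with them are illustrative assumptions, not measurements from real hardware.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Sum the worst-case execution time (in microseconds) of every stage
 * in one pass of the control loop. */
uint32_t wcet_total_us(const uint32_t *stage_wcet_us, size_t n)
{
    assert(stage_wcet_us != NULL);
    uint32_t total = 0;
    for (size_t i = 0; i < n; i++) {
        total += stage_wcet_us[i];
    }
    return total;
}

/* The loop meets its deadline only if the SUM of the worst cases fits
 * inside the period. Average-case numbers say nothing here. */
bool meets_deadline(const uint32_t *stage_wcet_us, size_t n,
                    uint32_t period_us)
{
    return wcet_total_us(stage_wcet_us, n) <= period_us;
}
```

For a 1kHz loop the period is 1000µs, so hypothetical stages of 200µs (sensor read), 500µs (computation) and 100µs (actuator write) fit, while a 750µs computation does not.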
This is why hard real-time systems usually run on microcontrollers rather than on application processors with a general-purpose operating system like Linux. A microcontroller gives you:
Now that you understand WHY real-time matters, let's meet the hardware that makes it possible. The STM32L475 is an ARM Cortex-M4 microcontroller made by STMicroelectronics. It's the heart of the B-L475E-IOT01A discovery board — a popular development platform for IoT and embedded applications.
Why this chip specifically? Because it sits at the sweet spot: powerful enough for real signal processing (floating-point unit, DSP instructions, 80MHz clock), yet efficient enough to run on a coin cell battery (1.1µA in STOP2 mode). It's what you'd choose for a battery-powered sensor node that occasionally needs to crunch numbers fast.
| Feature | STM32L475 | Why It Matters |
|---|---|---|
| CPU | ARM Cortex-M4F @ 80MHz | Single-cycle multiply, hardware FPU, DSP extensions |
| Flash | 1 MB | Your program lives here (non-volatile) |
| SRAM | 128 KB | Variables, stack, heap (volatile) |
| FPU | Single-precision IEEE 754 | Hardware float in 1 cycle vs 20+ in software |
| Low-power | STOP2: 1.1µA | Years on a coin cell with periodic wake-up |
| Timers | 16 timers (2×32-bit, 14×16-bit) | PWM, input capture, periodic interrupts |
| ADC | 3×12-bit, 5 Msps | Read analog sensors (temperature, voltage, current) |
| Comms | 3×SPI, 3×I2C, 6×USART, USB | Talk to sensors, displays, radios, PCs |
The ARM Cortex-M4 uses a memory-mapped I/O architecture. This means peripherals (timers, GPIO, UART) appear at specific addresses in the same address space as RAM and Flash. Writing to address 0x48000014 doesn't write to RAM — it sets the output pins on GPIO port A. This is how you control hardware: by writing to magic addresses.
| Address Range | What Lives Here | Size |
|---|---|---|
| 0x0800_0000 | Flash (your program) | 1 MB |
| 0x2000_0000 | SRAM (your variables) | 128 KB |
| 0x4000_0000 | APB1 peripherals (TIM2-7, USART2-5, SPI2-3, I2C1-3) | - |
| 0x4001_0000 | APB2 peripherals (SYSCFG, EXTI, TIM1/8/15-17, USART1, SPI1) | - |
| 0x4002_0000 | AHB1 peripherals (DMA, RCC, Flash control) | - |
| 0x4800_0000 | AHB2 peripherals (GPIO A-H, ADC, RNG) | - |
| 0xE000_0000 | Cortex-M4 internals (NVIC, SysTick, debug) | - |
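Every register address is a peripheral base plus a register offset. A minimal host-side sketch of that arithmetic for the GPIO ports follows; it assumes the ports are spaced 0x400 bytes apart, which matches the GPIOA (0x48000000) and GPIOB (0x48000400) addresses used in this lesson. The helper name `gpio_reg_addr` is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define GPIO_BASE         0x48000000u  /* GPIOA base on AHB2 */
#define GPIO_PORT_STRIDE  0x400u       /* assumed spacing between ports */
#define GPIO_MODER_OFFSET 0x00u        /* mode register */
#define GPIO_ODR_OFFSET   0x14u        /* output data register */

/* Compute the absolute address of a GPIO register.
 * port is 'A'..'H'; offset is the register offset inside the port. */
uint32_t gpio_reg_addr(char port, uint32_t offset)
{
    assert(port >= 'A' && port <= 'H');
    return GPIO_BASE + (uint32_t)(port - 'A') * GPIO_PORT_STRIDE + offset;
}
```

With these constants, `gpio_reg_addr('A', GPIO_ODR_OFFSET)` gives 0x48000014, the address mentioned above for the port A output pins.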
The STM32L475 has a complex clock system. The main system clock (SYSCLK) can come from multiple sources: an internal 4MHz MSI oscillator, an internal 16MHz HSI, or an external crystal (HSE). A PLL (Phase-Locked Loop) multiplies these up. For maximum performance: HSI16 → PLL → 80MHz SYSCLK.
Each peripheral bus has its own clock divider:
Before you can USE any peripheral, you must ENABLE its clock. The RCC (Reset and Clock Control) block controls which peripherals get a clock signal. A peripheral whose clock is disabled is completely dead: register accesses have no effect, and it draws essentially no dynamic power.
Forget HAL. Forget Arduino. Forget every abstraction layer you've ever used. We're going bare metal. On a microcontroller, controlling hardware means writing specific values to specific memory addresses. These addresses are called registers — 32-bit locations that directly control hardware behavior.
Why bare metal? Because in real-time systems, you need to know exactly what your code does and exactly how long it takes. HAL functions hide complexity, add overhead, and make timing harder to reason about. A single HAL_GPIO_WritePin() call might take 8-15 cycles depending on debug checks. A direct register write compiles to a single store instruction with a small, fixed cost.
The B-L475E-IOT01A board has an LED connected to pin PB14 (Port B, pin 14). To blink it, we need three steps: (1) enable GPIOB's clock, (2) configure pin 14 as output, (3) toggle the pin.
Step 1: Enable GPIOB clock (RCC_AHB2ENR)
The RCC AHB2 peripheral clock enable register lives at address 0x4002_104C. Bit 1 controls GPIOB's clock.
```c
// RCC base: 0x40021000
// AHB2ENR offset: 0x4C
// Bit 1: GPIOBEN
*(volatile uint32_t*)0x4002104C |= (1 << 1);
// Equivalent: RCC->AHB2ENR |= RCC_AHB2ENR_GPIOBEN;
```
Step 2: Configure PB14 as general-purpose output (GPIOB_MODER)
The MODER register controls pin mode. Each pin uses 2 bits: 00=input, 01=output, 10=alternate function, 11=analog. Pin 14 occupies bits [29:28].
```c
// GPIOB base: 0x48000400
// MODER offset: 0x00
// Bits [29:28] for pin 14: set to 01 (output)
volatile uint32_t* GPIOB_MODER = (volatile uint32_t*)0x48000400;
*GPIOB_MODER &= ~(3 << 28);  // Clear bits 29:28
*GPIOB_MODER |= (1 << 28);   // Set to 01 (output)
```
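The clear-then-set pattern generalizes to any pin: pin n owns bits [2n+1:2n]. Here is a small host-side sketch that applies the pattern to a plain 32-bit value instead of the real register, so it can be checked on a PC. The function name `moder_set_mode` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Return a copy of a MODER value with one pin's 2-bit mode field replaced.
 * mode: 0 = input, 1 = output, 2 = alternate function, 3 = analog. */
uint32_t moder_set_mode(uint32_t moder, unsigned pin, uint32_t mode)
{
    assert(pin < 16 && mode < 4);
    moder &= ~(3u << (pin * 2));   /* clear the pin's two bits */
    moder |= mode << (pin * 2);    /* write the new mode */
    return moder;
}
```

Starting from 0xFFFFFFFF (every pin analog), setting pin 14 to output yields 0xDFFFFFFF: bit 29 is cleared and bit 28 stays set.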
Step 3: Toggle PB14 (GPIOB_ODR)
The Output Data Register (ODR) at offset 0x14 directly controls pin state. Bit 14 = pin 14.
```c
// GPIOB_ODR at 0x48000414
volatile uint32_t* GPIOB_ODR = (volatile uint32_t*)0x48000414;
*GPIOB_ODR ^= (1 << 14);  // XOR toggles the bit
```
```c
// Bare-metal LED blink: STM32L475, PB14
// No HAL, no libraries, no RTOS
#include <stdint.h>

#define RCC_AHB2ENR  (*(volatile uint32_t*)0x4002104C)
#define GPIOB_MODER  (*(volatile uint32_t*)0x48000400)
#define GPIOB_ODR    (*(volatile uint32_t*)0x48000414)

// Uncalibrated busy-wait: the real duration depends on the core clock
// and on the code the compiler generates for this loop
void delay(volatile uint32_t count) {
    while(count--);
}

int main(void) {
    // 1. Enable GPIOB clock
    RCC_AHB2ENR |= (1 << 1);

    // 2. Configure PB14 as output
    GPIOB_MODER &= ~(3 << 28);  // Clear
    GPIOB_MODER |= (1 << 28);   // Output mode

    // 3. Blink forever
    while(1) {
        GPIOB_ODR ^= (1 << 14);  // Toggle LED
        delay(800000);           // Visible blink; this code never sets up the PLL,
                                 // so the chip runs on its reset clock, not 80MHz
    }
}
```
That's 10 lines of actual logic. No initialization framework, no HAL_Init(), no SystemClock_Config() abstraction. You understand every single byte that flows to the hardware.
There's a subtle problem with ODR ^= (1 << 14). It's a read-modify-write operation: read ODR, XOR with mask, write back. If an interrupt fires between the read and write, and that ISR also modifies ODR, you get a race condition. The solution is the BSRR (Bit Set/Reset Register):
```c
// GPIOB_BSRR at 0x48000418
// Bits [15:0]: write 1 to SET corresponding pin
// Bits [31:16]: write 1 to RESET corresponding pin
#define GPIOB_BSRR (*(volatile uint32_t*)0x48000418)

GPIOB_BSRR = (1 << 14);       // SET pin 14 (atomic, single write)
GPIOB_BSRR = (1 << (14+16));  // RESET pin 14 (atomic, single write)
```
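To see why a BSRR write needs no read step, here is a host-side model of what the hardware does with the written word. It is a sketch for reasoning on a PC, not driver code; the function names are hypothetical, and the model assumes that a set bit wins when both halves name the same pin.

```c
#include <assert.h>
#include <stdint.h>

/* Word to write to BSRR to drive a pin high. */
uint32_t bsrr_set_mask(unsigned pin)
{
    assert(pin < 16);
    return 1u << pin;
}

/* Word to write to BSRR to drive a pin low. */
uint32_t bsrr_reset_mask(unsigned pin)
{
    assert(pin < 16);
    return 1u << (pin + 16);
}

/* Model of the ODR value after the hardware processes one BSRR write.
 * Only the pins named in the written word change; all others keep
 * their state, so no read-modify-write is needed in software. */
uint32_t odr_after_bsrr(uint32_t odr, uint32_t bsrr)
{
    uint32_t set   = bsrr & 0xFFFFu;
    uint32_t reset = (bsrr >> 16) & 0xFFFFu;
    return ((odr & ~reset) | set) & 0xFFFFu;
}
```

Because each write names only the pins it wants to change, an ISR and the main loop can drive different pins of the same port without corrupting each other's bits.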
Sometimes C isn't enough. When you need cycle-precise timing, when you're writing the first instructions that run at boot (the reset handler), or when you need to understand exactly what the compiler generated — you need assembly. The Cortex-M4 uses the Thumb-2 instruction set: a mix of 16-bit and 32-bit instructions that balances code density with performance.
Don't panic. ARM assembly is remarkably readable compared to x86. Most instructions do exactly one thing: load, store, add, compare, branch. No cryptic prefixes, no segment registers, no stack machine weirdness.
The Cortex-M4 has 16 general-purpose 32-bit registers:
| Register | Name | Purpose |
|---|---|---|
| R0–R3 | Arguments / scratch | Function arguments, return value (R0), caller-saved |
| R4–R11 | Callee-saved | Preserved across function calls, must be saved/restored |
| R12 | IP (Intra-Procedure) | Scratch register, used by linker veneers |
| R13 | SP (Stack Pointer) | Points to top of stack (two banks: MSP and PSP) |
| R14 | LR (Link Register) | Return address for function calls (BL stores PC here) |
| R15 | PC (Program Counter) | Address of next instruction to execute |
```arm
@ Data movement
MOV R0, #42            @ R0 = 42 (immediate value)
MOV R1, R0             @ R1 = R0 (register to register)
LDR R0, [R1]           @ R0 = memory[R1] (load from address)
STR R0, [R1]           @ memory[R1] = R0 (store to address)
LDR R0, =0x48000418    @ R0 = 0x48000418 (load constant)

@ Arithmetic
ADD R0, R1, R2         @ R0 = R1 + R2
SUB R0, R1, #1         @ R0 = R1 - 1
MUL R0, R1, R2         @ R0 = R1 * R2 (single cycle on M4!)

@ Bitwise
ORR R0, R0, #(1<<14)   @ Set bit 14
BIC R0, R0, #(1<<14)   @ Clear bit 14 (Bit Clear)
EOR R0, R0, #(1<<14)   @ Toggle bit 14 (XOR)

@ Compare and branch
CMP R0, #0             @ Compare R0 with 0 (sets flags)
BEQ label              @ Branch if equal (Z flag set)
BNE label              @ Branch if not equal
BL function            @ Branch with Link (call: saves return address to LR)
BX LR                  @ Branch to LR (return from function)
```
Let's write the PB14 toggle entirely in assembly. This version does a read-modify-write on ODR, addressing it relative to BSRR (ODR sits 4 bytes below BSRR):

```arm
@ Toggle PB14 via ODR read-modify-write: 4 instructions plus the return
toggle_led:
    LDR R0, =0x48000418    @ R0 = address of GPIOB_BSRR
    LDR R1, [R0, #-4]      @ R1 = GPIOB_ODR (0x414 = BSRR-4)
    EOR R1, R1, #(1<<14)   @ Toggle bit 14 in our copy
    STR R1, [R0, #-4]      @ Write back to ODR
    BX LR                  @ Return
```
Now compare to what the C compiler generates from GPIOB_ODR ^= (1 << 14); at -O2 optimization:
```arm
@ GCC -O2 output for GPIOB_ODR ^= (1 << 14)
LDR R3, =0x48000414    @ Load ODR address
LDR R2, [R3]           @ Read current ODR value
EOR R2, R2, #16384     @ XOR with (1<<14) = 16384
STR R2, [R3]           @ Write back
```
When calling a function from C or from another assembly routine, ARM follows strict rules:
Timers are the heartbeat of real-time systems. They generate periodic interrupts ("wake me up every 1ms"), measure external signal timing (input capture), and produce precise output waveforms (PWM). The STM32L475 has 16 timers. We'll focus on TIM2, a 32-bit general-purpose timer clocked at up to 80MHz.
A timer is surprisingly simple at its core: it's just a counter that increments every clock tick. When the counter reaches a programmed value, it resets to zero and optionally fires an interrupt. That's it. The complexity comes from the many ways you can configure the clock source, counting direction, and output behavior.
TIM2 base address: 0x4000_0000. The key registers:
| Register | Offset | Purpose |
|---|---|---|
| CR1 | 0x00 | Control register 1 — enable timer, set counting mode |
| DIER | 0x0C | DMA/Interrupt enable — which events generate interrupts |
| SR | 0x10 | Status register — which events have occurred (clear by writing 0) |
| CNT | 0x24 | Counter value — the actual 32-bit count |
| PSC | 0x28 | Prescaler — divides input clock by (PSC+1) |
| ARR | 0x2C | Auto-reload — counter resets when it reaches this value |
Goal: TIM2 fires an interrupt every 1ms (1kHz). The timer clock is 80MHz.
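We want $f_{\text{interrupt}} = 1000$ Hz. The update rate is the timer clock divided by both counters:

$$f_{\text{interrupt}} = \frac{f_{\text{clk}}}{(\text{PSC}+1)\,(\text{ARR}+1)}$$

So $(\text{PSC}+1)(\text{ARR}+1) = 80{,}000{,}000 / 1000 = 80{,}000$.

Choose PSC = 79, ARR = 999: the prescaler divides 80MHz by 80 to give a 1MHz tick, and the counter wraps every 1000 ticks.

Verify: 80MHz / 80,000 = 1000 Hz = 1ms period. Perfect.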
```c
// Configure TIM2 for 1ms interrupt at 80MHz

// 1. Enable TIM2 clock (RCC APB1ENR1, bit 0)
*(volatile uint32_t*)0x40021058 |= (1 << 0);  // RCC_APB1ENR1 |= TIM2EN

// 2. Set prescaler: divide 80MHz by 80 → 1MHz tick
*(volatile uint32_t*)0x40000028 = 79;         // TIM2_PSC = 79

// 3. Set auto-reload: count 1000 ticks → 1ms
*(volatile uint32_t*)0x4000002C = 999;        // TIM2_ARR = 999

// 4. Enable update interrupt (DIER bit 0 = UIE)
*(volatile uint32_t*)0x4000000C |= (1 << 0);  // TIM2_DIER |= UIE

// 5. Enable timer (CR1 bit 0 = CEN)
*(volatile uint32_t*)0x40000000 |= (1 << 0);  // TIM2_CR1 |= CEN

// 6. Enable TIM2 interrupt in NVIC (IRQ #28)
*(volatile uint32_t*)0xE000E100 |= (1 << 28); // NVIC_ISER0 bit 28
```
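The PSC/ARR arithmetic is easy to get wrong by one, so it can help to check candidate values on a PC before flashing. Below is a minimal sketch of that check. The function names are hypothetical, and the second helper finds only exact integer solutions for a given prescaler.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Update (interrupt) rate produced by a given PSC/ARR pair.
 * Both registers divide by (value + 1). */
uint32_t timer_update_hz(uint32_t clk_hz, uint32_t psc, uint32_t arr)
{
    uint64_t div = ((uint64_t)psc + 1u) * ((uint64_t)arr + 1u);
    return (uint32_t)(clk_hz / div);
}

/* For a chosen prescaler, compute the ARR that gives target_hz exactly.
 * Returns false when no exact integer ARR exists. */
bool timer_arr_for(uint32_t clk_hz, uint32_t psc, uint32_t target_hz,
                   uint32_t *arr_out)
{
    assert(arr_out != NULL && target_hz > 0 && psc <= 0xFFFFu);
    if (clk_hz % (psc + 1u) != 0) {
        return false;
    }
    uint32_t tick_hz = clk_hz / (psc + 1u);
    if (tick_hz < target_hz || tick_hz % target_hz != 0) {
        return false;
    }
    *arr_out = tick_hz / target_hz - 1u;
    return true;
}
```

For the configuration above, `timer_update_hz(80000000, 79, 999)` evaluates to 1000, and asking `timer_arr_for` for 1000 Hz with PSC = 79 yields ARR = 999.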
After enabling (CEN=1), the hardware does this in an infinite loop:
Your timer is counting. When it hits ARR, it needs to tell the CPU "hey, time's up!" It can't just wait for the CPU to check — that's polling, and polling wastes cycles. Instead, the timer sends an interrupt: an asynchronous hardware signal that forces the CPU to immediately stop what it's doing and jump to a handler function.
The NVIC (Nested Vectored Interrupt Controller) is the traffic cop. It receives interrupt requests from all 82 possible sources on the STM32L475 (timers, GPIO, UART, DMA, ADC...) and decides which one the CPU handles first, based on priority.
On the Cortex-M4, the time from interrupt assertion to first ISR instruction is 12 cycles (150ns at 80MHz). This includes stacking 8 registers. The NVIC also supports tail-chaining: if another interrupt is pending when an ISR returns, the CPU skips the unstack/restack sequence and jumps directly to the next handler in just 6 cycles.
| Event | Cycles | Time @ 80MHz |
|---|---|---|
| Interrupt entry (stacking + fetch) | 12 | 150 ns |
| Interrupt return (unstacking) | 12 | 150 ns |
| Tail-chain (back-to-back ISRs) | 6 | 75 ns |
| Late-arriving (higher priority during stack) | 0 extra | Redirects immediately |
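The time column is just cycles divided by the clock frequency. A small sketch of that conversion, useful when the core runs at something other than 80MHz, is shown here. The helper name `cycles_to_ns` is made up for illustration, and the result is rounded to the nearest nanosecond.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a cycle count to nanoseconds at a given core clock,
 * rounding to the nearest nanosecond. */
uint32_t cycles_to_ns(uint32_t cycles, uint32_t clk_hz)
{
    assert(clk_hz > 0);
    uint64_t scaled = (uint64_t)cycles * 1000000000u;
    return (uint32_t)((scaled + clk_hz / 2u) / clk_hz);
}
```

At 80MHz, 12 cycles come out as 150 ns and 6 cycles as 75 ns, matching the table. At the 4MHz MSI clock the same 12 cycles take 3000 ns.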
The STM32L475 uses 4 bits for priority (values 0–15, where 0 is highest priority). These 4 bits are split into preemption priority and sub-priority using a configurable group setting. With the default grouping (4 bits preempt, 0 sub):
```c
// Set TIM2 interrupt (IRQ #28) to priority 2
// NVIC_IPR registers at 0xE000E400, one byte per IRQ
// Priority in top 4 bits of the byte
*(volatile uint8_t*)(0xE000E400 + 28) = (2 << 4);

// Set EXTI0 (IRQ #6) to priority 1 (higher than TIM2)
*(volatile uint8_t*)(0xE000E400 + 6) = (1 << 4);
```
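The shift by 4 and the "lower number wins" rule can be captured in two small helpers. This is a host-side sketch with hypothetical names; it assumes the grouping described above, where all 4 implemented bits are preemption priority.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NVIC_PRIO_BITS 4u  /* priority bits implemented on this part */

/* Encode a priority level (0..15) into the byte written to NVIC_IPR.
 * The implemented bits sit at the top of the byte. */
uint8_t nvic_prio_byte(uint8_t prio)
{
    assert(prio < (1u << NVIC_PRIO_BITS));
    return (uint8_t)(prio << (8u - NVIC_PRIO_BITS));
}

/* With every bit used for preemption, a new interrupt preempts the
 * running handler only if its priority number is strictly lower. */
bool nvic_preempts(uint8_t new_prio, uint8_t running_prio)
{
    return new_prio < running_prio;
}
```

With the two priorities set above, EXTI0 (priority 1) preempts a running TIM2 handler (priority 2), but a second priority-2 interrupt has to wait until that handler returns.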
```c
// Configure EXTI0 for falling edge on PA0 (e.g. an externally wired button)
#include <stdint.h>
#define GPIOB_ODR (*(volatile uint32_t*)0x48000414)

void exti0_init(void) {
    // 1. Enable GPIOA clock
    *(volatile uint32_t*)0x4002104C |= (1 << 0);   // RCC_AHB2ENR bit 0

    // 2. Put PA0 in input mode. Pins reset to analog mode on this
    //    family, which disconnects the digital input that EXTI needs.
    *(volatile uint32_t*)0x48000000 &= ~(3u << 0); // GPIOA_MODER bits [1:0] = 00

    // 3. Enable SYSCFG clock (needed for EXTI mux)
    *(volatile uint32_t*)0x40021060 |= (1 << 0);   // RCC_APB2ENR bit 0

    // 4. Map EXTI0 to PA0 (SYSCFG_EXTICR1, bits [3:0] = 0000 = Port A)
    *(volatile uint32_t*)0x40010008 &= ~0xFu;      // SYSCFG_EXTICR1 bits[3:0] = PA

    // 5. Configure falling edge trigger (EXTI base 0x40010400, FTSR1 offset 0x0C)
    *(volatile uint32_t*)0x4001040C |= (1 << 0);   // EXTI_FTSR1 bit 0

    // 6. Unmask EXTI0 (IMR1 offset 0x00)
    *(volatile uint32_t*)0x40010400 |= (1 << 0);   // EXTI_IMR1 bit 0

    // 7. Enable EXTI0 in NVIC (IRQ #6)
    *(volatile uint32_t*)0xE000E100 |= (1 << 6);   // NVIC_ISER0
}

// ISR handler (name must match vector table)
void EXTI0_IRQHandler(void) {
    // Clear pending bit (write 1 to clear!), PR1 offset 0x14
    *(volatile uint32_t*)0x40010414 = (1 << 0);    // EXTI_PR1

    // Do something (toggle LED, set flag, etc.)
    GPIOB_ODR ^= (1 << 14);
}
```
Here's the golden rule of interrupt service routines: get in, do the minimum, get out. Every cycle you spend inside an ISR is a cycle where lower-priority interrupts are blocked. A long ISR doesn't just slow down your system — it can cause other interrupts to miss their deadlines.
What's "the minimum"? Set a flag. Copy one byte to a buffer. Start a DMA transfer. That's it. Never: allocate memory, call printf, do floating-point math, or loop over arrays inside an ISR.
The ISR sets a volatile flag. The main loop checks and clears it. The word volatile tells the compiler "this variable can change at any time outside normal program flow — never optimize away reads of it."
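```c
#include <stdint.h>

#define TIM2_SR (*(volatile uint32_t*)0x40000010)

// Application-defined (not shown); both run in main context
void process_sensor_data(void);
void update_display(void);

volatile uint8_t timer_flag = 0;

void TIM2_IRQHandler(void) {
    // Clear UIF only. SR bits are cleared by writing 0, so a plain write
    // avoids the read-modify-write that could wipe a flag set in between.
    TIM2_SR = ~(1u << 0);
    timer_flag = 1;  // Set flag
}
// ISR body: a handful of cycles, plus hardware entry and exit

int main(void) {
    // ... setup ...
    while(1) {
        if(timer_flag) {
            timer_flag = 0;
            // Do the heavy processing here (main context)
            process_sensor_data();
            update_display();
        }
    }
}
```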
When data arrives byte-by-byte (UART, SPI), you need a buffer. A ring buffer (circular buffer) lets the ISR write and the main loop read without blocking each other, as long as the buffer doesn't overflow.
```c
#include <stdint.h>

#define USART1_RDR (*(volatile uint32_t*)0x40013824)  // USART1 receive data register
#define BUF_SIZE 64  // Must be power of 2 for fast modulo

void process(uint8_t data);  // Application-defined (not shown)

volatile uint8_t buf[BUF_SIZE];
volatile uint8_t head = 0;  // ISR writes here
volatile uint8_t tail = 0;  // Main reads here

void USART1_IRQHandler(void) {
    uint8_t byte = (uint8_t)USART1_RDR;   // Read received byte
    buf[head] = byte;                     // Store in buffer (no overflow check)
    head = (head + 1) & (BUF_SIZE - 1);   // Advance head (wraps)
}

// Main loop reads when data available
void drain_rx_buffer(void) {
    while(tail != head) {
        uint8_t data = buf[tail];
        tail = (tail + 1) & (BUF_SIZE - 1);
        process(data);
    }
}
```
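The buffer above silently overwrites unread data once the producer laps the consumer. One common way to make the overflow visible is to keep one slot free and refuse the write when the buffer is full. The following is a host-testable sketch of that variant; the type and function names are hypothetical, and the buffer is kept tiny so it is easy to trace by hand.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RB_SIZE 8u  /* power of two; usable capacity is RB_SIZE - 1 */

typedef struct {
    volatile uint8_t data[RB_SIZE];
    volatile uint8_t head;  /* written only by the producer (the ISR) */
    volatile uint8_t tail;  /* written only by the consumer (main loop) */
} ring_t;

/* Producer side. Returns false and drops the byte when the buffer is full. */
bool rb_push(ring_t *rb, uint8_t byte)
{
    assert(rb != NULL);
    uint8_t next = (uint8_t)((rb->head + 1u) & (RB_SIZE - 1u));
    if (next == rb->tail) {
        return false;  /* full: advancing head would look like "empty" */
    }
    rb->data[rb->head] = byte;
    rb->head = next;
    return true;
}

/* Consumer side. Returns false when there is nothing to read. */
bool rb_pop(ring_t *rb, uint8_t *out)
{
    assert(rb != NULL && out != NULL);
    if (rb->tail == rb->head) {
        return false;
    }
    *out = rb->data[rb->tail];
    rb->tail = (uint8_t)((rb->tail + 1u) & (RB_SIZE - 1u));
    return true;
}
```

In an ISR, a false return from `rb_push` would typically bump an overrun counter so the main loop can report lost bytes. Each index has exactly one writer, which is what lets the two sides run without locking.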
head % BUF_SIZE is slow (division) unless the compiler can prove the size is a power of 2. head & (BUF_SIZE - 1) is always a single AND instruction when BUF_SIZE is a power of 2. In an ISR, every cycle counts.

When you're sampling at high rates (e.g., audio at 48kHz), you can't hand each sample to the main loop individually. Instead: fill one buffer while processing the other, then swap. The ISR fills Buffer A and sets a "buffer full" flag; main processes A while the ISR fills B; repeat.
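```c
#include <stdint.h>

#define ADC1_DR (*(volatile uint32_t*)0x50040040)  // ADC1 regular data register
#define BLOCK_SIZE 256

volatile int16_t bufA[BLOCK_SIZE];
volatile int16_t bufB[BLOCK_SIZE];
// The pointers are volatile too: the ISR swaps them behind main's back
volatile int16_t* volatile fill_buf = bufA;  // ISR writes here
volatile int16_t* volatile proc_buf = bufB;  // Main reads here
volatile uint16_t fill_idx = 0;
volatile uint8_t buffer_ready = 0;

// ADC1 and ADC2 share one vector; ST's startup file for this part
// names the handler ADC1_2_IRQHandler
void ADC1_2_IRQHandler(void) {
    fill_buf[fill_idx++] = (int16_t)ADC1_DR;  // Read sample
    if(fill_idx >= BLOCK_SIZE) {
        fill_idx = 0;
        // Swap buffers
        volatile int16_t* tmp = fill_buf;
        fill_buf = proc_buf;
        proc_buf = tmp;
        buffer_ready = 1;
    }
}
```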
| Anti-pattern | Why It's Fatal | Correct Alternative |
|---|---|---|
| printf() | Calls malloc, UART waits, 1000+ cycles | Set flag, print in main |
| malloc()/free() | Non-deterministic time, can fragment | Pre-allocate all buffers |
| for(i=0; i<1000;...) | Blocks all lower-priority interrupts | Process one element per ISR call |
| Float math | FPU context save adds 17 cycles entry | Use fixed-point in ISR, float in main |
| Forget to clear flag | ISR re-enters immediately = system hang | ALWAYS clear flag first thing |
Time to put it all together. We'll build a complete real-time data acquisition system: three timers running at different rates, an ADC sampling sensor data, DMA transferring results, and a main loop that processes data when buffers are full. This is exactly how a real embedded sensor node works.
| Timer | Rate | PSC | ARR | Purpose | Priority |
|---|---|---|---|---|---|
| TIM2 | 1 kHz | 79 | 999 | ADC trigger (sampling) | 1 (highest) |
| TIM3 | 10 Hz | 7999 | 999 | Display update | 3 |
| TIM7 | 1 Hz | 7999 | 9999 | Heartbeat LED | 7 (lowest) |
Verification: TIM3 = 80MHz / (8000 × 1000) = 10Hz. TIM7 = 80MHz / (8000 × 10000) = 1Hz. Correct.
The ADC sampling MUST happen exactly on time (hard real-time) — a jittered sample corrupts the frequency content of the signal. Display update can slip a few ms without anyone noticing (soft real-time). The heartbeat LED is purely cosmetic. Priority reflects criticality.
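```c
// Complete real-time data acquisition system
// STM32L475 @ 80MHz, bare metal
#include <stdint.h>

#define REG32(a)  (*(volatile uint32_t*)(a))
#define TIM2_SR   REG32(0x40000010)
#define TIM3_SR   REG32(0x40000410)
#define TIM7_SR   REG32(0x40001410)
#define ADC1_CR   REG32(0x50040008)
#define ADC1_DR   REG32(0x50040040)
#define GPIOB_ODR REG32(0x48000414)
#ifndef __WFI     // normally provided by CMSIS
#define __WFI()   __asm volatile ("wfi")
#endif

#define BLOCK_SIZE 256

// Application-defined setup and processing routines (not shown)
void setup_clocks(void); void setup_gpio(void); void setup_adc(void);
void setup_tim2(void);   void setup_tim3(void); void setup_tim7(void);
void process_block(volatile uint16_t* buf, uint32_t n);
void update_display(uint32_t count);

volatile uint16_t adc_bufA[BLOCK_SIZE];
volatile uint16_t adc_bufB[BLOCK_SIZE];
// Pointers are volatile too: the ADC ISR swaps them behind main's back
volatile uint16_t* volatile adc_fill = adc_bufA;
volatile uint16_t* volatile adc_proc = adc_bufB;
volatile uint16_t adc_idx = 0;
volatile uint8_t data_ready = 0;
volatile uint8_t display_flag = 0;
volatile uint32_t sample_count = 0;

// TIM2 ISR: 1kHz ADC sampling (priority 1)
void TIM2_IRQHandler(void) {
    TIM2_SR = 0;            // Clear ALL flags (fast: single write)
    ADC1_CR |= (1 << 2);    // Start ADC conversion (ADSTART)
}

// ADC ISR: conversion complete
// (ADC1 and ADC2 share a vector named ADC1_2_IRQHandler in ST's startup file)
void ADC1_2_IRQHandler(void) {
    adc_fill[adc_idx] = (uint16_t)ADC1_DR;  // Read clears EOC flag
    adc_idx++;
    sample_count++;
    if(adc_idx >= BLOCK_SIZE) {
        adc_idx = 0;
        volatile uint16_t* tmp = adc_fill;
        adc_fill = adc_proc;
        adc_proc = tmp;
        data_ready = 1;
    }
}

// TIM3 ISR: 10Hz display update (priority 3)
void TIM3_IRQHandler(void) {
    TIM3_SR = 0;
    display_flag = 1;
}

// TIM7 ISR: 1Hz heartbeat LED (priority 7)
void TIM7_IRQHandler(void) {
    TIM7_SR = 0;
    GPIOB_ODR ^= (1 << 14);  // Toggle LED
}

int main(void) {
    setup_clocks();  // 80MHz from PLL
    setup_gpio();    // PB14 output
    setup_adc();     // ADC1 channel, 12-bit
    setup_tim2();    // 1kHz
    setup_tim3();    // 10Hz
    setup_tim7();    // 1Hz

    while(1) {
        if(data_ready) {
            data_ready = 0;
            process_block(adc_proc, BLOCK_SIZE);
        }
        if(display_flag) {
            display_flag = 0;
            update_display(sample_count);
        }
        __WFI();  // Sleep until next interrupt
    }
}
```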
| ISR | Cycles | Time | % of Period |
|---|---|---|---|
| TIM2 (1kHz) | ~10 | 125ns | 0.013% |
| ADC1 (1kHz) | ~20 | 250ns | 0.025% |
| TIM3 (10Hz) | ~8 | 100ns | <0.001% |
| TIM7 (1Hz) | ~10 | 125ns | <0.001% |
| Total overhead | | | <0.05% |
The CPU spends 99.95% of its time either sleeping (WFI) or in the main loop processing data. The interrupt overhead is negligible because we followed the "short ISR" pattern.
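The overhead figure can be reproduced with a few lines of arithmetic: multiply each ISR's cycle count by its rate, sum, and divide by the core clock. The sketch below does that on a PC. The struct and function names are hypothetical, and the `overhead_cycles` parameter is an added assumption that lets you include hardware entry and exit cost, which the table's cycle counts appear to leave out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    const char *name;
    uint32_t cycles;   /* cycles spent in the ISR body */
    uint32_t rate_hz;  /* how often the interrupt fires */
} isr_cost_t;

/* Total cycles per second consumed by all ISRs.
 * overhead_cycles is added to every invocation (use 0 to ignore it). */
uint64_t isr_cycles_per_second(const isr_cost_t *isrs, size_t n,
                               uint32_t overhead_cycles)
{
    assert(isrs != NULL);
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++) {
        total += (uint64_t)(isrs[i].cycles + overhead_cycles) * isrs[i].rate_hz;
    }
    return total;
}

/* The same figure expressed as a percentage of the core clock. */
double isr_load_percent(const isr_cost_t *isrs, size_t n,
                        uint32_t overhead_cycles, uint32_t clk_hz)
{
    assert(clk_hz > 0);
    return 100.0 * (double)isr_cycles_per_second(isrs, n, overhead_cycles)
                 / (double)clk_hz;
}
```

Fed the four rows of the table with no extra overhead, this gives 30,090 cycles per second, about 0.038% of an 80MHz core. Adding 24 cycles of entry and exit per interrupt raises it to roughly 0.1%, still negligible.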
Here's a paradox: real-time systems must respond instantly, but many of them run on batteries. A sensor node that wakes up every second to read temperature, then sleeps for 999ms, can run for years on a coin cell. The trick is the WFI (Wait For Interrupt) instruction: it halts the CPU until the next interrupt fires. The core draws far less current while waiting and resumes within a few cycles when an interrupt arrives.
The STM32L475 is designed for exactly this use case. Its STOP2 mode draws only 1.1 microamps while keeping SRAM and registers alive. Wake-up time from STOP2 is about 3.3µs — fast enough for most applications.
| Mode | Current | Wake-up Time | What's Preserved | Wake Sources |
|---|---|---|---|---|
| Run | 100 µA/MHz | - | Everything | - |
| Sleep | ~25 µA/MHz | ~1 µs | All, CPU halted | Any interrupt |
| STOP2 | 1.1 µA | 3.3 µs | SRAM, regs, RTC | EXTI, RTC, LPTIM |
| Standby | 0.3 µA | 50 µs | Backup regs only | WKUP pins, RTC |
| Shutdown | 0.03 µA | ~ms | Nothing | WKUP pins |
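```c
// Simple sleep-between-interrupts pattern
// CPU runs at 80MHz only during ISR + main processing
// Sleeps at ~2mA the rest of the time
#include <stdint.h>
#ifndef __WFI     // normally provided by CMSIS
#define __WFI()   __asm volatile ("wfi")
#endif

// Application-defined (not shown)
extern volatile uint8_t data_ready;
void setup_all_peripherals(void);
void process_data(void);

int main(void) {
    setup_all_peripherals();
    while(1) {
        if(data_ready) {
            data_ready = 0;
            process_data();  // Runs at 80MHz, takes ~5ms
        }
        __WFI();  // ARM instruction: Wait For Interrupt
        // CPU sleeps here until ANY enabled interrupt fires
        // Waking from Sleep mode adds only about a cycle to normal interrupt entry
    }
}
```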
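The standard approach uses SysTick (a periodic 1ms interrupt) for timekeeping. But SysTick wakes the CPU 1000 times per second even when there's nothing to do. Tickless idle stops SysTick entirely and programs the RTC wake-up timer for the next scheduled event:

```c
// Tickless idle: sleep until next event, not next tick
#include <stdint.h>
#define SYSTICK_CTRL (*(volatile uint32_t*)0xE000E010)
#define SCB_SCR      (*(volatile uint32_t*)0xE000ED10)
#define RTC_CR       (*(volatile uint32_t*)0x40002808)
#define RTC_WUTR     (*(volatile uint32_t*)0x40002814)
#define PWR_CR1      (*(volatile uint32_t*)0x40007000)
#ifndef __WFI        // normally provided by CMSIS
#define __WFI()      __asm volatile ("wfi")
#endif

// Application-defined helpers (not shown)
extern uint32_t current_time_ms;
uint32_t get_next_scheduled_time(void);
uint32_t actual_sleep_duration(void);
void restore_80mhz_clock(void);

void tickless_sleep(void) {
    uint32_t next_event_ms = get_next_scheduled_time();
    uint32_t sleep_duration = next_event_ms - current_time_ms;

    // Disable SysTick
    SYSTICK_CTRL &= ~(1 << 0);

    // Program RTC wake-up timer (assumes RTC write protection is already
    // unlocked and the wake-up timer is currently disabled).
    // NOTE: WUTR counts wake-up clock ticks, not milliseconds; convert
    // sleep_duration for the wake-up clock you selected.
    RTC_WUTR = sleep_duration;
    RTC_CR |= (1 << 10);             // WUTE: enable wake-up timer

    // Enter STOP2 mode
    PWR_CR1 = (PWR_CR1 & ~7u) | 2u;  // LPMS[2:0] = 010 = Stop 2
    SCB_SCR |= (1 << 2);             // SLEEPDEEP = 1
    __WFI();

    // --- wake up here ---
    // Restore clocks (the core wakes on an internal oscillator, not the 80MHz PLL)
    restore_80mhz_clock();

    // Update time accounting
    current_time_ms += actual_sleep_duration();

    // Re-enable SysTick if needed
    SYSTICK_CTRL |= (1 << 0);
}
```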
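A sensor node that samples at 1Hz, processes for 5ms at 80MHz (8mA), then sleeps in STOP2 (1.1µA) for the remaining 995ms draws an average of (8mA × 5ms + 1.1µA × 995ms) / 1000ms ≈ 41.1µA.

With a CR2032 coin cell (225 mAh): 225mAh / 0.0411mA ≈ 5,475 hours, or about 228 days.

Compare to always-on at 80MHz (8mA): 225mAh / 8mA = 28 hours. Sleep mode gives you 195x improvement.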
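To explore other duty cycles, the battery estimate can be wrapped in two small functions. This is a back-of-the-envelope sketch with hypothetical names. It assumes a simple two-state model (active, then STOP2) and ignores wake-up transitions, battery self-discharge, and the limited pulse current a coin cell can deliver.

```c
#include <assert.h>

/* Time-weighted average current in microamps for a node that is active
 * for active_ms out of every period_ms and asleep the rest of the time. */
double avg_current_ua(double active_ua, double active_ms,
                      double sleep_ua, double period_ms)
{
    assert(period_ms > 0.0 && active_ms >= 0.0 && active_ms <= period_ms);
    double sleep_ms = period_ms - active_ms;
    return (active_ua * active_ms + sleep_ua * sleep_ms) / period_ms;
}

/* Idealized battery life in hours: capacity divided by average draw. */
double battery_life_hours(double capacity_mah, double avg_ua)
{
    assert(avg_ua > 0.0);
    return capacity_mah * 1000.0 / avg_ua;
}
```

With 8mA (8000µA) for 5ms out of every 1000ms and 1.1µA otherwise, the average is about 41.1µA, and a 225mAh cell lasts roughly 5,475 hours under this model, close to 195 times the always-on figure.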
You now understand the full stack: from real-time deadlines down to individual register bits. Let's consolidate with reference material, then look at where this knowledge leads.
| Desired Rate | PSC (80MHz clock) | ARR | Timer Tick |
|---|---|---|---|
| 1 Hz | 7999 | 9999 | 100µs |
| 10 Hz | 7999 | 999 | 100µs |
| 100 Hz | 799 | 999 | 10µs |
| 1 kHz | 79 | 999 | 1µs |
| 10 kHz | 7 | 999 | 100ns |
| 100 kHz | 0 | 799 | 12.5ns |
| 1 MHz (PWM) | 0 | 79 | 12.5ns |
| Scenario | Cycles | Time @ 80MHz |
|---|---|---|
| Normal entry | 12 | 150 ns |
| Tail-chain | 6 | 75 ns |
| Late-arriving (during stacking) | 0 extra | Redirects |
| Return from ISR | 12 | 150 ns |
| Wake from Sleep mode | 12 + 1 | ~162 ns |
| Wake from STOP2 | ~264 cycles | 3.3 µs |
| Instruction | Cycles | Effect |
|---|---|---|
| MOV Rd, #imm | 1 | Rd = immediate |
| LDR Rd, [Rn] | 2 | Rd = mem[Rn] |
| STR Rd, [Rn] | 2 | mem[Rn] = Rd |
| ADD/SUB | 1 | Arithmetic |
| MUL | 1 | 32-bit multiply (Cortex-M4) |
| B/BEQ/BNE | 1-3 | Branch (pipeline flush) |
| BL | 1+ | Call (saves return addr in LR) |
| PUSH/POP | 1+N | Stack N registers |
| WFI | 1+ | Sleep until interrupt |
Design a system with these requirements:
Key decisions: TIM4 @ 10kHz triggers PID ISR (priority 1). TIM1 counts encoder pulses continuously (no ISR needed — just read CNT). TIM2 generates PWM (no ISR needed — just update CCR1). A watchdog timer fires at 2kHz; if PID hasn't cleared its flag, kill the motor. UART telemetry uses DMA with ring buffer at priority 5.
| Aspect | Bare Metal (this lesson) | RTOS (FreeRTOS, Zephyr) |
|---|---|---|
| Timing control | Exact cycle counts | Tick-based (typically 1ms granularity) |
| Code size | Minimal (your code only) | +10-50KB kernel |
| Complexity | Simple for <5 tasks | Better for 10+ tasks |
| Worst-case latency | Predictable (you control everything) | Kernel overhead adds ~5-20µs |
| Concurrency model | Interrupts + main loop | Threads + semaphores + queues |
| When to use | Simple systems, maximum performance | Complex systems, many tasks |