Embedded Systems Engineering

Embedded Controller Programming

The embedded software stack, hardware peripherals, optimization, and IoT design — from register bits to shipping firmware.

Prerequisites: C programming basics + Binary/hex notation. That's it.
10 Chapters · 10+ Simulations · 0 Assumed Knowledge

Chapter 0: The Embedded Software Stack

On your laptop, you write Python. You call print() and text appears. You call open("file.txt") and the OS finds a disk, locates the file system, manages memory, schedules your process among hundreds of others. You never think about how hardware works because the operating system is a thick blanket between you and the metal.

On a microcontroller, there is no OS. There is no print(). There is no file system. There is no memory manager. There is no scheduler. You ARE the operating system. Your C code talks directly to hardware through memory-mapped registers — specific addresses in memory that, when written to, physically change the behavior of silicon.

The fundamental shift: In desktop programming, you ask the OS to do things. In embedded programming, you configure hardware directly by writing specific bit patterns to specific memory addresses. There is no intermediary. You control every electron.

But raw register-bashing everywhere creates unmaintainable spaghetti. So embedded engineers organize code into layers:

Application: your logic (read sensor, make decision, actuate)
Middleware: protocols, state machines, data processing
Drivers: peripheral-specific logic (UART driver, SPI driver)
HAL (Hardware Abstraction Layer): register access wrapped in portable functions
Hardware: MCU silicon (GPIO pins, timers, ADC, DMA, buses)

Why bother with layers? Portability. If you swap from an STM32F4 to an STM32L4, you only rewrite the HAL. Drivers, middleware, and application code stay the same. Without layers, changing MCU means rewriting everything.

Here is what a raw register write looks like vs. a HAL call:

c
// Raw register: turn on LED on PA5 (STM32F4)
*(volatile uint32_t*)0x40020014 |= (1 << 5);  // GPIOA->ODR bit 5

// HAL equivalent: portable, readable
HAL_GPIO_WritePin(GPIOA, GPIO_PIN_5, GPIO_PIN_SET);

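Both drive pin PA5 high and light the LED. The raw version read-modify-writes bit 5 of the GPIOA output data register; the HAL call sets the same pin through the atomic bit set/reset register (BSRR). Both assume the GPIOA clock is already enabled and PA5 is configured as an output. The HAL version is also easier to carry across STM32 families, because the call stays the same even when the register addresses change.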

Key insight: Every peripheral on an MCU is just a collection of registers at fixed memory addresses. A "driver" is just code that writes the right bits to the right addresses in the right order. The datasheet tells you which addresses and which bits. That's ALL embedded programming is.
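As a rough illustration of "the right bits at the right addresses", the sketch below applies the usual set, clear, and test idioms to an ordinary variable standing in for a register, so it can run on a PC. The names `fake_odr`, `reg_set_bit`, `reg_clear_bit`, and `reg_test_bit` are invented for this example; on real hardware the pointer would hold an address taken from the datasheet.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for a memory-mapped register (hypothetical, host-only).
 * On a real MCU this would be something like (volatile uint32_t *)0x40020014. */
static volatile uint32_t fake_odr = 0;

/* Set one bit without disturbing the others (read-modify-write). */
static inline void reg_set_bit(volatile uint32_t *reg, unsigned bit) {
    assert(bit < 32);
    *reg |= (UINT32_C(1) << bit);
}

/* Clear one bit without disturbing the others. */
static inline void reg_clear_bit(volatile uint32_t *reg, unsigned bit) {
    assert(bit < 32);
    *reg &= ~(UINT32_C(1) << bit);
}

/* Read back the state of one bit. */
static inline bool reg_test_bit(const volatile uint32_t *reg, unsigned bit) {
    assert(bit < 32);
    return ((*reg >> bit) & 1u) != 0;
}
```

Calling `reg_set_bit(&fake_odr, 5)` mirrors the raw `|= (1 << 5)` write shown earlier; the `volatile` qualifier is what stops the compiler from caching or removing the access.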

Notice how each layer only talks to the one directly below it. The application never writes to a register directly. The HAL never makes application-level decisions. This discipline is what makes firmware maintainable as projects grow to 50,000+ lines.

Layer | Example Function | Touches Hardware?
Application | read_temperature() | No
Middleware | filter_samples(buf, len) | No
Driver | i2c_read(addr, reg, data, len) | Via HAL
HAL | HAL_I2C_Mem_Read(...) | Yes (registers)
Hardware | I2C peripheral silicon | IS hardware
Why do embedded engineers organize code into layers (HAL, Drivers, Middleware, Application)?

Chapter 1: Startup Code & Initialization

You write int main(void) { ... } and assume it runs. But what happens before main? On your laptop, the OS loads your binary, sets up the stack, initializes libc, and jumps to main. On a microcontroller, there is no OS to do this. A small piece of assembly code — the startup code — does it instead.

When you press reset (or power on), the Cortex-M CPU does two things in hardware, then starts running code:

1. Load SP from address 0x00000000: the first 4 bytes of the vector table contain the initial stack pointer value.
2. Load PC from address 0x00000004: the next 4 bytes contain the address of Reset_Handler.
3. Execute Reset_Handler: the startup assembly code begins running.

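This is the vector table: a table of addresses stored at the very beginning of Flash memory (on STM32, Flash physically starts at 0x08000000 and is aliased to 0x00000000 when booting from Flash). The first entry is the initial stack pointer. The second is the reset handler address. The next entries are the system exception handlers (NMI, HardFault, and so on), followed by the peripheral interrupt handlers (we'll use those later).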

c
// Vector table (simplified): lives at address 0x08000000 in Flash
extern uint32_t _estack;               // Top of SRAM, defined by the linker script
void Reset_Handler(void);
void NMI_Handler(void);
void HardFault_Handler(void);

const uint32_t vectors[] __attribute__((section(".isr_vector"), used)) = {
    (uint32_t)&_estack,          // 0x00: Initial Stack Pointer (top of SRAM)
    (uint32_t)Reset_Handler,     // 0x04: Reset handler address
    (uint32_t)NMI_Handler,       // 0x08: Non-maskable interrupt
    (uint32_t)HardFault_Handler, // 0x0C: Hard fault
    // ... more interrupt vectors ...
};

The Reset_Handler is startup assembly code that performs these steps in order:

Boot sequence (Cortex-M):
1. Set stack pointer (already done by hardware from vector table)
2. Copy .data section from Flash to SRAM (initialized global variables)
3. Zero the .bss section in SRAM (uninitialized globals = 0)
4. Call SystemInit() — configure clock system
5. Call __libc_init_array() — C++ constructors (if any)
6. Call main()

Why copy .data from Flash? Because global variables like int counter = 42; need their initial value (42) stored somewhere permanent (Flash), but they live in SRAM at runtime so they can be modified. The startup code copies these initial values from Flash to SRAM.

Why zero .bss? The C standard guarantees that uninitialized globals start at zero. The startup code enforces this by memset-ing the .bss region to 0x00.
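The same two loops are often written in C rather than assembly. Here is a minimal host-runnable sketch: plain arrays stand in for the Flash image and the SRAM regions, and the function name `startup_init_ram` and its parameters are invented for illustration (a real startup file would use linker symbols instead of arguments).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy initial values for .data and clear .bss, one 32-bit word at a time.
 * Pointers are passed in so the routine can be exercised on a PC. */
static void startup_init_ram(const uint32_t *flash_init, /* .data image in Flash */
                             uint32_t *data_start, uint32_t *data_end,
                             uint32_t *bss_start, uint32_t *bss_end) {
    assert(flash_init != NULL);
    assert(data_start <= data_end && bss_start <= bss_end);

    const uint32_t *src = flash_init;
    for (uint32_t *dst = data_start; dst < data_end; ++dst) {
        *dst = *src++;          /* initialized globals get their Flash values */
    }
    for (uint32_t *dst = bss_start; dst < bss_end; ++dst) {
        *dst = 0;               /* uninitialized globals start at zero */
    }
}
```

Both regions are assumed to be word-aligned and a whole number of words long, which is what typical linker scripts arrange.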

arm-asm
/* Simplified Reset_Handler (Cortex-M3/M4, Thumb-2 syntax) */
Reset_Handler:
    /* Copy .data from Flash to SRAM */
    ldr  r0, =_sdata      @ destination start (SRAM)
    ldr  r1, =_edata      @ destination end
    ldr  r2, =_sidata     @ source (Flash)
copy_loop:
    cmp  r0, r1
    bhs  zero_bss         @ unsigned compare: addresses are not signed values
    ldr  r3, [r2], #4    @ load word from Flash, advance
    str  r3, [r0], #4    @ store word to SRAM, advance
    b    copy_loop

zero_bss:
    /* Zero .bss in SRAM */
    ldr  r0, =_sbss       @ bss start
    ldr  r1, =_ebss       @ bss end
    movs r2, #0
bss_loop:
    cmp  r0, r1
    bhs  call_main
    str  r2, [r0], #4    @ store zero, advance
    b    bss_loop

call_main:
    bl   SystemInit        @ configure clocks
    bl   __libc_init_array @ run static constructors (step 5 above)
    bl   main              @ jump to your code!
    b    .                 @ infinite loop if main returns
Key insight: If main() ever returns on a microcontroller, there's nothing to return TO. No OS to catch you. The startup code has an infinite loop after bl main as a safety net. In practice, your main() should contain while(1) { ... } and never return.

Here is the memory layout at boot:

Memory Region | Address Range (STM32F4) | Contents
Flash | 0x0800 0000 – 0x080F FFFF | Code + .data init values + vector table
SRAM | 0x2000 0000 – 0x2001 FFFF | .data (copied) + .bss (zeroed) + heap + stack
Peripherals | 0x4000 0000 – 0x5FFF FFFF | Memory-mapped registers
Cortex-M Core | 0xE000 0000 – 0xE00F FFFF | NVIC, SysTick, debug registers
What is stored at address 0x00000000 in the vector table of a Cortex-M MCU?

Chapter 2: Clock Configuration

Every digital circuit needs a clock — a periodic signal that tells transistors when to evaluate their inputs. The MCU's clock determines how fast instructions execute, how fast peripherals run, and how much power is consumed. A faster clock = faster execution but more power. Embedded systems carefully configure clocks to balance speed and battery life.

A typical STM32 has multiple clock sources:

Source | Frequency | Accuracy | Use Case
HSI (High-Speed Internal) | 8-16 MHz | ±1-2% | Fast boot, no external parts
HSE (High-Speed External) | 4-26 MHz crystal | ±20 ppm | Precise timing, USB, radio
LSI (Low-Speed Internal) | 32 kHz | ±5% | Watchdog timer
LSE (Low-Speed External) | 32.768 kHz crystal | ±20 ppm | RTC (real-time clock)
PLL (Phase-Locked Loop) | Up to 480 MHz | Derived from HSI/HSE | Maximum CPU speed

The PLL is the key. It takes a reference clock (HSI or HSE) and multiplies it up to a much higher frequency. The formula:

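$$f_{\mathrm{VCO}} = f_{\mathrm{input}} \times \frac{\mathrm{PLL\_N}}{\mathrm{PLL\_M}}$$
$$f_{\mathrm{SYSCLK}} = \frac{f_{\mathrm{VCO}}}{\mathrm{PLL\_P}}$$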

The clock tree distributes SYSCLK to all subsystems via prescalers (dividers):

HSE (8 MHz crystal): precise external source
↓ ÷ PLL_M
PLL input (1-2 MHz): must be 1-2 MHz for stability
↓ × PLL_N
VCO (100-432 MHz): internal oscillator
↓ ÷ PLL_P
SYSCLK (up to 168 MHz): CPU clock
↓ prescalers
AHB → APB1 (42 MHz) / APB2 (84 MHz): peripheral bus clocks

Worked example: Configure STM32F4 for 168 MHz from an 8 MHz crystal:
• PLL_M = 8 → PLL input = 8/8 = 1 MHz
• PLL_N = 336 → VCO = 1 × 336 = 336 MHz
• PLL_P = 2 → SYSCLK = 336/2 = 168 MHz
• AHB prescaler = 1 → HCLK = 168 MHz
• APB1 prescaler = 4 → APB1 = 42 MHz
• APB2 prescaler = 2 → APB2 = 84 MHz
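The arithmetic in the worked example can be checked mechanically. The following is a small sketch, not vendor code: `pll_compute`, `pll_result_t`, and the limit macros are names made up here, and the limits are simply the figures quoted in the clock tree above (1-2 MHz PLL input, 100-432 MHz VCO, 168 MHz SYSCLK).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t pll_in_hz;  /* f_input / M */
    uint64_t vco_hz;     /* f_input * N / M */
    uint64_t sysclk_hz;  /* f_VCO / P */
    bool     in_spec;    /* true if all limits quoted in this chapter are met */
} pll_result_t;

/* Limits taken from the clock-tree figures in this chapter (STM32F4). */
#define PLL_IN_MIN_HZ   UINT64_C(1000000)
#define PLL_IN_MAX_HZ   UINT64_C(2000000)
#define VCO_MIN_HZ      UINT64_C(100000000)
#define VCO_MAX_HZ      UINT64_C(432000000)
#define SYSCLK_MAX_HZ   UINT64_C(168000000)

/* Apply the two PLL formulas and flag out-of-range settings. */
static pll_result_t pll_compute(uint32_t input_hz, uint32_t m, uint32_t n, uint32_t p) {
    assert(m != 0 && p != 0);
    pll_result_t r;
    r.pll_in_hz = (uint64_t)input_hz / m;
    r.vco_hz    = (uint64_t)input_hz * n / m;   /* 64-bit to avoid overflow */
    r.sysclk_hz = r.vco_hz / p;
    r.in_spec   = r.pll_in_hz >= PLL_IN_MIN_HZ && r.pll_in_hz <= PLL_IN_MAX_HZ &&
                  r.vco_hz    >= VCO_MIN_HZ    && r.vco_hz    <= VCO_MAX_HZ &&
                  r.sysclk_hz <= SYSCLK_MAX_HZ;
    return r;
}
```

With `pll_compute(8000000, 8, 336, 2)` the result reproduces the 1 MHz, 336 MHz, and 168 MHz values from the example. Note that `p` here is the actual divider (2, 4, 6, or 8), not the 2-bit register encoding.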
c
// STM32F4 clock configuration for 168 MHz
void SystemClock_Config(void) {
    // Enable HSE
    RCC->CR |= RCC_CR_HSEON;
    while (!(RCC->CR & RCC_CR_HSERDY));  // Wait for HSE ready

    // Configure PLL: source=HSE, M=8, N=336, P=2, Q=7
    RCC->PLLCFGR = RCC_PLLCFGR_PLLSRC_HSE
                 | (8  << RCC_PLLCFGR_PLLM_Pos)   // M = 8
                 | (336 << RCC_PLLCFGR_PLLN_Pos)   // N = 336
                 | (0  << RCC_PLLCFGR_PLLP_Pos)   // P = 2 (0 = /2)
                 | (7  << RCC_PLLCFGR_PLLQ_Pos);  // Q = 7 (for USB)

    // Enable PLL
    RCC->CR |= RCC_CR_PLLON;
    while (!(RCC->CR & RCC_CR_PLLRDY));  // Wait for PLL lock

    // Configure Flash latency (5 wait states for 168 MHz)
    FLASH->ACR = FLASH_ACR_LATENCY_5WS | FLASH_ACR_PRFTEN | FLASH_ACR_ICEN;

    // Set bus prescalers: AHB=/1, APB1=/4, APB2=/2
    RCC->CFGR = RCC_CFGR_HPRE_DIV1
              | RCC_CFGR_PPRE1_DIV4
              | RCC_CFGR_PPRE2_DIV2
              | RCC_CFGR_SW_PLL;         // Switch SYSCLK to PLL

    while ((RCC->CFGR & RCC_CFGR_SWS) != RCC_CFGR_SWS_PLL); // Wait for switch
}
Flash wait states: Flash memory is slower than the CPU. At 168 MHz, the CPU needs data every 6 ns, but Flash takes ~30 ns. Solution: insert 5 "wait states" (the CPU pauses 5 extra cycles per Flash read) and enable prefetch + instruction cache to hide the latency.

If HSE = 8 MHz, PLL_M = 4, PLL_N = 168, PLL_P = 2, what is SYSCLK?

Chapter 3: DMA — Direct Memory Access

Imagine you're receiving 1024 bytes over SPI from a sensor. Without DMA, the CPU must execute a loop: read one byte from the SPI data register, store it in a buffer, repeat 1024 times. The CPU is fully occupied doing glorified copying — it can't process data, update displays, or handle other events during this time.

DMA (Direct Memory Access) is a hardware peripheral that copies data between memory and peripherals without CPU involvement. You tell the DMA controller: "Copy 1024 bytes from address X to address Y, then interrupt me when done." The CPU is free to do other work while the DMA engine handles the transfer in the background.

The analogy: Without DMA, you (the CPU) are a delivery driver carrying packages one at a time from the warehouse (peripheral) to the store (memory). With DMA, you hire a conveyor belt (DMA controller) to move packages automatically while you go do useful work. When the belt finishes, it rings a bell (interrupt).

DMA has several modes:

Mode | Description | Use Case
Normal | Transfer N items, stop, interrupt | One-shot sensor reads
Circular | Transfer N items, auto-restart from beginning | Continuous ADC sampling
Double-buffer | Alternate between two buffers automatically | Audio streaming (process buf A while filling buf B)

Transfer Direction | Source | Destination | Example
Peripheral → Memory | ADC_DR, SPI_DR | SRAM buffer | ADC samples to buffer
Memory → Peripheral | SRAM buffer | SPI_DR, DAC_DHR | Display framebuffer to SPI
Memory → Memory | SRAM region A | SRAM region B | Fast memcpy in hardware
c
// Configure DMA2 Stream0 to transfer 1024 ADC samples to SRAM buffer
uint16_t adc_buffer[1024];

void DMA_ADC_Init(void) {
    // Enable DMA2 clock
    RCC->AHB1ENR |= RCC_AHB1ENR_DMA2EN;

    // Disable stream before configuring
    DMA2_Stream0->CR &= ~DMA_SxCR_EN;
    while (DMA2_Stream0->CR & DMA_SxCR_EN); // Wait until disabled

    // Configure stream
    DMA2_Stream0->CR = (0 << DMA_SxCR_CHSEL_Pos)  // Channel 0 (ADC1)
                     | DMA_SxCR_MSIZE_0           // Memory size: 16-bit
                     | DMA_SxCR_PSIZE_0           // Peripheral size: 16-bit
                     | DMA_SxCR_MINC              // Memory address increment
                     | DMA_SxCR_CIRC              // Circular mode
                     | DMA_SxCR_TCIE;             // Transfer complete interrupt

    // Set addresses and count
    DMA2_Stream0->PAR  = (uint32_t)&ADC1->DR;       // Source: ADC data reg
    DMA2_Stream0->M0AR = (uint32_t)adc_buffer;      // Dest: our buffer
    DMA2_Stream0->NDTR = 1024;                     // Number of transfers

    // Enable stream
    DMA2_Stream0->CR |= DMA_SxCR_EN;
}
Critical detail — MINC vs PINC: We set MINC (memory increment) so each transfer goes to the next buffer slot. We do NOT set PINC (peripheral increment) because we always read from the same ADC data register address. Getting this wrong = data all in one slot, or reading garbage addresses.
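To make the increment and circular settings concrete, here is a toy software model of a single DMA stream that can run on a PC. It is only a sketch of the behavior described above: `dma_model_t`, `dma_model_start`, and `dma_model_step` are invented names, and real hardware does all of this without any code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model of one DMA stream moving 16-bit items; not a real driver. */
typedef struct {
    const uint16_t *src;       /* "peripheral" side */
    uint16_t       *dst;       /* memory side */
    size_t          count;     /* items per block (the NDTR reload value) */
    bool            pinc;      /* advance source after each item */
    bool            minc;      /* advance destination after each item */
    bool            circular;  /* restart automatically after each block */
    size_t          remaining; /* items left in the current block */
    size_t          src_off, dst_off;
} dma_model_t;

static void dma_model_start(dma_model_t *d) {
    assert(d->src != NULL && d->dst != NULL && d->count > 0);
    d->remaining = d->count;
    d->src_off = d->dst_off = 0;
}

/* Move one item. Returns true when this item completed a block,
 * the point where real hardware would raise transfer-complete. */
static bool dma_model_step(dma_model_t *d) {
    if (d->remaining == 0) return false;      /* normal mode: already stopped */
    d->dst[d->dst_off] = d->src[d->src_off];
    if (d->pinc) d->src_off++;
    if (d->minc) d->dst_off++;
    if (--d->remaining > 0) return false;
    if (d->circular) {                        /* reload counter, rewind addresses */
        d->remaining = d->count;
        d->src_off = d->dst_off = 0;
    }
    return true;
}
```

Running the model with `minc = false` shows the failure mode named above: every sample overwrites slot 0 of the buffer.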

The performance difference is dramatic. For 1024 16-bit ADC samples at 84 MHz APB2:

Method | CPU Cycles Used | CPU Availability
Polling (no DMA) | ~10,240 cycles | 0% during transfer
Interrupt per sample | ~30 cycles × 1024 = 30,720 | Intermittent (context switch overhead)
DMA | ~50 cycles (setup + ISR) | 99.5% during transfer
In circular DMA mode, what happens when the transfer counter reaches zero?

Chapter 4: Advanced Peripheral Interfacing

Now we combine everything: DMA, interrupts, and driver architecture to build real peripheral interfaces. Every production driver follows the same pattern: init → configure → start → ISR → callback. Let's build three complete drivers.

The driver pattern:
1. Init: Enable clock, configure GPIO pins, set peripheral registers
2. Configure: Set speed, mode, DMA/interrupt enables
3. Start: Enable peripheral, begin transfer
4. ISR: Hardware calls your interrupt handler when events occur
5. Callback: ISR calls application code (transfer complete, error, etc.)

SPI with DMA — OLED Display Driver

SPI (Serial Peripheral Interface) is a high-speed (up to 50 MHz) full-duplex bus with 4 wires: MOSI (data out), MISO (data in), SCK (clock), CS (chip select). Perfect for displays because it's fast and you're usually only sending data (display doesn't talk back much).

c
// SPI1 driver for SSD1306 OLED (128x64, monochrome)
// Pins: PA5=SCK, PA7=MOSI, PA4=CS, PA3=DC (data/command), PA2=RST

#define SSD1306_WIDTH   128
#define SSD1306_HEIGHT  64
#define SSD1306_BUFSIZE (SSD1306_WIDTH * SSD1306_HEIGHT / 8)  // 1024 bytes

static uint8_t framebuffer[SSD1306_BUFSIZE];
static volatile uint8_t dma_busy = 0;

void SPI1_Init(void) {
    // Enable clocks
    RCC->AHB1ENR |= RCC_AHB1ENR_GPIOAEN;
    RCC->APB2ENR |= RCC_APB2ENR_SPI1EN;
    RCC->AHB1ENR |= RCC_AHB1ENR_DMA2EN;

    // Configure PA5, PA7 as AF5 (SPI1), PA4/PA3/PA2 as output
    GPIOA->MODER |= (2 << 10) | (2 << 14)   // PA5, PA7 = alternate function
                  | (1 << 8) | (1 << 6) | (1 << 4); // PA4,3,2 = output
    GPIOA->AFR[0] |= (5 << 20) | (5 << 28);   // AF5 for PA5, PA7

    // SPI config: master, 8-bit, CPOL=0, CPHA=0, baud=/4 (21 MHz)
    SPI1->CR1 = SPI_CR1_MSTR | SPI_CR1_SSM | SPI_CR1_SSI
              | (1 << SPI_CR1_BR_Pos);  // Baud = APB2/4 = 84/4 = 21 MHz
    SPI1->CR2 = SPI_CR2_TXDMAEN;        // Enable DMA for TX
    SPI1->CR1 |= SPI_CR1_SPE;           // Enable SPI
    NVIC_EnableIRQ(DMA2_Stream3_IRQn);  // Let the DMA transfer-complete IRQ reach the CPU
}

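void SSD1306_Flush(void) {
    while (dma_busy);  // Wait for previous transfer
    dma_busy = 1;

    GPIOA->BSRR = (1 << 3);   // DC pin HIGH = data mode
    GPIOA->BSRR = (1 << 20);  // CS pin LOW (active): bit 4+16 resets PA4

    // Configure DMA2 Stream3 Ch3 for SPI1_TX
    DMA2_Stream3->CR &= ~DMA_SxCR_EN;
    while (DMA2_Stream3->CR & DMA_SxCR_EN);  // Wait until the stream is really off
    DMA2->LIFCR = DMA_LIFCR_CTCIF3 | DMA_LIFCR_CHTIF3 | DMA_LIFCR_CTEIF3
                | DMA_LIFCR_CDMEIF3 | DMA_LIFCR_CFEIF3;  // Clear stale flags
    DMA2_Stream3->CR = (3 << DMA_SxCR_CHSEL_Pos)
                     | DMA_SxCR_MINC | DMA_SxCR_DIR_0  // Mem-to-peripheral
                     | DMA_SxCR_TCIE;                    // Transfer complete IRQ
    DMA2_Stream3->PAR  = (uint32_t)&SPI1->DR;
    DMA2_Stream3->M0AR = (uint32_t)framebuffer;
    DMA2_Stream3->NDTR = SSD1306_BUFSIZE;
    DMA2_Stream3->CR  |= DMA_SxCR_EN;  // GO!
}

// DMA transfer complete ISR
void DMA2_Stream3_IRQHandler(void) {
    DMA2->LIFCR = DMA_LIFCR_CTCIF3;    // Clear interrupt flag
    // DMA finishes when the last byte is written to SPI1->DR, but the SPI is
    // still shifting it out. Wait for the bus to go idle before releasing CS.
    while (!(SPI1->SR & SPI_SR_TXE));
    while (SPI1->SR & SPI_SR_BSY);
    GPIOA->BSRR = (1 << 4);            // CS HIGH (deselect)
    dma_busy = 0;                      // Signal completion
}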

I2C with Interrupts — Sensor Polling

I2C (Inter-Integrated Circuit) is a 2-wire bus (SDA + SCL) that supports multiple devices on one bus, each with a unique 7-bit address. Slower than SPI (100-400 kHz typically) but uses fewer pins. Perfect for sensors.

c
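// I2C1 interrupt-driven read from BME280 sensor (address 0x76)
// Assumes I2C1 is initialized and I2C1_EV_IRQn is enabled in the NVIC
#define BME280_ADDR  (0x76 << 1)  // 7-bit addr shifted left for R/W bit

static volatile uint8_t i2c_buf[8];
static volatile uint8_t i2c_idx = 0;
static volatile uint8_t i2c_len = 0;   // Bytes requested (1..sizeof i2c_buf)
static volatile uint8_t i2c_done = 0;

void I2C1_Read_IT(uint8_t reg, uint8_t len) {
    i2c_idx = 0; i2c_len = len; i2c_done = 0;

    // Send register address (write phase, polled for brevity)
    I2C1->CR1 |= I2C_CR1_START;          // Generate START
    while (!(I2C1->SR1 & I2C_SR1_SB));    // Wait for START sent
    I2C1->DR = BME280_ADDR | 0;           // Address + Write
    while (!(I2C1->SR1 & I2C_SR1_ADDR));  // Wait for ACK
    (void)I2C1->SR2;                       // Clear ADDR flag
    I2C1->DR = reg;                        // Send register address
    while (!(I2C1->SR1 & I2C_SR1_BTF));   // Wait for byte transferred

    // Repeated START for read phase; the ISR handles everything after this
    I2C1->CR1 |= I2C_CR1_START | I2C_CR1_ACK;
    I2C1->CR2 |= I2C_CR2_ITBUFEN | I2C_CR2_ITEVTEN;  // Enable interrupts
}

void I2C1_EV_IRQHandler(void) {
    uint32_t sr1 = I2C1->SR1;

    if (sr1 & I2C_SR1_SB) {                  // Repeated START sent
        I2C1->DR = BME280_ADDR | 1;          // Address + Read
    } else if (sr1 & I2C_SR1_ADDR) {         // Address acknowledged
        if (i2c_len == 1)
            I2C1->CR1 &= ~I2C_CR1_ACK;       // Single byte: NACK it
        (void)I2C1->SR2;                     // Clear ADDR (SR1 was read above)
        if (i2c_len == 1)
            I2C1->CR1 |= I2C_CR1_STOP;
    } else if (sr1 & I2C_SR1_RXNE) {
        i2c_buf[i2c_idx++] = I2C1->DR;
        if (i2c_idx == i2c_len - 1) {        // The next byte is the last one
            I2C1->CR1 &= ~I2C_CR1_ACK;       // NACK it
            I2C1->CR1 |= I2C_CR1_STOP;       // Generate STOP after it arrives
        } else if (i2c_idx >= i2c_len) {     // e.g. 6 bytes for BME280 temp+pressure
            I2C1->CR2 &= ~(I2C_CR2_ITBUFEN | I2C_CR2_ITEVTEN);
            i2c_done = 1;
        }
    }
}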
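The address byte is a common source of confusion, so here is a tiny sketch of how the 7-bit address and the read/write flag combine into the first byte on the wire. The helper name `i2c_address_byte` is invented for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Build the first byte sent after START: 7-bit address in bits 7..1,
 * R/W flag in bit 0 (0 = write, 1 = read). */
static inline uint8_t i2c_address_byte(uint8_t addr7, bool read) {
    assert(addr7 <= 0x7F);
    return (uint8_t)((addr7 << 1) | (read ? 1u : 0u));
}
```

For the BME280 at 0x76 this gives 0xEC for a write and 0xED for a read, which is why datasheets and logic analyzers sometimes appear to disagree about a device's address.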

UART with Ring Buffer — GPS NMEA Parsing

UART (Universal Asynchronous Receiver/Transmitter) is the classic serial port. No clock wire — both sides agree on baud rate (e.g., 9600 or 115200 bits/sec). GPS modules output NMEA sentences like $GPGGA,123519,4807.038,N,01131.000,E,1,08,0.9,545.4,M,...

c
// Ring buffer UART RX for GPS NMEA at 9600 baud
#define RX_BUF_SIZE 256  // Must be power of 2
static volatile uint8_t rx_buf[RX_BUF_SIZE];
static volatile uint16_t rx_head = 0;  // ISR writes here
static volatile uint16_t rx_tail = 0;  // App reads from here

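void USART2_IRQHandler(void) {
    if (USART2->SR & USART_SR_RXNE) {
        uint8_t byte = USART2->DR;  // Reading DR clears RXNE
        if ((uint16_t)(rx_head - rx_tail) < RX_BUF_SIZE) {  // Drop the byte if full
            rx_buf[rx_head & (RX_BUF_SIZE - 1)] = byte;
            rx_head++;
        }
    }
}

uint16_t UART_Available(void) {
    return (uint16_t)(rx_head - rx_tail);  // Counters are free-running
}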

uint8_t UART_ReadByte(void) {
    while (rx_head == rx_tail);  // Block until data available
    uint8_t c = rx_buf[rx_tail & (RX_BUF_SIZE - 1)];
    rx_tail++;
    return c;
}
Why power-of-2 buffer size? The mask (RX_BUF_SIZE - 1) replaces expensive modulo with a fast bitwise AND. For size 256: idx & 0xFF wraps around automatically. This matters at 115200 baud where the ISR fires every 87 microseconds.
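The masking trick is easy to try on a PC. Below is a minimal, hardware-free sketch of the same idea with an explicit "full" check; the names `ring_t`, `ring_put`, `ring_get`, and `ring_count` are invented here, and the tiny 8-byte size is only there to make wraparound easy to observe.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RB_SIZE 8u   /* must be a power of two for the mask to work */

typedef struct {
    uint8_t  data[RB_SIZE];
    uint16_t head;   /* total bytes written (free-running counter) */
    uint16_t tail;   /* total bytes read (free-running counter) */
} ring_t;

/* Bytes currently stored; unsigned wraparound keeps this correct. */
static inline uint16_t ring_count(const ring_t *r) {
    return (uint16_t)(r->head - r->tail);
}

/* Producer side (what an RX ISR would call). Returns false if full. */
static inline bool ring_put(ring_t *r, uint8_t byte) {
    assert(r != 0);
    if (ring_count(r) >= RB_SIZE) return false;   /* drop instead of overwrite */
    r->data[r->head & (RB_SIZE - 1u)] = byte;     /* AND replaces modulo */
    r->head++;
    return true;
}

/* Consumer side. Returns false if empty. */
static inline bool ring_get(ring_t *r, uint8_t *out) {
    assert(r != 0 && out != 0);
    if (ring_count(r) == 0) return false;
    *out = r->data[r->tail & (RB_SIZE - 1u)];
    r->tail++;
    return true;
}
```

Because only the producer writes `head` and only the consumer writes `tail`, this layout suits the single-ISR, single-reader case; on a real target both counters would be `volatile` as in the driver above.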

In the SPI OLED driver, why do we use DMA for the framebuffer transfer instead of polling?

Chapter 5: Power Management

A battery-powered embedded device that runs at full speed all the time is a dead device in hours. A 240 mAh coin cell at 50 mA lasts 4.8 hours. But the same coin cell at 2 μA lasts 13.7 years. Power management isn't optimization — it's the difference between a viable product and an expensive paperweight.

STM32 MCUs build on the Cortex-M sleep and deep-sleep states to provide progressively deeper low-power modes, each trading more functionality for less power:

Mode | What's Off | What's On | Wake Source | Typical Current
Run | Nothing | Everything | N/A (already running) | 30-100 mA
Sleep | CPU core | All peripherals, SRAM, clocks | Any interrupt | 5-15 mA
Stop | CPU, HSE, PLL, most peripherals | SRAM content, LSI, RTC, wake pins | EXTI line, RTC alarm | 10-30 μA
Standby | Everything (SRAM lost!) | Backup domain, wake pin logic | WKUP pin, RTC, IWDG | 1-3 μA
Critical tradeoff: Stop mode preserves SRAM (your variables survive), so wakeup is fast — reconfigure clocks and resume. Standby mode erases SRAM, so wakeup is a full reboot (vector table, copy .data, zero .bss, the whole Chapter 1 sequence). Choose based on how much state you need to preserve.
c
// Enter Stop mode, wake on the RTC wakeup timer after wake_seconds
// Assumes RTC_Init() has enabled backup-domain access and the RTC clock
void Enter_Stop_Mode(uint32_t wake_seconds) {
    // Configure RTC wakeup timer
    RTC->WPR = 0xCA;  RTC->WPR = 0x53;  // Unlock RTC write protection
    RTC->CR &= ~RTC_CR_WUTE;                // Disable wakeup timer
    while (!(RTC->ISR & RTC_ISR_WUTWF));    // Wait for access
    RTC->CR = (RTC->CR & ~RTC_CR_WUCKSEL) | RTC_CR_WUCKSEL_2;  // Clock source = ck_spre (1 Hz)
    RTC->WUTR = wake_seconds - 1;           // Set countdown (1 Hz clock)
    RTC->ISR &= ~RTC_ISR_WUTF;              // Clear stale wakeup flag
    RTC->CR |= RTC_CR_WUTE | RTC_CR_WUTIE; // Enable timer + interrupt
    RTC->WPR = 0xFF;                        // Re-lock write protection

    // Configure EXTI line 22 (RTC wakeup) for rising edge
    EXTI->IMR  |= (1 << 22);
    EXTI->RTSR |= (1 << 22);
    EXTI->PR    = (1 << 22);                // Clear pending bit (write 1 to clear)

    // Enter Stop mode
    SCB->SCR |= SCB_SCR_SLEEPDEEP_Msk;  // Deep sleep (not regular sleep)
    PWR->CR  |= PWR_CR_LPDS;            // Low-power voltage regulator in Stop
    __WFI();                              // Wait For Interrupt (CPU stops here)

    // === Execution resumes here after wakeup ===
    SCB->SCR &= ~SCB_SCR_SLEEPDEEP_Msk; // Clear deep sleep bit
    SystemClock_Config();                 // Reconfigure clocks (PLL was off)
}
After Stop mode wakeup: The CPU resumes from the __WFI() instruction. But the HSE and PLL are OFF — the system is running on the slow HSI (16 MHz). You MUST call your clock configuration function again to get back to full speed. Forgetting this = your code runs 10x slower than expected after wakeup.

Battery life calculation for a duty-cycled IoT sensor:

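$$I_{\mathrm{avg}} = \frac{t_{\mathrm{active}} \, I_{\mathrm{active}} + t_{\mathrm{sleep}} \, I_{\mathrm{sleep}}}{t_{\mathrm{active}} + t_{\mathrm{sleep}}}$$
$$\mathrm{Life} = \frac{C_{\mathrm{battery}}}{I_{\mathrm{avg}}}$$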

Example: Active 2 seconds at 50 mA, sleep 298 seconds at 2 μA (a Standby-class figure in the table above; ultra-low-power parts such as the STM32L4 reach it in Stop 2):

Iavg = (2 × 50 + 298 × 0.002) / 300 = 100.596 / 300 = 0.335 mA
Life = 240 mAh / 0.335 mA = 716 hours = 30 days (coin cell)
Life = 3000 mAh / 0.335 mA = 8,955 hours = 373 days (18650 Li-ion)
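The same calculation is easy to wrap in two helper functions for trying other duty cycles. This is a sketch with invented names (`avg_current_ma`, `battery_life_hours`); it models an ideal battery and ignores self-discharge, regulator losses, and capacity derating.

```c
#include <assert.h>

/* Average current of a duty-cycled device, in mA. */
static double avg_current_ma(double t_active_s, double i_active_ma,
                             double t_sleep_s,  double i_sleep_ma) {
    assert(t_active_s >= 0.0 && t_sleep_s >= 0.0);
    assert(t_active_s + t_sleep_s > 0.0);
    return (t_active_s * i_active_ma + t_sleep_s * i_sleep_ma)
         / (t_active_s + t_sleep_s);
}

/* Ideal battery life in hours for a given capacity and average current. */
static double battery_life_hours(double capacity_mah, double i_avg_ma) {
    assert(i_avg_ma > 0.0);
    return capacity_mah / i_avg_ma;
}
```

Feeding in the numbers from the example (2 s at 50 mA, 298 s at 0.002 mA) gives roughly 0.335 mA average, matching the hand calculation. The active phase dominates so strongly that shortening it usually helps far more than shaving microamps off the sleep current.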
After waking from Stop mode, what must you do before your code runs at full speed?

Chapter 6: Software Optimization

On a desktop, you optimize for developer time. On an MCU with 256 KB Flash, 64 KB RAM, and no FPU, you optimize for everything: cycles, bytes, and watts. A function that takes 100 cycles instead of 10 doesn't just run slower — it drains 10x more battery during execution. Optimization here is a survival skill.

The three constraints of embedded:
1. Cycles: How many clock ticks does your code take? (speed)
2. Flash: How many bytes of compiled code? (storage)
3. RAM: How many bytes of data at runtime? (memory)
You can rarely optimize all three simultaneously. Trading one for another is the art.

Technique 1: Fixed-Point Math

Cortex-M0, M0+, and M3 cores have no floating-point unit. A single float multiply compiles to a software library call costing tens of cycles. The same operation in fixed-point (integers that represent fractions) takes only a few cycles on an M3: one 32×32→64-bit multiply plus a shift.

c
// Q16.16 fixed-point: upper 16 bits = integer, lower 16 bits = fraction
typedef int32_t fixed_t;

#define FIXED_SHIFT  16
#define FLOAT_TO_FIXED(f)  ((fixed_t)((f) * (1 << FIXED_SHIFT)))
#define FIXED_TO_FLOAT(x)  ((float)(x) / (1 << FIXED_SHIFT))
#define FIXED_MUL(a, b)    ((fixed_t)(((int64_t)(a) * (b)) >> FIXED_SHIFT))

// Example: compute sin(x) using fixed-point Taylor series
// sin(x) ≈ x - x³/6 + x⁵/120  (for small x in radians)
fixed_t fixed_sin(fixed_t x) {
    fixed_t x2 = FIXED_MUL(x, x);
    fixed_t x3 = FIXED_MUL(x2, x);
    fixed_t x5 = FIXED_MUL(x3, x2);
    return x - x3 / 6 + x5 / 120;  // A few multiplies and divides; far cheaper than a software float sin()
}
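To see that Q16.16 arithmetic behaves as described, here is a small host-runnable sketch using inline functions instead of macros. The names `q16_t`, `q16_mul`, `q16_div`, `q16_from_double`, and `q16_to_double` are invented for this example; the double conversions are only for checking results on a PC and would normally be avoided on an FPU-less target.

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t q16_t;                 /* Q16.16: 16 integer bits, 16 fraction bits */
#define Q16_ONE ((q16_t)1 << 16)       /* the value 1.0 */

/* Host-side helpers for building and inspecting values (truncate toward zero). */
static inline q16_t  q16_from_double(double v) { return (q16_t)(v * Q16_ONE); }
static inline double q16_to_double(q16_t v)    { return (double)v / Q16_ONE; }

/* Multiply in 64 bits so the intermediate product cannot overflow,
 * then shift back down to Q16.16. Relies on arithmetic right shift
 * for negative values, which GCC and Clang provide. */
static inline q16_t q16_mul(q16_t a, q16_t b) {
    return (q16_t)(((int64_t)a * b) >> 16);
}

/* Divide: scale the numerator up first to keep 16 fractional bits. */
static inline q16_t q16_div(q16_t a, q16_t b) {
    assert(b != 0);
    return (q16_t)(((int64_t)a * 65536) / b);
}
```

The format resolves steps of 1/65536 (about 0.000015) and covers roughly ±32768, so values outside that range need a different split between integer and fraction bits.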

Technique 2: Lookup Tables

Trade Flash for cycles. Pre-compute results and store them in a const array (lives in Flash, costs 0 RAM):

c
// 256-entry sine table: sin(x) for x = 0..255 mapped to 0..2π
// Values scaled to Q1.15 (multiply result by 1/32768 to get float)
static const int16_t sin_table[256] = {
    0, 804, 1608, 2410, 3212, 4011, 4808, 5602,
    6393, 7179, 7962, 8739, 9512, 10278, 11039, 11793,
    // ... 256 entries total, generated offline ...
    32767, 32757, 32728, /* ... */
};

// Lookup: a single table load (a few cycles) vs. roughly 200 cycles for a software float sin()
int16_t fast_sin(uint8_t angle) { return sin_table[angle]; }

// Cost: 512 bytes Flash. Savings: on the order of 200 cycles per call.

Technique 3: Loop Unrolling & Bit Manipulation

c
// Naive: branch per iteration (pipeline stall on Cortex-M)
for (int i = 0; i < 16; i++)
    dst[i] = src[i] * gain;

// Unrolled x4: fewer branches, better pipeline utilization
for (int i = 0; i < 16; i += 4) {
    dst[i]   = src[i]   * gain;
    dst[i+1] = src[i+1] * gain;
    dst[i+2] = src[i+2] * gain;
    dst[i+3] = src[i+3] * gain;
}

// Bit manipulation: count set bits (population count)
// Naive: test one bit per iteration, up to 32 iterations
uint32_t popcount_naive(uint32_t x) {
    uint32_t count = 0;
    while (x) { count += x & 1; x >>= 1; }
    return count;
}
// Optimized: Kernighan's trick — only loops for SET bits
uint32_t popcount_fast(uint32_t x) {
    uint32_t count = 0;
    while (x) { x &= x - 1; count++; }  // Clears lowest set bit each iter
    return count;
}

Technique 4: Compiler Flags

Flag | Optimizes For | Effect
-O0 | Debuggability | No optimization, 1:1 source mapping
-O2 | Speed | Aggressive: inlining, unrolling, scheduling
-Os | Size (Flash) | Like -O2 but skips optimizations that increase size
-Og | Debug + some speed | Moderate optimization, good debugging
-flto | Cross-file optimization | Link-time optimization, removes unused code
The -Os rule of thumb: For most embedded projects, -Os is the best default. It often produces code that's 20-40% smaller than -O2 with only a 5-10% speed penalty, though the exact numbers depend on the codebase. When your MCU has 64 KB Flash total, those saved bytes matter. Use -O2 only for specific hot functions via __attribute__((optimize("O2"))).

Why is fixed-point math preferred over float on Cortex-M0/M3 without an FPU?

Chapter 7: Watchdog & Fault Handling

Embedded systems must work for years without human intervention. But bugs happen: infinite loops, null pointer dereferences, stack overflows, corrupted state. On your laptop, the OS kills the process. On an MCU with no OS, a bug means the device hangs forever — unless you've planned for failure.

The Independent Watchdog (IWDG) is a hardware timer that counts down. Your code must periodically "kick" (reset) it before it reaches zero. If your code crashes or hangs, it can't kick the watchdog, the timer expires, and the hardware forces a full system reset. It's a dead man's switch.

c
// Configure IWDG for ~4 second timeout
// IWDG runs on LSI (32 kHz), independent of system clock
void IWDG_Init(void) {
    IWDG->KR = 0x5555;     // Enable register access
    IWDG->PR = 6;           // Prescaler /256 → 32000/256 = 125 Hz
    IWDG->RLR = 500;        // Reload = 500 → timeout = 500/125 = 4 seconds
    IWDG->KR = 0xCCCC;     // Start watchdog (CANNOT be stopped once started!)
}

// Call this in your main loop — if you don't call it within 4s, reset!
void IWDG_Kick(void) {
    IWDG->KR = 0xAAAA;     // Reload counter (kick the dog)
}

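// Typical usage pattern:
int main(void) {
    SystemClock_Config();  // SystemInit() already ran in the startup code
    IWDG_Init();
    while (1) {
        read_sensors();    // If this hangs → watchdog fires → reset
        process_data();
        transmit();
        IWDG_Kick();       // "I'm still alive!"
        Enter_Stop_Mode(2);  // Keep sleeps shorter than the 4 s timeout:
                             // the IWDG keeps counting in Stop mode
    }
}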
Critical: Once started, the IWDG cannot be stopped. This is intentional — if a bug could disable the watchdog, it would defeat the purpose. The watchdog uses the LSI oscillator, which runs independently of the main clock system. Even if your PLL hangs, the watchdog still counts.
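The timeout arithmetic in the comments above generalizes to any prescaler and reload value. The helpers below are a sketch with invented names (`iwdg_divider`, `iwdg_timeout_ms`, `iwdg_reload_for_ms`); they follow the same approximation as the code comments (timeout = reload / counter frequency) and treat the LSI as exactly its nominal frequency, which real parts only approximate.

```c
#include <assert.h>
#include <stdint.h>

/* Divider selected by the prescaler code: 0 -> /4, 1 -> /8, ... 6 -> /256. */
static inline uint32_t iwdg_divider(uint8_t pr) {
    assert(pr <= 6);
    return 4u << pr;
}

/* Approximate timeout in milliseconds for a given LSI frequency,
 * prescaler code, and 12-bit reload value. */
static inline uint32_t iwdg_timeout_ms(uint32_t lsi_hz, uint8_t pr, uint16_t reload) {
    assert(lsi_hz > 0 && reload <= 0x0FFF);
    return (uint32_t)(((uint64_t)reload * iwdg_divider(pr) * 1000u) / lsi_hz);
}

/* Reload value for a target timeout; returns 0 if it does not fit in 12 bits. */
static inline uint16_t iwdg_reload_for_ms(uint32_t lsi_hz, uint8_t pr, uint32_t timeout_ms) {
    assert(lsi_hz > 0);
    uint64_t ticks = ((uint64_t)timeout_ms * lsi_hz) / (1000u * (uint64_t)iwdg_divider(pr));
    return (ticks >= 1 && ticks <= 0x0FFF) ? (uint16_t)ticks : 0;
}
```

With a 32 kHz LSI and the /256 prescaler, the largest reload (4095) gives about 32.8 seconds, so a longer sleep cannot be covered by a single kick.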

HardFault Handler

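When the CPU hits an illegal operation (an access to an unmapped address, an undefined instruction, a bus error, or, if the corresponding traps are enabled in SCB->CCR, a divide by zero or unaligned access), it raises a fault that ends up in the HardFault exception unless the more specific fault handlers are enabled. Wild or null pointers and stack overflows typically surface this way once they touch invalid memory. By default the handler is an infinite loop (device hangs). A good HardFault handler logs the fault information for debugging: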

c
// HardFault handler that captures useful debug info
// Assumes a small assembly shim (not shown) passes the active stack pointer in r0
typedef struct {
    uint32_t r0, r1, r2, r3, r12, lr, pc, psr;
} StackFrame_t;

void HardFault_Handler_C(StackFrame_t *frame) {
    // frame->pc = the instruction that caused the fault
    // frame->lr = the return address (who called the faulting function)

    volatile uint32_t cfsr = SCB->CFSR;  // Configurable Fault Status Register
    volatile uint32_t hfsr = SCB->HFSR;  // HardFault Status Register
    volatile uint32_t mmfar = SCB->MMFAR; // MemManage Fault Address
    volatile uint32_t bfar = SCB->BFAR;   // Bus Fault Address

    // Decode fault type
    if (cfsr & 0x0001) { /* IACCVIOL: instruction access violation */ }
    if (cfsr & 0x0002) { /* DACCVIOL: data access violation (null ptr?) */ }
    if (cfsr & 0x0800) { /* UNSTKERR: bus fault while unstacking on exception return */ }
    if (cfsr & 0x1000) { /* STKERR: bus fault while stacking (often a stack overflow) */ }
    if (cfsr & 0x02000000) { /* DIVBYZERO: divide by zero (needs DIV_0_TRP enabled) */ }

    // Log to backup SRAM (survives reset) for post-mortem debugging
    // Assumes the backup SRAM clock and backup-domain write access are enabled
    *(volatile uint32_t*)0x40024000 = frame->pc;   // Faulting PC
    *(volatile uint32_t*)0x40024004 = cfsr;         // Fault type

    NVIC_SystemReset();  // Reset and hope for the best
}
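Once a CFSR value has been logged, it still has to be turned back into something readable. One possible approach is a table-driven decoder like the sketch below, which can run on a PC against a saved value. The names `cfsr_bit_t`, `cfsr_bits`, and `cfsr_decode` are invented here, and the table covers only a subset of the flags, using the bit positions defined by the ARMv7-M architecture.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uint32_t mask; const char *name; } cfsr_bit_t;

/* Subset of CFSR flags (bit positions per the ARMv7-M architecture). */
static const cfsr_bit_t cfsr_bits[] = {
    { UINT32_C(1) << 0,  "IACCVIOL"    },  /* MemManage: instruction access */
    { UINT32_C(1) << 1,  "DACCVIOL"    },  /* MemManage: data access */
    { UINT32_C(1) << 8,  "IBUSERR"     },  /* BusFault: instruction fetch */
    { UINT32_C(1) << 9,  "PRECISERR"   },  /* BusFault: precise data access */
    { UINT32_C(1) << 10, "IMPRECISERR" },  /* BusFault: imprecise data access */
    { UINT32_C(1) << 11, "UNSTKERR"    },  /* BusFault: unstacking */
    { UINT32_C(1) << 12, "STKERR"      },  /* BusFault: stacking */
    { UINT32_C(1) << 16, "UNDEFINSTR"  },  /* UsageFault: undefined instruction */
    { UINT32_C(1) << 17, "INVSTATE"    },  /* UsageFault: invalid state */
    { UINT32_C(1) << 24, "UNALIGNED"   },  /* UsageFault: unaligned access */
    { UINT32_C(1) << 25, "DIVBYZERO"   },  /* UsageFault: divide by zero */
};

/* Write the names of the set flags into out[] (at most max_names of them);
 * returns how many names were written. */
static size_t cfsr_decode(uint32_t cfsr, const char *out[], size_t max_names) {
    assert(out != NULL);
    size_t n = 0;
    for (size_t i = 0; i < sizeof cfsr_bits / sizeof cfsr_bits[0]; i++) {
        if ((cfsr & cfsr_bits[i].mask) && n < max_names) {
            out[n++] = cfsr_bits[i].name;
        }
    }
    return n;
}
```

Decoding offline keeps the fault handler itself tiny, which matters because the handler runs when the system is already in an unknown state.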
Stack overflow detection: Fill the bottom of the stack with a known pattern (e.g., 0xDEADBEEF). Periodically check if it's been overwritten. If yes, your stack has grown into the heap/bss — you need more stack space or fewer local variables. Some MCUs have hardware MPU (Memory Protection Unit) that can trap stack overflow automatically.
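The pattern check described above is sometimes called stack painting. Here is a minimal sketch that runs on a PC, with an ordinary array playing the role of the stack region; `stack_paint`, `stack_unused_words`, and `STACK_FILL` are names invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_FILL 0xDEADBEEFu

/* Paint a stack region (lowest address first) with the fill pattern.
 * On a target this would run early in startup, before the stack is deep. */
static void stack_paint(uint32_t *bottom, size_t words) {
    assert(bottom != NULL);
    for (size_t i = 0; i < words; i++) bottom[i] = STACK_FILL;
}

/* Count words at the bottom that still hold the pattern: the headroom
 * never touched so far. A result of 0 means the stack reached its limit. */
static size_t stack_unused_words(const uint32_t *bottom, size_t words) {
    assert(bottom != NULL);
    size_t n = 0;
    while (n < words && bottom[n] == STACK_FILL) n++;
    return n;
}
```

Because Cortex-M stacks grow downward, untouched words sit at the low end of the region, which is why the scan starts at `bottom`. The check is a heuristic: a stack value that happens to equal the pattern would be miscounted as unused.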

Why can't the IWDG be stopped once it's started?

Chapter 8: IoT Case Study — Environmental Monitor

Let's design a complete IoT product: an environmental monitoring node that measures temperature, humidity, and pressure, then transmits data wirelessly every 5 minutes. It must run for 2+ years on a single battery. This is a real product architecture used in agriculture, building management, and industrial monitoring.

System specification:
MCU: STM32L476 (ultra-low-power Cortex-M4, 80 MHz, 1 MB Flash, 128 KB RAM)
Sensor: BME280 (temp/humidity/pressure via I2C, address 0x76)
Radio: SX1276 (LoRa long-range radio via SPI, up to 15 km)
Power: 3000 mAh 18650 Li-ion cell + LDO regulator (3.3V)

Hardware Connections

Component | Interface | MCU Pins | Speed
BME280 | I2C1 | PB6 (SCL), PB7 (SDA) | 400 kHz
SX1276 | SPI1 | PA5 (SCK), PA6 (MISO), PA7 (MOSI), PA4 (NSS) | 10 MHz
SX1276 DIO0 | EXTI | PC4 (TX done interrupt) | N/A
Status LED | GPIO | PA0 (active low) | N/A

Firmware State Machine

SLEEP (Stop Mode 2): 2 μA. RTC counts down 5 minutes.
↓ RTC wakeup interrupt
WAKE: Reconfigure clocks (HSE+PLL), re-enable peripherals. 5 ms.
SENSE: I2C read BME280. Forced measurement mode. 50 ms.
TRANSMIT: SPI configure SX1276, send 12-byte LoRa packet. 100 ms @ 120 mA.
↓ DIO0 interrupt (TX done)
SLEEP: Disable peripherals, enter Stop Mode 2.
c
// Main firmware loop — complete IoT sensor node
typedef struct {
    int16_t  temperature;  // 0.01 °C resolution (2345 = 23.45°C)
    uint16_t humidity;     // 0.01 %RH resolution
    uint32_t pressure;     // Pa (101325 = 1013.25 hPa)
    uint16_t battery_mv;   // Battery voltage in mV
    uint16_t seq_num;      // Packet sequence number
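} SensorPacket_t;  // 12 bytes total

// Set to 1 by the DIO0 EXTI handler when the radio reports TX done.
// volatile so the wait loop in main() re-reads it on every pass.
static volatile uint8_t sx1276_tx_done = 0;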

int main(void) {
    HAL_Init();
    SystemClock_Config();  // 80 MHz from HSE+PLL
    GPIO_Init();
    I2C1_Init();           // 400 kHz for BME280
    SPI1_Init();           // 10 MHz for SX1276
    RTC_Init();            // LSE 32.768 kHz crystal
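    IWDG_Init();           // 8 second timeout. The 5-minute Stop period is far longer,
                           // so the IWDG must be frozen in Stop (IWDG_STOP option bit).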

    BME280_Init();         // Configure oversampling, filter
    SX1276_Init();         // Configure LoRa: SF7, BW125, CR4/5

    uint16_t seq = 0;

    while (1) {
        // === SENSE ===
        SensorPacket_t pkt;
        BME280_TriggerMeasurement();       // Start forced conversion
        HAL_Delay(50);                      // Wait for measurement
        BME280_Read(&pkt.temperature, &pkt.humidity, &pkt.pressure);
        pkt.battery_mv = ADC_ReadBattery();
        pkt.seq_num = seq++;

        // === TRANSMIT ===
        SX1276_Transmit((uint8_t*)&pkt, sizeof(pkt));
        while (!sx1276_tx_done);            // Wait for DIO0 interrupt
        sx1276_tx_done = 0;

        // === SLEEP ===
        SX1276_Sleep();                    // Put radio in sleep (1 µA)
        IWDG_Kick();                       // Kick before sleeping
        Enter_Stop_Mode(300);              // Sleep 5 minutes

        // === WAKE (execution resumes here) ===
        SystemClock_Config();              // Restore 80 MHz
        IWDG_Kick();                       // Kick immediately after wake
    }
}
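
The loop above transmits the struct's in-memory bytes, which ties the wire format to the compiler's padding rules and the CPU's byte order. One alternative is to pack each field explicitly. The sketch below assumes a little-endian wire format, which the design above does not actually specify. `packet_serialize`, `put_u16`, and `put_u32` are illustrative names.

```c
#include <stddef.h>
#include <stdint.h>

#define PACKET_WIRE_SIZE 12u

typedef struct {
    int16_t  temperature;  /* 0.01 degC resolution */
    uint16_t humidity;     /* 0.01 %RH resolution  */
    uint32_t pressure;     /* Pa                   */
    uint16_t battery_mv;   /* mV                   */
    uint16_t seq_num;      /* Packet sequence number */
} SensorPacket_t;

/* Write a 16-bit value, least significant byte first. */
static void put_u16(uint8_t *dst, uint16_t v)
{
    dst[0] = (uint8_t)(v & 0xFFu);
    dst[1] = (uint8_t)(v >> 8);
}

/* Write a 32-bit value, least significant byte first. */
static void put_u32(uint8_t *dst, uint32_t v)
{
    put_u16(dst, (uint16_t)(v & 0xFFFFu));
    put_u16(dst + 2, (uint16_t)(v >> 16));
}

/* Pack the packet into out[] field by field; returns the number of bytes written. */
size_t packet_serialize(const SensorPacket_t *pkt, uint8_t out[PACKET_WIRE_SIZE])
{
    put_u16(&out[0],  (uint16_t)pkt->temperature);  /* two's complement on the wire */
    put_u16(&out[2],  pkt->humidity);
    put_u32(&out[4],  pkt->pressure);
    put_u16(&out[8],  pkt->battery_mv);
    put_u16(&out[10], pkt->seq_num);
    return PACKET_WIRE_SIZE;
}
```

The receiver then decodes the same 12 bytes regardless of which compiler or CPU built the firmware, at the cost of a few extra instructions per packet.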

Power Budget

Phase | Duration | Current | Charge per Cycle
Wake + Clock config | 5 ms | 10 mA | 0.0139 μAh
I2C sensor read | 50 ms | 5 mA | 0.0694 μAh
SPI radio TX | 100 ms | 120 mA | 3.333 μAh
Stop Mode sleep | 299.845 s | 2 μA | 0.1666 μAh
Total per cycle | 300 s | - | 3.583 μAh
Iavg = 3.583 μAh / (300/3600 h) = 3.583 / 0.0833 = 43 μA
Battery life = 3000 mAh / 0.043 mA = 69,767 hours ≈ 7.96 years
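Result: With a 3000 mAh 18650 battery, the ideal calculation gives nearly 8 years on a single charge. In practice, battery self-discharge (for Li-ion cells, commonly quoted at a few percent per month rather than per year), LDO quiescent current, and capacity fade all shorten this. Treat 8 years as an upper bound and confirm the margin over the 2-year requirement by measurement.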
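The budget arithmetic is easy to get wrong by a factor of 1000, so it helps to script it. The sketch below reproduces the table's calculation. The struct and function names are illustrative, and the lifetime figure is the ideal one: it ignores self-discharge and regulator losses and assumes 8760 hours per year.

```c
#include <stddef.h>

/* One phase of the duty cycle: how long it lasts and what it draws. */
typedef struct {
    double duration_s;
    double current_ma;
} phase_t;

/* Charge drawn over one full cycle, in microamp-hours. */
double cycle_charge_uah(const phase_t *phases, size_t n)
{
    double ma_seconds = 0.0;
    for (size_t i = 0; i < n; i++) {
        ma_seconds += phases[i].duration_s * phases[i].current_ma;
    }
    return ma_seconds / 3600.0 * 1000.0;  /* mA*s -> mAh -> uAh */
}

/* Average current in microamps over a cycle lasting period_s seconds. */
double average_current_ua(double charge_uah, double period_s)
{
    return charge_uah / (period_s / 3600.0);
}

/* Ideal battery life in years: capacity divided by average current. */
double battery_life_years(double capacity_mah, double avg_ua)
{
    return capacity_mah / (avg_ua / 1000.0) / 8760.0;
}
```

Feeding in the four phases from the table (5 ms at 10 mA, 50 ms at 5 mA, 100 ms at 120 mA, 299.845 s at 0.002 mA) gives about 3.58 μAh per cycle, about 43 μA average, and just under 8 years for a 3000 mAh cell. Changing the TX duration or current in the array shows immediately how strongly the radio drives the result.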
IoT System Simulator

Watch the complete duty cycle: sleep → wake → sense → transmit → sleep. Time is accelerated 1000x.

In the power budget, which phase dominates the energy consumption per cycle?

Chapter 9: Mastery & Connections

You now understand the complete embedded firmware stack: from the first byte of the vector table through clock configuration, DMA transfers, peripheral drivers, power optimization, fault handling, and full IoT system design. Let's consolidate with reference tables and a design challenge.

Peripheral Configuration Cheat Sheet

Peripheral | Enable Clock | Key Registers | Typical Config
GPIO | RCC->AHB1ENR | MODER, ODR, IDR, AFR | Set MODER for mode, AFR for alternate function
SPI | RCC->APB1/2ENR | CR1, CR2, DR, SR | Master, 8-bit, CPOL/CPHA, baud prescaler
I2C | RCC->APB1ENR | CR1, CR2, DR, SR1, SR2 | 400 kHz, 7-bit addr, interrupt mode
UART | RCC->APB1/2ENR | CR1, BRR, DR, SR | BRR = PCLK/baud, enable RXNEIE for IRQ
ADC | RCC->APB2ENR | CR1, CR2, SQR, DR | 12-bit, single conversion, DMA enable
Timer | RCC->APB1/2ENR | CR1, PSC, ARR, CCR | PSC = timer clock/desired tick rate - 1, ARR = period - 1
DMA | RCC->AHB1ENR | CR, PAR, M0AR, NDTR | Channel select, direction, sizes, MINC, CIRC
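
The timer and UART formulas in the table can be sketched as small helpers. The function names are illustrative. The BRR helper assumes 16x oversampling, where the register value is simply the clock divided by the baud rate. Note that on STM32 parts the timer input clock is often twice PCLK when the APB prescaler is not 1, so pass the actual timer clock, not PCLK.

```c
#include <stdint.h>

/* PSC value so the counter ticks at tick_hz, given the timer's input clock. */
uint32_t timer_psc(uint32_t timer_clk_hz, uint32_t tick_hz)
{
    return timer_clk_hz / tick_hz - 1u;
}

/* ARR value for an update event every period_ticks counter ticks. */
uint32_t timer_arr(uint32_t period_ticks)
{
    return period_ticks - 1u;
}

/* BRR value for 16x oversampling, rounded to the nearest integer. */
uint32_t uart_brr(uint32_t pclk_hz, uint32_t baud)
{
    return (pclk_hz + baud / 2u) / baud;
}
```

For example, assuming an 84 MHz timer clock, `timer_psc(84000000, 1000000)` yields 83 for a 1 MHz tick, and `timer_arr(1000)` yields 999 for a 1 ms update period. Both registers hold "count minus one" because the hardware counts from zero.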

DMA Channel/Stream Assignment (STM32F4)

DMA | Stream | Channel | Peripheral
DMA2 | Stream 0 | Ch 0 | ADC1
DMA2 | Stream 3 | Ch 3 | SPI1_TX
DMA2 | Stream 2 | Ch 3 | SPI1_RX
DMA1 | Stream 5 | Ch 4 | USART2_RX
DMA1 | Stream 6 | Ch 4 | USART2_TX
DMA1 | Stream 0 | Ch 1 | I2C1_RX

Power Optimization Checklist

Before shipping any battery-powered firmware:
□ All unused GPIO pins set to analog mode (lowest leakage)
□ All unused peripheral clocks disabled
□ Debug pins (SWD) disabled in production build
□ Flash prefetch enabled during Run mode
□ Voltage regulator in low-power mode during Stop
□ LSE crystal for RTC (not LSI — 100x more accurate for long sleeps)
□ Radio in sleep mode when not transmitting
□ Sensors in forced/one-shot mode (not continuous)
□ Measure actual current with multimeter — never trust calculations alone

Design Challenge: Smart Door Lock

Your turn. Design the firmware architecture for a battery-powered smart door lock with these components:

MCU: STM32L4 (ultra-low-power)
BLE radio: nRF52832 module (SPI, 10 mA active, 2 μA sleep)
Motor: DC motor with H-bridge (500 mA for 1 second to lock/unlock)
Accelerometer: LIS3DH (I2C, wake-on-motion interrupt)
Battery: 4x AA in series (6 V, 2500 mAh), regulated to 3.3 V by an LDO

Questions to answer:
1. What's the sleep mode? (accelerometer wake-on-motion as primary wake source)
2. What's the duty cycle? (mostly sleeping, BLE advertising only after motion detected)
3. How do you prevent unauthorized motor activation? (crypto challenge-response over BLE)
4. Estimated battery life? (Calculate: 99.99% sleep @ 5 μA + 10 unlocks/day @ 500 mA for 1s)

Comparison: Bare-Metal vs RTOS

Aspect | Bare-Metal (this lesson) | RTOS (FreeRTOS, Zephyr)
Complexity | Low (super-loop) | Medium (tasks, queues, mutexes)
RAM overhead | 0 bytes | ~2-8 KB (kernel + task stacks)
Timing | Deterministic (you control every cycle) | Preemptive (scheduler decides)
Concurrency | ISR + main loop | Multiple tasks + ISR
Best for | Simple sensors, tight power budgets | Complex systems (USB + BLE + display + ...)
Debug difficulty | Low | High (race conditions, priority inversion)

Where to Go Next

You've mastered bare-metal embedded C. The natural next steps:

RTOS — FreeRTOS or Zephyr for complex multi-task systems

Hardware design — Schematic + PCB layout to build your own boards

Wireless protocols — BLE, LoRa, Zigbee, Thread/Matter for IoT

Motor control — PWM, PID loops, FOC for brushless motors

"What I cannot create, I do not understand." — Richard Feynman

You now have enough knowledge to create a complete embedded system from scratch: write the startup code, configure clocks, set up DMA transfers, drive peripherals, manage power, handle faults, and design for years of battery life. Go build something.
For a battery-powered IoT sensor that reads data once every 10 minutes, which approach minimizes power consumption?