Embedded C — From Bare Metal to IoT Mastery

Chapter 0: The Embedded Software Stack

On your laptop, you write Python. You call print() and text appears. You call open("file.txt") and the OS finds a disk, locates the file system, manages memory, schedules your process among hundreds of others. You never think about how hardware works because the operating system is a thick blanket between you and the metal.

On a microcontroller, there is no OS. There is no print(). There is no file system. There is no memory manager. There is no scheduler. You ARE the operating system. Your C code talks directly to hardware through memory-mapped registers — specific addresses in memory that, when written to, physically change the behavior of silicon.

The fundamental shift: In desktop programming, you ask the OS to do things. In embedded programming, you configure hardware directly by writing specific bit patterns to specific memory addresses. There is no intermediary. You control every electron.

But raw register-bashing everywhere creates unmaintainable spaghetti. So embedded engineers organize code into layers:

Application

Your logic: read sensor, make decision, actuate

↓

Middleware

Protocols, state machines, data processing

↓

Drivers

Peripheral-specific logic (UART driver, SPI driver)

↓

HAL (Hardware Abstraction Layer)

↓

Hardware

MCU silicon: GPIO pins, timers, ADC, DMA, buses

Why bother with layers? Portability. If you swap from an STM32F4 to an STM32L4, you only rewrite the HAL. Drivers, middleware, and application code stay the same. Without layers, changing MCU means rewriting everything.

Here is what a raw register write looks like vs. a HAL call:

c
// Raw register: turn on LED on PA5 (STM32F4)
*(volatile uint32_t*)0x40020014 |= (1 << 5);  // GPIOA->ODR bit 5

// HAL equivalent: portable, readable
HAL_GPIO_WritePin(GPIOA, GPIO_PIN_5, GPIO_PIN_SET);

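Both drive pin PA5 high and light the LED. The raw version sets bit 5 of the GPIOA output data register with a read-modify-write; the HAL call reaches the same pin through the port's atomic bit set/reset register, so it cannot disturb the other pins if an interrupt fires mid-update. The HAL version also keeps the same call signature across STM32 families.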

Key insight: Every peripheral on an MCU is just a collection of registers at fixed memory addresses. A "driver" is just code that writes the right bits to the right addresses in the right order. The datasheet tells you which addresses and which bits. That's ALL embedded programming is.
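
To make the "registers at fixed addresses" idea concrete, here is a minimal sketch that runs on a PC. The struct mirrors the STM32F4 GPIO register order so that ODR lands at offset 0x14, matching the raw address used earlier. The names `gpio_regs_t`, `gpio_set_pin`, and `gpio_clear_pin` are illustrative assumptions, not vendor code.

```c
#include <stddef.h>
#include <stdint.h>

/* Register layout of one GPIO port (hypothetical type name; the field
 * order follows the STM32F4 GPIO register map, ODR at offset 0x14). */
typedef struct {
    volatile uint32_t MODER;    /* 0x00: pin mode */
    volatile uint32_t OTYPER;   /* 0x04: output type */
    volatile uint32_t OSPEEDR;  /* 0x08: output speed */
    volatile uint32_t PUPDR;    /* 0x0C: pull-up / pull-down */
    volatile uint32_t IDR;      /* 0x10: input data */
    volatile uint32_t ODR;      /* 0x14: output data */
} gpio_regs_t;

/* Drive one pin high with a read-modify-write of ODR. */
void gpio_set_pin(gpio_regs_t *port, unsigned pin) {
    port->ODR |= (1u << pin);
}

/* Drive one pin low. */
void gpio_clear_pin(gpio_regs_t *port, unsigned pin) {
    port->ODR &= ~(1u << pin);
}
```

On a PC the port can be an ordinary struct in RAM, which is handy for unit tests. On the target, passing `(gpio_regs_t *)0x40020000` (the GPIOA base) would make `gpio_set_pin(port, 5)` touch the same address as the raw write above.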

Notice how each layer only talks to the one directly below it. The application never writes to a register directly. The HAL never makes application-level decisions. This discipline is what makes firmware maintainable as projects grow to 50,000+ lines.

Layer	Example Function	Touches Hardware?
Application	`read_temperature()`	No
Middleware	`filter_samples(buf, len)`	No
Driver	`i2c_read(addr, reg, data, len)`	Via HAL
HAL	`HAL_I2C_Mem_Read(...)`	Yes (registers)
Hardware	I2C peripheral silicon	IS hardware
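
The layer boundaries in the table can be sketched in a few lines of C. This is only an illustration under assumed names: `i2c_hal_t`, `temp_sensor_t`, `TEMP_REG`, and `fan_should_run` are hypothetical, and the fake HAL function stands in for a real vendor call so the example runs on a PC.

```c
#include <stdint.h>

/* HAL boundary: the only layer that would touch registers.
 * (Hypothetical interface; a real port would wrap the vendor HAL here.) */
typedef struct {
    int (*mem_read)(uint8_t dev_addr, uint8_t reg, uint8_t *data, uint16_t len);
} i2c_hal_t;

/* Driver layer: knows the sensor's register map, not the MCU. */
typedef struct {
    const i2c_hal_t *hal;
    uint8_t dev_addr;
} temp_sensor_t;

#define TEMP_REG 0x00  /* hypothetical register holding a 16-bit reading */

/* Returns 0 on success and stores the big-endian reading in *raw. */
int temp_sensor_read_raw(const temp_sensor_t *s, int16_t *raw) {
    uint8_t buf[2];
    if (s->hal->mem_read(s->dev_addr, TEMP_REG, buf, 2) != 0) {
        return -1;
    }
    *raw = (int16_t)(((uint16_t)buf[0] << 8) | buf[1]);
    return 0;
}

/* Application layer: a decision, with no knowledge of buses or registers. */
int fan_should_run(const temp_sensor_t *s, int16_t threshold_raw) {
    int16_t raw;
    if (temp_sensor_read_raw(s, &raw) != 0) {
        return 1;  /* fail safe: run the fan if the sensor is unreachable */
    }
    return raw > threshold_raw;
}

/* Test double standing in for the HAL on a PC: always returns 0x0190 (400). */
int fake_mem_read(uint8_t dev_addr, uint8_t reg, uint8_t *data, uint16_t len) {
    (void)dev_addr;
    (void)reg;
    if (len != 2) {
        return -1;
    }
    data[0] = 0x01;
    data[1] = 0x90;
    return 0;
}
```

Swapping MCUs means supplying a different `mem_read` implementation; the driver and application functions compile unchanged.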

Why do embedded engineers organize code into layers (HAL, Drivers, Middleware, Application)?

• To make the code run faster
• To use less Flash memory
• So you can swap MCU hardware without rewriting application code
• Because the compiler requires it

Chapter 1: Startup Code & Initialization

You write int main(void) { ... } and assume it runs. But what happens before main? On your laptop, the OS loads your binary, sets up the stack, initializes libc, and jumps to main. On a microcontroller, there is no OS to do this. A small piece of assembly code — the startup code — does it instead.

When you press reset (or power on), the Cortex-M CPU loads two words from the start of the vector table and then starts executing:

1. Load SP from address 0x00000000

The first 4 bytes of Flash contain the initial stack pointer value

↓

2. Load PC from address 0x00000004

The next 4 bytes contain the address of Reset_Handler

↓

3. Execute Reset_Handler

Startup assembly code begins running

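This is the vector table: a table of addresses stored at the very beginning of Flash memory. (On STM32 parts booting from Flash, the contents of 0x08000000 are also visible at address 0x00000000, which is where the CPU looks.) The first entry is the initial stack pointer. The second is the reset handler address. Entries 3 through 16 are the core exception handlers (NMI, HardFault, SysTick, and so on), and the entries after those are peripheral interrupt handlers (we'll use those later).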

c
// Vector table (simplified) — lives at address 0x08000000 in Flash
uint32_t vectors[] __attribute__((section(".isr_vector"))) = {
    (uint32_t)&_estack,        // 0x00: Initial Stack Pointer (top of SRAM)
    (uint32_t)Reset_Handler,   // 0x04: Reset handler address
    (uint32_t)NMI_Handler,     // 0x08: Non-maskable interrupt
    (uint32_t)HardFault_Handler,// 0x0C: Hard fault
    // ... more interrupt vectors ...
};

The Reset_Handler is startup assembly code that performs these steps in order:

Boot sequence (Cortex-M):
1. Set stack pointer (already done by hardware from vector table)
2. Copy .data section from Flash to SRAM (initialized global variables)
3. Zero the .bss section in SRAM (uninitialized globals = 0)
4. Call SystemInit() — configure clock system
5. Call __libc_init_array() — C++ constructors (if any)
6. Call main()

Why copy .data from Flash? Because global variables like int counter = 42; need their initial value (42) stored somewhere permanent (Flash), but they live in SRAM at runtime so they can be modified. The startup code copies these initial values from Flash to SRAM.

Why zero .bss? The C standard guarantees that uninitialized globals start at zero. The startup code enforces this by memset-ing the .bss region to 0x00.
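
The two memory steps are small enough to express in C, which can be easier to follow than assembly. The sketch below is an assumption-labeled illustration: `startup_init_memory` is a hypothetical name, and on a real target the five pointers would come from linker-script symbols rather than function arguments.

```c
#include <stdint.h>

/* Copy initialized data from its load address (Flash) to its run address
 * (SRAM), then zero the .bss range. Each range is [start, end). */
void startup_init_memory(const uint32_t *load_src,
                         uint32_t *data_start, uint32_t *data_end,
                         uint32_t *bss_start, uint32_t *bss_end) {
    for (uint32_t *dst = data_start; dst < data_end; ) {
        *dst++ = *load_src++;   /* word-by-word copy of initial values */
    }
    for (uint32_t *dst = bss_start; dst < bss_end; ) {
        *dst++ = 0u;            /* uninitialized globals must read as zero */
    }
}
```

Because it takes plain pointers, the same routine can be exercised on a PC with two small arrays standing in for Flash and SRAM.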

arm-asm
/* Simplified Reset_Handler (ARM Cortex-M) */
Reset_Handler:
    /* Copy .data from Flash to SRAM */
    ldr  r0, =_sdata      @ destination start (SRAM)
    ldr  r1, =_edata      @ destination end
    ldr  r2, =_sidata     @ source (Flash)
copy_loop:
    cmp  r0, r1
    bhs  zero_bss         @ unsigned compare: addresses are not signed values
    ldr  r3, [r2], #4    @ load word from Flash, advance
    str  r3, [r0], #4    @ store word to SRAM, advance
    b    copy_loop

zero_bss:
    /* Zero .bss in SRAM */
    ldr  r0, =_sbss       @ bss start
    ldr  r1, =_ebss       @ bss end
    movs r2, #0
bss_loop:
    cmp  r0, r1
    bhs  call_main        @ unsigned compare, as in the copy loop
    str  r2, [r0], #4    @ store zero, advance
    b    bss_loop

call_main:
    bl   SystemInit        @ configure clocks
    bl   main              @ jump to your code!
    b    .                 @ infinite loop if main returns

Key insight: If main() ever returns on a microcontroller, there's nothing to return TO. No OS to catch you. The startup code has an infinite loop after bl main as a safety net. In practice, your main() should contain while(1) { ... } and never return.

Here is the memory layout at boot:

Memory Region	Address Range (STM32F4)	Contents
Flash	0x0800 0000 – 0x080F FFFF	Code + .data init values + vector table
SRAM	0x2000 0000 – 0x2001 FFFF	.data (copied) + .bss (zeroed) + heap + stack
Peripherals	0x4000 0000 – 0x5FFF FFFF	Memory-mapped registers
Cortex-M Core	0xE000 0000 – 0xE00F FFFF	NVIC, SysTick, debug registers

What is stored at address 0x00000000 in the vector table of a Cortex-M MCU?

• The initial Stack Pointer value
• The address of main()
• The Reset_Handler address
• The first instruction to execute

Chapter 2: Clock Configuration

Every digital circuit needs a clock — a periodic signal that tells transistors when to evaluate their inputs. The MCU's clock determines how fast instructions execute, how fast peripherals run, and how much power is consumed. A faster clock = faster execution but more power. Embedded systems carefully configure clocks to balance speed and battery life.

A typical STM32 has multiple clock sources:

Source	Frequency	Accuracy	Use Case
HSI (High-Speed Internal)	8-16 MHz	±1-2%	Fast boot, no external parts
HSE (High-Speed External)	4-26 MHz crystal	±20 ppm	Precise timing, USB, radio
LSI (Low-Speed Internal)	32 kHz	±5%	Watchdog timer
LSE (Low-Speed External)	32.768 kHz crystal	±20 ppm	RTC (real-time clock)
PLL (Phase-Locked Loop)	Up to 480 MHz	Derived from HSI/HSE	Maximum CPU speed

The PLL is the key. It takes a reference clock (HSI or HSE) and multiplies it up to a much higher frequency. The formula:

f_VCO = f_input × (PLL_N / PLL_M)

f_SYSCLK = f_VCO / PLL_P

The clock tree distributes SYSCLK to all subsystems via prescalers (dividers):

HSE (8 MHz crystal)

Precise external source

↓ ÷ PLL_M

PLL Input (1-2 MHz)

Must be 1-2 MHz for stability

↓ × PLL_N

VCO (100-432 MHz)

Internal oscillator

↓ ÷ PLL_P

SYSCLK (up to 168 MHz)

CPU clock

↓ prescalers

AHB → APB1 (42 MHz) / APB2 (84 MHz)

Peripheral bus clocks

Worked example: Configure STM32F4 for 168 MHz from an 8 MHz crystal:
• PLL_M = 8 → PLL input = 8/8 = 1 MHz
• PLL_N = 336 → VCO = 1 × 336 = 336 MHz
• PLL_P = 2 → SYSCLK = 336/2 = 168 MHz
• AHB prescaler = 1 → HCLK = 168 MHz
• APB1 prescaler = 4 → APB1 = 42 MHz
• APB2 prescaler = 2 → APB2 = 84 MHz
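
The same arithmetic can be checked in code before it is ever written to a register. The helper below is a sketch with hypothetical names (`pll_result_t`, `pll_compute`); it applies the two formulas and the limits quoted in the clock tree above (PLL input 1-2 MHz, VCO 100-432 MHz, SYSCLK up to 168 MHz).

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t pll_in_hz;  /* f_input / PLL_M */
    uint32_t vco_hz;     /* pll_in * PLL_N */
    uint32_t sysclk_hz;  /* vco / PLL_P */
} pll_result_t;

/* Computes each stage and returns true only if all stages are in range.
 * *out is written only on success. Integer division truncates, so choose
 * an f_input that PLL_M divides exactly. */
bool pll_compute(uint32_t f_input_hz, uint32_t m, uint32_t n, uint32_t p,
                 pll_result_t *out) {
    if (m == 0u || p == 0u) {
        return false;
    }
    uint64_t pll_in = f_input_hz / m;
    uint64_t vco    = pll_in * n;      /* 64-bit: cannot overflow */
    uint64_t sysclk = vco / p;

    bool ok = pll_in >= 1000000u   && pll_in <= 2000000u
           && vco    >= 100000000u && vco    <= 432000000u
           && sysclk <= 168000000u;
    if (ok) {
        out->pll_in_hz = (uint32_t)pll_in;
        out->vco_hz    = (uint32_t)vco;
        out->sysclk_hz = (uint32_t)sysclk;
    }
    return ok;
}
```

Calling `pll_compute(8000000u, 8u, 336u, 2u, &r)` reproduces the worked example: a 1 MHz PLL input, a 336 MHz VCO, and a 168 MHz SYSCLK.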

c
// STM32F4 clock configuration for 168 MHz
void SystemClock_Config(void) {
    // Enable HSE
    RCC->CR |= RCC_CR_HSEON;
    while (!(RCC->CR & RCC_CR_HSERDY));  // Wait for HSE ready

    // Configure PLL: source=HSE, M=8, N=336, P=2, Q=7
    RCC->PLLCFGR = RCC_PLLCFGR_PLLSRC_HSE
                 | (8  << RCC_PLLCFGR_PLLM_Pos)   // M = 8
                 | (336 << RCC_PLLCFGR_PLLN_Pos)   // N = 336
                 | (0  << RCC_PLLCFGR_PLLP_Pos)   // P = 2 (0 = /2)
                 | (7  << RCC_PLLCFGR_PLLQ_Pos);  // Q = 7 (for USB)

    // Enable PLL
    RCC->CR |= RCC_CR_PLLON;
    while (!(RCC->CR & RCC_CR_PLLRDY));  // Wait for PLL lock

    // Configure Flash latency (5 wait states for 168 MHz)
    FLASH->ACR = FLASH_ACR_LATENCY_5WS | FLASH_ACR_PRFTEN | FLASH_ACR_ICEN;

    // Set bus prescalers: AHB=/1, APB1=/4, APB2=/2
    RCC->CFGR = RCC_CFGR_HPRE_DIV1
              | RCC_CFGR_PPRE1_DIV4
              | RCC_CFGR_PPRE2_DIV2
              | RCC_CFGR_SW_PLL;         // Switch SYSCLK to PLL

    while ((RCC->CFGR & RCC_CFGR_SWS) != RCC_CFGR_SWS_PLL); // Wait for switch
}

Flash wait states: Flash memory is slower than the CPU. At 168 MHz, the CPU needs data every 6 ns, but Flash takes ~30 ns. Solution: insert 5 "wait states" (the CPU pauses 5 extra cycles per Flash read) and enable prefetch + instruction cache to hide the latency.
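
As a rough sketch, the wait-state count can be derived from the bus clock instead of hard-coded. The function name `flash_wait_states` is hypothetical, and the 30 MHz step is an assumption that matches the STM32F4 at a 2.7-3.6 V supply; lower supply voltages and other families use different steps, so check the reference manual for the actual table.

```c
#include <stdint.h>

/* Wait states needed for a given HCLK, assuming one extra wait state per
 * 30 MHz (STM32F4 figure for a 2.7-3.6 V supply; an assumption here). */
uint32_t flash_wait_states(uint32_t hclk_hz) {
    const uint32_t step_hz = 30000000u;
    if (hclk_hz == 0u) {
        return 0u;
    }
    return (hclk_hz - 1u) / step_hz;  /* 0..30 MHz -> 0, 30..60 MHz -> 1, ... */
}
```

For 168 MHz this yields 5, the value used in the configuration code above.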

If HSE = 8 MHz, PLL_M = 4, PLL_N = 168, PLL_P = 2, what is SYSCLK?

• 84 MHz
• 168 MHz
• 336 MHz
• 42 MHz

Chapter 3: DMA — Direct Memory Access

Imagine you're receiving 1024 bytes over SPI from a sensor. Without DMA, the CPU must execute a loop: read one byte from the SPI data register, store it in a buffer, repeat 1024 times. The CPU is fully occupied doing glorified copying — it can't process data, update displays, or handle other events during this time.

DMA (Direct Memory Access) is a hardware peripheral that copies data between memory and peripherals without CPU involvement. You tell the DMA controller: "Copy 1024 bytes from address X to address Y, then interrupt me when done." The CPU is free to do other work while the DMA engine handles the transfer in the background.

The analogy: Without DMA, you (the CPU) are a delivery driver carrying packages one at a time from the warehouse (peripheral) to the store (memory). With DMA, you hire a conveyor belt (DMA controller) to move packages automatically while you go do useful work. When the belt finishes, it rings a bell (interrupt).

DMA has several modes:

Mode	Description	Use Case
Normal	Transfer N items, stop, interrupt	One-shot sensor reads
Circular	Transfer N items, auto-restart from beginning	Continuous ADC sampling
Double-buffer	Alternate between two buffers automatically	Audio streaming (process buf A while filling buf B)

Transfer Direction	Source	Destination	Example
Peripheral → Memory	ADC_DR, SPI_DR	SRAM buffer	ADC samples to buffer
Memory → Peripheral	SRAM buffer	SPI_DR, DAC_DHR	Display framebuffer to SPI
Memory → Memory	SRAM region A	SRAM region B	Fast memcpy in hardware

c
// Configure DMA2 Stream0 to transfer 1024 ADC samples to SRAM buffer
uint16_t adc_buffer[1024];

void DMA_ADC_Init(void) {
    // Enable DMA2 clock
    RCC->AHB1ENR |= RCC_AHB1ENR_DMA2EN;

    // Disable stream before configuring
    DMA2_Stream0->CR &= ~DMA_SxCR_EN;
    while (DMA2_Stream0->CR & DMA_SxCR_EN); // Wait until disabled

    // Configure stream
    DMA2_Stream0->CR = (0 << DMA_SxCR_CHSEL_Pos)  // Channel 0 (ADC1)
                     | DMA_SxCR_MSIZE_0           // Memory size: 16-bit
                     | DMA_SxCR_PSIZE_0           // Peripheral size: 16-bit
                     | DMA_SxCR_MINC              // Memory address increment
                     | DMA_SxCR_CIRC              // Circular mode
                     | DMA_SxCR_TCIE;             // Transfer complete interrupt

    // Set addresses and count
    DMA2_Stream0->PAR  = (uint32_t)&ADC1->DR;       // Source: ADC data reg
    DMA2_Stream0->M0AR = (uint32_t)adc_buffer;      // Dest: our buffer
    DMA2_Stream0->NDTR = 1024;                     // Number of transfers

    // Enable stream
    DMA2_Stream0->CR |= DMA_SxCR_EN;
}

Critical detail — MINC vs PINC: We set MINC (memory increment) so each transfer goes to the next buffer slot. We do NOT set PINC (peripheral increment) because we always read from the same ADC data register address. Getting this wrong = data all in one slot, or reading garbage addresses.
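
The effect of the two increment bits is easy to see in a small software model. This is not how the hardware is programmed; `dma_model_transfer` is a hypothetical PC-side function whose `pinc` and `minc` arguments mimic the PINC and MINC bits.

```c
#include <stddef.h>
#include <stdint.h>

/* Software model of a peripheral-to-memory DMA stream moving 16-bit items.
 * When pinc/minc is nonzero, that side's address advances after each item;
 * when zero, every item uses the same address. */
void dma_model_transfer(const uint16_t *periph, uint16_t *mem, size_t count,
                        int pinc, int minc) {
    size_t p = 0;
    size_t m = 0;
    for (size_t i = 0; i < count; i++) {
        mem[m] = periph[p];
        if (pinc) {
            p++;
        }
        if (minc) {
            m++;
        }
    }
}
```

With `minc` set to 0, every item lands in `mem[0]` and only the last one survives, which is the "data all in one slot" failure described above.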

The performance difference is dramatic. For 1024 16-bit ADC samples at 84 MHz APB2:

Method	CPU Cycles Used	CPU Availability
Polling (no DMA)	~10,240 cycles	0% during transfer
Interrupt per sample	~30 cycles × 1024 = 30,720	Intermittent (context switch overhead)
DMA	~50 cycles (setup + ISR)	99.5% during transfer

In circular DMA mode, what happens when the transfer counter reaches zero?

• The DMA stream disables itself
• It automatically reloads the counter and restarts from the beginning of the buffer
• It triggers a hard fault
• The CPU must manually restart it

Chapter 4: Advanced Peripheral Interfacing

Now we combine everything: DMA, interrupts, and driver architecture to build real peripheral interfaces. Every production driver follows the same pattern: init → configure → start → ISR → callback. Let's build three complete drivers.

The driver pattern:
1. Init: Enable clock, configure GPIO pins, set peripheral registers
2. Configure: Set speed, mode, DMA/interrupt enables
3. Start: Enable peripheral, begin transfer
4. ISR: Hardware calls your interrupt handler when events occur
5. Callback: ISR calls application code (transfer complete, error, etc.)
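
The five steps map onto a small skeleton. The sketch below uses hypothetical names throughout (`demo_driver_t`, `demo_start`, `count_done`, and so on) and contains no register access, so the ISR is just a function that a test can call directly; in real firmware the hardware would invoke it.

```c
#include <stddef.h>
#include <stdint.h>

typedef void (*xfer_done_cb_t)(void *ctx, int status);

typedef struct {
    uint32_t       clock_div;  /* chosen in configure */
    const uint8_t *data;       /* current transfer */
    size_t         len;
    volatile int   busy;       /* set in start, cleared from the ISR */
    xfer_done_cb_t on_done;    /* application callback */
    void          *cb_ctx;
} demo_driver_t;

/* 1. Init: put the driver in a known idle state. */
void demo_init(demo_driver_t *d) {
    d->clock_div = 0;
    d->data = NULL;
    d->len = 0;
    d->busy = 0;
    d->on_done = NULL;
    d->cb_ctx = NULL;
}

/* 2. Configure: speed and completion callback. */
void demo_configure(demo_driver_t *d, uint32_t clock_div,
                    xfer_done_cb_t on_done, void *cb_ctx) {
    d->clock_div = clock_div;
    d->on_done = on_done;
    d->cb_ctx = cb_ctx;
}

/* 3. Start: returns -1 if a transfer is already in flight. */
int demo_start(demo_driver_t *d, const uint8_t *data, size_t len) {
    if (d->busy) {
        return -1;
    }
    d->data = data;
    d->len = len;
    d->busy = 1;
    return 0;
}

/* 4. ISR: mark the driver idle, then 5. hand off to the application. */
void demo_transfer_complete_isr(demo_driver_t *d) {
    d->busy = 0;
    if (d->on_done != NULL) {
        d->on_done(d->cb_ctx, 0);
    }
}

/* Example application callback: counts successful transfers. */
void count_done(void *ctx, int status) {
    if (status == 0) {
        (*(int *)ctx)++;
    }
}
```

Keeping the callback short matters because it runs in interrupt context; a common choice is to set a flag or post to a queue and do the real work in the main loop.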

SPI with DMA — OLED Display Driver

SPI (Serial Peripheral Interface) is a high-speed (up to 50 MHz) full-duplex bus with 4 wires: MOSI (data out), MISO (data in), SCK (clock), CS (chip select). Perfect for displays because it's fast and you're usually only sending data (display doesn't talk back much).

c
// SPI1 driver for SSD1306 OLED (128x64, monochrome)
// Pins: PA5=SCK, PA7=MOSI, PA4=CS, PA3=DC (data/command), PA2=RST

#define SSD1306_WIDTH   128
#define SSD1306_HEIGHT  64
#define SSD1306_BUFSIZE (SSD1306_WIDTH * SSD1306_HEIGHT / 8)  // 1024 bytes

static uint8_t framebuffer[SSD1306_BUFSIZE];
static volatile uint8_t dma_busy = 0;

void SPI1_Init(void) {
    // Enable clocks
    RCC->AHB1ENR |= RCC_AHB1ENR_GPIOAEN;
    RCC->APB2ENR |= RCC_APB2ENR_SPI1EN;
    RCC->AHB1ENR |= RCC_AHB1ENR_DMA2EN;

    // Configure PA5, PA7 as AF5 (SPI1), PA4/PA3/PA2 as output
    GPIOA->MODER |= (2 << 10) | (2 << 14)   // PA5, PA7 = alternate function
                  | (1 << 8) | (1 << 6) | (1 << 4); // PA4,3,2 = output
    GPIOA->AFR[0] |= (5 << 20) | (5 << 28);   // AF5 for PA5, PA7

    // SPI config: master, 8-bit, CPOL=0, CPHA=0, baud=/16 (5.25 MHz)
    // (the SSD1306 datasheet specifies a 100 ns minimum clock cycle, i.e. 10 MHz max)
    SPI1->CR1 = SPI_CR1_MSTR | SPI_CR1_SSM | SPI_CR1_SSI
              | (3 << SPI_CR1_BR_Pos);  // Baud = APB2/16 = 84/16 = 5.25 MHz
    SPI1->CR2 = SPI_CR2_TXDMAEN;        // Enable DMA for TX
    SPI1->CR1 |= SPI_CR1_SPE;           // Enable SPI
    NVIC_EnableIRQ(DMA2_Stream3_IRQn);  // Without this the transfer-complete ISR never runs
}

void SSD1306_Flush(void) {
    while (dma_busy);  // Wait for previous transfer
    dma_busy = 1;

    GPIOA->BSRR = (1 << 3);   // DC pin HIGH = data mode
    GPIOA->BSRR = (1 << 20);  // CS pin LOW (active)

    // Configure DMA2 Stream3 Ch3 for SPI1_TX
    DMA2_Stream3->CR &= ~DMA_SxCR_EN;
    DMA2_Stream3->CR = (3 << DMA_SxCR_CHSEL_Pos)
                     | DMA_SxCR_MINC | DMA_SxCR_DIR_0  // Mem-to-peripheral
                     | DMA_SxCR_TCIE;                    // Transfer complete IRQ
    DMA2_Stream3->PAR  = (uint32_t)&SPI1->DR;
    DMA2_Stream3->M0AR = (uint32_t)framebuffer;
    DMA2_Stream3->NDTR = SSD1306_BUFSIZE;
    DMA2_Stream3->CR  |= DMA_SxCR_EN;  // GO!
}

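// DMA transfer complete ISR
void DMA2_Stream3_IRQHandler(void) {
    DMA2->LIFCR = DMA_LIFCR_CTCIF3;   // Clear interrupt flag
    // DMA is finished once the last byte is written to SPI1->DR, but that
    // byte is still shifting out: wait for the bus to go idle before deselecting
    while (!(SPI1->SR & SPI_SR_TXE));  // TX buffer empty
    while (SPI1->SR & SPI_SR_BSY);     // Last byte fully clocked out
    GPIOA->BSRR = (1 << 4);            // CS HIGH (deselect)
    dma_busy = 0;                       // Signal completion
}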

I2C with Interrupts — Sensor Polling

I2C (Inter-Integrated Circuit) is a 2-wire bus (SDA + SCL) that supports multiple devices on one bus, each with a unique 7-bit address. Slower than SPI (100-400 kHz typically) but uses fewer pins. Perfect for sensors.

c
// I2C1 interrupt-driven read from BME280 sensor (address 0x76)
#define BME280_ADDR  (0x76 << 1)  // 7-bit addr shifted left for R/W bit

static volatile uint8_t i2c_buf[8];
static volatile uint8_t i2c_idx = 0;
static volatile uint8_t i2c_len = 0;   // Bytes requested by the current read
static volatile uint8_t i2c_done = 0;

// Read len bytes (2..8) starting at reg. Assumes I2C1 is already initialized
// and the I2C1_EV interrupt is enabled in the NVIC.
void I2C1_Read_IT(uint8_t reg, uint8_t len) {
    i2c_idx = 0; i2c_done = 0; i2c_len = len;

    // Send register address (write phase)
    I2C1->CR1 |= I2C_CR1_START;          // Generate START
    while (!(I2C1->SR1 & I2C_SR1_SB));    // Wait for START sent
    I2C1->DR = BME280_ADDR | 0;           // Address + Write
    while (!(I2C1->SR1 & I2C_SR1_ADDR));  // Wait for ACK
    (void)I2C1->SR2;                       // Clear ADDR flag
    I2C1->DR = reg;                        // Send register address
    while (!(I2C1->SR1 & I2C_SR1_BTF));   // Wait for byte transferred

    // Restart for read phase (interrupt-driven)
    I2C1->CR1 |= I2C_CR1_START | I2C_CR1_ACK;
    I2C1->CR2 |= I2C_CR2_ITBUFEN | I2C_CR2_ITEVTEN;  // Enable interrupts
}

void I2C1_EV_IRQHandler(void) {
    uint32_t sr1 = I2C1->SR1;

    if (sr1 & I2C_SR1_SB) {
        I2C1->DR = BME280_ADDR | 1;       // Repeated START sent: address + Read
    } else if (sr1 & I2C_SR1_ADDR) {
        (void)I2C1->SR2;                  // Clear ADDR flag, reception begins
    } else if (sr1 & I2C_SR1_RXNE) {
        i2c_buf[i2c_idx++] = I2C1->DR;
        if (i2c_idx == i2c_len - 1) {
            // One byte left: NACK it and queue the STOP before it arrives
            I2C1->CR1 &= ~I2C_CR1_ACK;
            I2C1->CR1 |= I2C_CR1_STOP;
        } else if (i2c_idx >= i2c_len) {  // e.g. 6 bytes for BME280 temp+pressure
            I2C1->CR2 &= ~(I2C_CR2_ITBUFEN | I2C_CR2_ITEVTEN);
            i2c_done = 1;
        }
    }
}

UART with Ring Buffer — GPS NMEA Parsing

UART (Universal Asynchronous Receiver/Transmitter) is the classic serial port. No clock wire — both sides agree on baud rate (e.g., 9600 or 115200 bits/sec). GPS modules output NMEA sentences like $GPGGA,123519,4807.038,N,01131.000,E,1,08,0.9,545.4,M,...

c
// Ring buffer UART RX for GPS NMEA at 9600 baud
#define RX_BUF_SIZE 256  // Must be power of 2
static volatile uint8_t rx_buf[RX_BUF_SIZE];
static volatile uint16_t rx_head = 0;  // ISR writes here
static volatile uint16_t rx_tail = 0;  // App reads from here

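void USART2_IRQHandler(void) {
    if (USART2->SR & USART_SR_RXNE) {
        uint8_t byte = USART2->DR;                         // Reading DR clears RXNE
        if ((uint16_t)(rx_head - rx_tail) < RX_BUF_SIZE) { // Drop the byte if the buffer is full
            rx_buf[rx_head & (RX_BUF_SIZE - 1)] = byte;
            rx_head++;
        }
    }
}

uint16_t UART_Available(void) {
    // head and tail are free-running counters, so their difference is the
    // number of stored bytes (masking it would report 0 for a full buffer)
    return (uint16_t)(rx_head - rx_tail);
}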

uint8_t UART_ReadByte(void) {
    while (rx_head == rx_tail);  // Block until data available
    uint8_t c = rx_buf[rx_tail & (RX_BUF_SIZE - 1)];
    rx_tail++;
    return c;
}

Why power-of-2 buffer size? The mask (RX_BUF_SIZE - 1) replaces expensive modulo with a fast bitwise AND. For size 256: idx & 0xFF wraps around automatically. This matters at 115200 baud where the ISR fires every 87 microseconds.
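
Once bytes are coming out of the ring buffer, the first parsing step is usually to validate each sentence. NMEA sentences end in `*HH`, where HH is the hexadecimal XOR of every character between `$` and `*`. The function below is a minimal sketch of that check; `nmea_checksum_ok` and `hex_value` are hypothetical names, and the caller is assumed to have already assembled one NUL-terminated sentence without the trailing CR/LF.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Value of one hex digit, or -1 if the character is not a hex digit. */
static int hex_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* True if s has the form "$<body>*HH" and HH equals the XOR of <body>. */
bool nmea_checksum_ok(const char *s) {
    if (s == NULL || s[0] != '$') {
        return false;
    }
    uint8_t sum = 0;
    size_t i = 1;
    while (s[i] != '\0' && s[i] != '*') {
        sum ^= (uint8_t)s[i];
        i++;
    }
    if (s[i] != '*') {
        return false;              /* no checksum field */
    }
    int hi = hex_value(s[i + 1]);  /* a NUL here yields -1, so we stop early */
    if (hi < 0) {
        return false;
    }
    int lo = hex_value(s[i + 2]);
    if (lo < 0) {
        return false;
    }
    return sum == (uint8_t)((hi << 4) | lo);
}
```

Rejecting sentences that fail this check before splitting fields keeps line noise and buffer overruns from turning into bogus coordinates.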

In the SPI OLED driver, why do we use DMA for the framebuffer transfer instead of polling?

• SPI doesn't support polling mode
• DMA uses less Flash memory
• DMA frees the CPU to compute the next frame while the current one transfers
• The OLED requires DMA protocol

Chapter 5: Power Management

A battery-powered embedded device that runs at full speed all the time is a dead device in hours. A 240 mAh coin cell at 50 mA lasts 4.8 hours. But the same coin cell at 2 μA lasts 13.7 years. Power management isn't optimization — it's the difference between a viable product and an expensive paperweight.

Cortex-M MCUs provide progressively deeper sleep modes, each trading more functionality for less power:

Mode	What's Off	What's On	Wake Source	Typical Current
Run	Nothing	Everything	N/A (already running)	30-100 mA
Sleep	CPU core	All peripherals, SRAM, clocks	Any interrupt	5-15 mA
Stop	CPU, HSE, PLL, most peripherals	SRAM content, LSI, RTC, wake pins	EXTI line, RTC alarm	10-30 μA
Standby	Everything (SRAM lost!)	Backup domain, wake pin logic	WKUP pin, RTC, IWDG	1-3 μA

Critical tradeoff: Stop mode preserves SRAM (your variables survive), so wakeup is fast — reconfigure clocks and resume. Standby mode erases SRAM, so wakeup is a full reboot (vector table, copy .data, zero .bss, the whole Chapter 1 sequence). Choose based on how much state you need to preserve.

c
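// Enter Stop mode, wake via the RTC wakeup timer after wake_seconds (1..65536)
void Enter_Stop_Mode(uint32_t wake_seconds) {
    // Configure RTC wakeup timer
    RTC->WPR = 0xCA;  RTC->WPR = 0x53;  // Unlock RTC write protection
    RTC->CR &= ~RTC_CR_WUTE;                // Disable wakeup timer
    while (!(RTC->ISR & RTC_ISR_WUTWF));    // Wait for access
    RTC->CR = (RTC->CR & ~RTC_CR_WUCKSEL) | RTC_CR_WUCKSEL_2; // Clock source = ck_spre (1 Hz)
    RTC->WUTR = wake_seconds - 1;           // Set countdown (1 Hz clock)
    RTC->ISR &= ~RTC_ISR_WUTF;              // Clear any stale wakeup flag
    RTC->CR |= RTC_CR_WUTE | RTC_CR_WUTIE; // Enable timer + interrupt
    RTC->WPR = 0xFF;                        // Re-lock write protection

    // Configure EXTI line 22 (RTC wakeup) for rising edge
    EXTI->PR    = (1 << 22);                // Clear any pending wakeup event
    EXTI->IMR  |= (1 << 22);
    EXTI->RTSR |= (1 << 22);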
    // Enter Stop mode
    SCB->SCR |= SCB_SCR_SLEEPDEEP_Msk;  // Deep sleep (not regular sleep)
    PWR->CR  |= PWR_CR_LPDS;            // Low-power voltage regulator in Stop
    __WFI();                              // Wait For Interrupt (CPU stops here)

    // === Execution resumes here after wakeup ===
    SCB->SCR &= ~SCB_SCR_SLEEPDEEP_Msk; // Clear deep sleep bit
    SystemClock_Config();                 // Reconfigure clocks (PLL was off)
}

After Stop mode wakeup: The CPU resumes from the __WFI() instruction. But the HSE and PLL are OFF — the system is running on the slow HSI (16 MHz). You MUST call your clock configuration function again to get back to full speed. Forgetting this = your code runs 10x slower than expected after wakeup.

Battery life calculation for a duty-cycled IoT sensor:

I_avg = (t_active × I_active + t_sleep × I_sleep) / (t_active + t_sleep)

Life = C_battery / I_avg

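Example: Active 2 seconds at 50 mA, sleep 298 seconds at 2 μA (a Stop-mode figure for an ultra-low-power part such as the STM32L4 used in Chapter 8; the table above lists 10-30 μA as more typical):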

I_avg = (2 × 50 + 298 × 0.002) / 300 = 100.596 / 300 = 0.335 mA

Life = 240 mAh / 0.335 mA = 716 hours = 30 days (coin cell)

Life = 3000 mAh / 0.335 mA = 8,955 hours = 373 days (18650 Li-ion)
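
The two formulas translate directly into a pair of helper functions, which is convenient for comparing duty cycles during design. The names `avg_current_ma` and `battery_life_hours` are hypothetical, and the model is deliberately ideal: it ignores self-discharge, regulator losses, and capacity derating.

```c
/* Time-weighted average current (mA) over one active + sleep cycle. */
double avg_current_ma(double t_active_s, double i_active_ma,
                      double t_sleep_s, double i_sleep_ma) {
    return (t_active_s * i_active_ma + t_sleep_s * i_sleep_ma)
         / (t_active_s + t_sleep_s);
}

/* Ideal battery life in hours for a given capacity and average current. */
double battery_life_hours(double capacity_mah, double i_avg_ma) {
    return capacity_mah / i_avg_ma;
}
```

Plugging in the example values (2 s at 50 mA, 298 s at 0.002 mA) gives an average of about 0.335 mA, in line with the hand calculation above.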

After waking from Stop mode, what must you do before your code runs at full speed?

• Nothing, it auto-resumes at full speed
• Reconfigure the clock system (HSE + PLL were off)
• Reload the firmware from Flash
• Re-initialize SRAM contents

Chapter 6: Software Optimization

On a desktop, you optimize for developer time. On an MCU with 256 KB Flash, 64 KB RAM, and no FPU, you optimize for everything: cycles, bytes, and watts. A function that takes 100 cycles instead of 10 doesn't just run slower — it drains 10x more battery during execution. Optimization here is a survival skill.

The three constraints of embedded:
1. Cycles: How many clock ticks does your code take? (speed)
2. Flash: How many bytes of compiled code? (storage)
3. RAM: How many bytes of data at runtime? (memory)
You can rarely optimize all three simultaneously. Trading one for another is the art.

Technique 1: Fixed-Point Math

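Cortex-M0, M0+, and M3 cores have no floating-point unit (an FPU first appears as an option on the Cortex-M4). On those parts a single float multiply compiles to a library call costing tens of cycles. The same operation in fixed-point (integers that represent fractions) is an integer multiply plus a shift, typically a few cycles.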

c
// Q16.16 fixed-point: upper 16 bits = integer, lower 16 bits = fraction
typedef int32_t fixed_t;

#define FIXED_SHIFT  16
#define FLOAT_TO_FIXED(f)  ((fixed_t)((f) * (1 << FIXED_SHIFT)))
#define FIXED_TO_FLOAT(x)  ((float)(x) / (1 << FIXED_SHIFT))
#define FIXED_MUL(a, b)    ((fixed_t)(((int64_t)(a) * (b)) >> FIXED_SHIFT))

// Example: compute sin(x) using fixed-point Taylor series
// sin(x) ≈ x - x³/6 + x⁵/120  (for small x in radians)
fixed_t fixed_sin(fixed_t x) {
    fixed_t x2 = FIXED_MUL(x, x);
    fixed_t x3 = FIXED_MUL(x2, x);
    fixed_t x5 = FIXED_MUL(x3, x2);
    return x - x3 / 6 + x5 / 120;  // A handful of integer ops, far cheaper than a soft-float sin()
}
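
To see the Q16.16 representation work without touching `float` at all, the sketch below converts values expressed in thousandths (1500 means 1.5) to and from fixed-point. The `q16_*` names are hypothetical; the layout is the same as `fixed_t` above. Only non-negative values are exercised here, since right-shifting a negative signed value is implementation-defined in C.

```c
#include <stdint.h>

typedef int32_t q16_16_t;               /* same layout as fixed_t above */
#define Q16_ONE ((q16_16_t)1 << 16)     /* 1.0 in Q16.16 */

/* Convert thousandths (e.g. 1500 = 1.5) to Q16.16 using integer math only. */
q16_16_t q16_from_milli(int32_t milli) {
    return (q16_16_t)(((int64_t)milli * Q16_ONE) / 1000);
}

/* Multiply two Q16.16 values; the 64-bit intermediate avoids overflow. */
q16_16_t q16_mul(q16_16_t a, q16_16_t b) {
    return (q16_16_t)(((int64_t)a * b) >> 16);
}

/* Convert Q16.16 back to thousandths, truncating any remainder. */
int32_t q16_to_milli(q16_16_t x) {
    return (int32_t)(((int64_t)x * 1000) / Q16_ONE);
}
```

For example, `q16_mul(q16_from_milli(1500), q16_from_milli(2250))` represents 1.5 × 2.25, and `q16_to_milli` of that product returns 3375.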

Technique 2: Lookup Tables

Trade Flash for cycles. Pre-compute results and store them in a const array (lives in Flash, costs 0 RAM):

c
// 256-entry sine table: sin(x) for x = 0..255 mapped to 0..2π
// Values scaled to Q1.15 (multiply result by 1/32768 to get float)
static const int16_t sin_table[256] = {
    0, 804, 1608, 2410, 3212, 4011, 4808, 5602,
    6393, 7179, 7962, 8739, 9512, 10278, 11039, 11793,
    // ... 256 entries total, generated offline ...
    32767, 32757, 32728, /* ... */
};

// Lookup: one indexed load (a few cycles with Flash wait states) vs ~200 cycles for a soft-float sin()
int16_t fast_sin(uint8_t angle) { return sin_table[angle]; }

// Cost: 512 bytes Flash. Savings: most of those ~200 cycles per call.

Technique 3: Loop Unrolling & Bit Manipulation

c
// Naive: branch per iteration (pipeline stall on Cortex-M)
for (int i = 0; i < 16; i++)
    dst[i] = src[i] * gain;

// Unrolled x4: fewer branches, better pipeline utilization
for (int i = 0; i < 16; i += 4) {
    dst[i]   = src[i]   * gain;
    dst[i+1] = src[i+1] * gain;
    dst[i+2] = src[i+2] * gain;
    dst[i+3] = src[i+3] * gain;
}

// Bit manipulation: count set bits (population count)
// Naive: one iteration per bit position up to the highest set bit (up to 32)
uint32_t popcount_naive(uint32_t x) {
    uint32_t count = 0;
    while (x) { count += x & 1; x >>= 1; }
    return count;
}
// Optimized: Kernighan's trick — only loops for SET bits
uint32_t popcount_fast(uint32_t x) {
    uint32_t count = 0;
    while (x) { x &= x - 1; count++; }  // Clears lowest set bit each iter
    return count;
}

Technique 4: Compiler Flags

Flag	Optimizes For	Effect
`-O0`	Debuggability	No optimization, 1:1 source mapping
`-O2`	Speed	Aggressive: inlining, unrolling, scheduling
`-Os`	Size (Flash)	Like -O2 but skips optimizations that increase size
`-Og`	Debug + some speed	Moderate optimization, good debugging
`-flto`	Cross-file optimization	Link-time optimization, removes unused code

The -Os rule of thumb: For most embedded projects, -Os is a sensible default. It typically produces noticeably smaller code than -O2 (20-40% is a commonly quoted range) for a modest speed penalty (often 5-10%), though the numbers vary by codebase and compiler version. When your MCU has 64 KB Flash total, those saved bytes matter. Use -O2 only for specific hot functions, for example via GCC's __attribute__((optimize("O2"))).

Why is fixed-point math preferred over float on Cortex-M0/M3 without an FPU?

• Float operations compile to slow library calls (~30-200 cycles) vs 1-cycle integer operations
• Floats use more RAM
• The C standard prohibits floats on embedded
• Fixed-point is more accurate

Chapter 7: Watchdog & Fault Handling

Embedded systems must work for years without human intervention. But bugs happen: infinite loops, null pointer dereferences, stack overflows, corrupted state. On your laptop, the OS kills the process. On an MCU with no OS, a bug means the device hangs forever — unless you've planned for failure.

The Independent Watchdog (IWDG) is a hardware timer that counts down. Your code must periodically "kick" (reset) it before it reaches zero. If your code crashes or hangs, it can't kick the watchdog, the timer expires, and the hardware forces a full system reset. It's a dead man's switch.

c
// Configure IWDG for ~4 second timeout
// IWDG runs on LSI (32 kHz), independent of system clock
void IWDG_Init(void) {
    IWDG->KR = 0x5555;     // Enable register access
    IWDG->PR = 6;           // Prescaler /256 → 32000/256 = 125 Hz
    IWDG->RLR = 500;        // Reload = 500 → timeout = 500/125 = 4 seconds
    IWDG->KR = 0xCCCC;     // Start watchdog (CANNOT be stopped once started!)
}

// Call this in your main loop — if you don't call it within 4s, reset!
void IWDG_Kick(void) {
    IWDG->KR = 0xAAAA;     // Reload counter (kick the dog)
}

// Typical usage pattern:
int main(void) {
    SystemInit();
    IWDG_Init();
    while (1) {
        read_sensors();    // If this hangs → watchdog fires → reset
        process_data();
        transmit();
        IWDG_Kick();       // "I'm still alive!"
        Enter_Stop_Mode(3);  // Stay under the 4 s timeout: the IWDG keeps counting in Stop mode
    }
}

Critical: Once started, the IWDG cannot be stopped. This is intentional — if a bug could disable the watchdog, it would defeat the purpose. The watchdog uses the LSI oscillator, which runs independently of the main clock system. Even if your PLL hangs, the watchdog still counts.
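
Choosing the prescaler and reload value is simple arithmetic, sketched below with hypothetical helper names (`iwdg_timeout_ms`, `iwdg_reload_for_ms`). It uses the same approximation as the code comments above (timeout = reload / tick rate) and assumes a 12-bit reload register; the real LSI frequency varies from part to part, so leave margin.

```c
#include <stdint.h>

/* Approximate timeout in ms: reload ticks at (lsi_hz / prescaler_div) Hz. */
uint32_t iwdg_timeout_ms(uint32_t lsi_hz, uint32_t prescaler_div,
                         uint32_t reload) {
    if (lsi_hz == 0u) {
        return 0u;
    }
    return (uint32_t)(((uint64_t)reload * prescaler_div * 1000u) / lsi_hz);
}

/* Reload value for a desired timeout, or 0 if it does not fit in 12 bits. */
uint32_t iwdg_reload_for_ms(uint32_t lsi_hz, uint32_t prescaler_div,
                            uint32_t timeout_ms) {
    if (prescaler_div == 0u) {
        return 0u;
    }
    uint64_t reload = ((uint64_t)timeout_ms * lsi_hz)
                    / ((uint64_t)prescaler_div * 1000u);
    return (reload >= 1u && reload <= 0x0FFFu) ? (uint32_t)reload : 0u;
}
```

With a 32 kHz LSI and the /256 prescaler, the longest reachable timeout is roughly 32.8 seconds, so any sleep longer than that needs the node to wake periodically and kick the watchdog.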

HardFault Handler

When the CPU hits an illegal operation — null pointer dereference, unaligned access, divide by zero, stack overflow, bus error — it triggers a HardFault exception. By default this is an infinite loop (device hangs). A good HardFault handler logs the fault information for debugging:

c
// HardFault handler that captures useful debug info
typedef struct {
    uint32_t r0, r1, r2, r3, r12, lr, pc, psr;
} StackFrame_t;

void HardFault_Handler_C(StackFrame_t *frame) {
    // frame->pc = the instruction that caused the fault
    // frame->lr = the return address (who called the faulting function)

    volatile uint32_t cfsr = SCB->CFSR;  // Configurable Fault Status Register
    volatile uint32_t hfsr = SCB->HFSR;  // HardFault Status Register
    volatile uint32_t mmfar = SCB->MMFAR; // MemManage Fault Address
    volatile uint32_t bfar = SCB->BFAR;   // Bus Fault Address

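    // Decode fault type
    if (cfsr & 0x00000001) { /* IACCVIOL: instruction access violation */ }
    if (cfsr & 0x00000002) { /* DACCVIOL: data access violation (null ptr?) */ }
    if (cfsr & 0x00000800) { /* UNSTKERR: bus fault while unstacking on exception return */ }
    if (cfsr & 0x00001000) { /* STKERR: bus fault while stacking on exception entry (stack overflow?) */ }
    if (cfsr & 0x02000000) { /* DIVBYZERO: divide by zero (only if trapping is enabled in SCB->CCR) */ }

    // Log to backup SRAM (survives reset) for post-mortem debugging
    // (assumes the backup SRAM clock and backup-domain write access were enabled at startup)
    *(volatile uint32_t*)0x40024000 = frame->pc;   // Faulting PC
    *(volatile uint32_t*)0x40024004 = cfsr;         // Fault type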

    NVIC_SystemReset();  // Reset and hope for the best
}
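
After the reset, the logged CFSR value still has to be turned into something readable. The decoder below is a sketch meant for post-mortem use (on the next boot or on a PC), not for use inside the fault handler itself. The names `cfsr_flag_t`, `cfsr_flags`, and `cfsr_describe` are hypothetical; the bit positions follow the ARMv7-M CFSR layout, and only a subset of flags is listed.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef struct {
    uint32_t    mask;
    const char *name;
} cfsr_flag_t;

/* Subset of CFSR bits: MemManage in bits 0-7, BusFault in 8-15,
 * UsageFault in 16-31. */
static const cfsr_flag_t cfsr_flags[] = {
    { 1u << 0,  "IACCVIOL"  },
    { 1u << 1,  "DACCVIOL"  },
    { 1u << 9,  "PRECISERR" },
    { 1u << 11, "UNSTKERR"  },
    { 1u << 12, "STKERR"    },
    { 1u << 24, "UNALIGNED" },
    { 1u << 25, "DIVBYZERO" },
};

/* Writes the names of all recognized set flags into out, separated by
 * spaces, and returns how many were found. Output is truncated if the
 * buffer is too small. */
unsigned cfsr_describe(uint32_t cfsr, char *out, size_t out_len) {
    unsigned found = 0;
    if (out_len == 0) {
        return 0;
    }
    out[0] = '\0';
    for (size_t i = 0; i < sizeof cfsr_flags / sizeof cfsr_flags[0]; i++) {
        if (cfsr & cfsr_flags[i].mask) {
            size_t used = strlen(out);
            snprintf(out + used, out_len - used, "%s%s",
                     found ? " " : "", cfsr_flags[i].name);
            found++;
        }
    }
    return found;
}
```

A table keeps the decoder easy to extend: adding another fault flag is one more row rather than another `if`.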

Stack overflow detection: Fill the bottom of the stack with a known pattern (e.g., 0xDEADBEEF). Periodically check if it's been overwritten. If yes, your stack has grown into the heap/bss — you need more stack space or fewer local variables. Some MCUs have hardware MPU (Memory Protection Unit) that can trap stack overflow automatically.
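
The paint-and-check technique can be sketched against an ordinary array standing in for the stack region. The names `STACK_PAINT`, `stack_paint`, and `stack_unused_words` are hypothetical. On a real target the painting would have to happen in startup code, before the region is in use, with the bounds taken from the linker script.

```c
#include <stddef.h>
#include <stdint.h>

#define STACK_PAINT 0xDEADBEEFu

/* Fill a stack region with the pattern. stack[0] is the lowest address,
 * i.e. the last word a descending stack would reach. */
void stack_paint(uint32_t *stack, size_t words) {
    for (size_t i = 0; i < words; i++) {
        stack[i] = STACK_PAINT;
    }
}

/* Count words from the bottom that still hold the pattern. This is the
 * headroom that has never been touched; 0 means the stack reached the
 * bottom of its region at some point. */
size_t stack_unused_words(const uint32_t *stack, size_t words) {
    size_t n = 0;
    while (n < words && stack[n] == STACK_PAINT) {
        n++;
    }
    return n;
}
```

Logging the headroom periodically gives a high-water mark, which is usually more useful than a yes/no overflow check because it shows how close the system has come.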

Why can't the IWDG be stopped once it's started?

• It's a hardware limitation of the LSI oscillator
• The register is read-only after initialization
• If a bug could disable it, the safety mechanism would be defeated
• It's required by the ARM specification

Chapter 8: IoT Case Study — Environmental Monitor

Let's design a complete IoT product: an environmental monitoring node that measures temperature, humidity, and pressure, then transmits data wirelessly every 5 minutes. It must run for 2+ years on a single battery. This is a real product architecture used in agriculture, building management, and industrial monitoring.

System specification:
• MCU: STM32L476 (ultra-low-power Cortex-M4, 80 MHz, 1 MB Flash, 128 KB RAM)
• Sensor: BME280 (temp/humidity/pressure via I2C, address 0x76)
• Radio: SX1276 (LoRa long-range radio via SPI, up to 15 km)
• Power: 3000 mAh 18650 Li-ion cell + LDO regulator (3.3V)

Hardware Connections

Component	Interface	MCU Pins	Speed
BME280	I2C1	PB6 (SCL), PB7 (SDA)	400 kHz
SX1276	SPI1	PA5 (SCK), PA6 (MISO), PA7 (MOSI), PA4 (NSS)	10 MHz
SX1276 DIO0	EXTI	PC4 (TX done interrupt)	N/A
Status LED	GPIO	PA0 (active low)	N/A

Firmware State Machine

SLEEP (Stop Mode 2)

2 μA. RTC counts down 5 minutes.

↓ RTC wakeup interrupt

WAKE

Reconfigure clocks (HSE+PLL), re-enable peripherals. 5 ms.

↓

SENSE

I2C read BME280. Forced measurement mode. 50 ms.

↓

TRANSMIT

SPI configure SX1276, send 12-byte LoRa packet. 100 ms @ 120 mA.

↓ DIO0 interrupt (TX done)

SLEEP

Disable peripherals, enter Stop Mode 2.
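
The four states and their triggers can be written down as an explicit transition function. This is a minimal sketch, assuming one event per arrow in the diagram above; the names `NodeState_t`, `NodeEvent_t`, and `node_next_state` are hypothetical and chosen only for illustration.

```c
#include <assert.h>

typedef enum { ST_SLEEP, ST_WAKE, ST_SENSE, ST_TRANSMIT } NodeState_t;

typedef enum {
    EV_RTC_WAKEUP,    /* RTC wakeup interrupt fired         */
    EV_CLOCKS_READY,  /* clocks and peripherals are back up */
    EV_SAMPLE_READY,  /* BME280 measurement has been read   */
    EV_TX_DONE        /* SX1276 DIO0 interrupt (TX done)    */
} NodeEvent_t;

/* Returns the next state. An event that does not apply to the current
 * state is ignored and the state is left unchanged. */
NodeState_t node_next_state(NodeState_t s, NodeEvent_t e)
{
    assert(s <= ST_TRANSMIT);
    switch (s) {
    case ST_SLEEP:    return (e == EV_RTC_WAKEUP)   ? ST_WAKE     : s;
    case ST_WAKE:     return (e == EV_CLOCKS_READY) ? ST_SENSE    : s;
    case ST_SENSE:    return (e == EV_SAMPLE_READY) ? ST_TRANSMIT : s;
    case ST_TRANSMIT: return (e == EV_TX_DONE)      ? ST_SLEEP    : s;
    }
    return s;
}
```

Keeping the transitions in one pure function makes them easy to unit-test on a PC, separately from the hardware actions each state performs.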

c
// Main firmware loop — complete IoT sensor node
typedef struct {
    int16_t  temperature;  // 0.01 °C resolution (2345 = 23.45°C)
    uint16_t humidity;     // 0.01 %RH resolution
    uint32_t pressure;     // Pa (101325 = 1013.25 hPa)
    uint16_t battery_mv;   // Battery voltage in mV
    uint16_t seq_num;      // Packet sequence number
} SensorPacket_t;  // 12 bytes total

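// Set to 1 by the DIO0 EXTI interrupt handler when the radio reports TX done.
// volatile so the wait loop in main() re-reads it on every iteration.
static volatile uint8_t sx1276_tx_done = 0;

int main(void) {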
    HAL_Init();
    SystemClock_Config();  // 80 MHz from HSE+PLL
    GPIO_Init();
    I2C1_Init();           // 400 kHz for BME280
    SPI1_Init();           // 10 MHz for SX1276
    RTC_Init();            // LSE 32.768 kHz crystal
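    IWDG_Init();           // 8 second timeout. The 5-minute sleep is far longer, so the
                           // IWDG must be frozen in Stop mode (IWDG_STOP option bit),
                           // otherwise it resets the MCU in the middle of every sleep.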

    BME280_Init();         // Configure oversampling, filter
    SX1276_Init();         // Configure LoRa: SF7, BW125, CR4/5

    uint16_t seq = 0;

    while (1) {
        // === SENSE ===
        SensorPacket_t pkt;
        BME280_TriggerMeasurement();       // Start forced conversion
        HAL_Delay(50);                      // Wait for measurement
        BME280_Read(&pkt.temperature, &pkt.humidity, &pkt.pressure);
        pkt.battery_mv = ADC_ReadBattery();
        pkt.seq_num = seq++;

        // === TRANSMIT ===
        SX1276_Transmit((uint8_t*)&pkt, sizeof(pkt));
        while (!sx1276_tx_done);            // Wait for DIO0 interrupt
        sx1276_tx_done = 0;

        // === SLEEP ===
        SX1276_Sleep();                    // Put radio in sleep (1 µA)
        IWDG_Kick();                       // Kick before sleeping
        Enter_Stop_Mode(300);              // Sleep 5 minutes

        // === WAKE (execution resumes here) ===
        SystemClock_Config();              // Restore 80 MHz
        IWDG_Kick();                       // Kick immediately after wake
    }
}
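
The loop above transmits the struct by casting it to a byte pointer, which ties the on-air format to the compiler's struct layout and the CPU's byte order. A receiver on a different platform is easier to write if the bytes are packed explicitly. The following is a minimal sketch of that alternative, assuming a little-endian wire format; `packet_serialize`, `put_u16_le`, and `put_u32_le` are hypothetical helpers, not part of any SX1276 driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PACKET_WIRE_SIZE 12u

typedef struct {
    int16_t  temperature;  /* 0.01 degC units */
    uint16_t humidity;     /* 0.01 %RH units  */
    uint32_t pressure;     /* Pa              */
    uint16_t battery_mv;   /* mV              */
    uint16_t seq_num;      /* sequence number */
} SensorPacket_t;

/* Write a 16-bit value, least significant byte first. */
static void put_u16_le(uint8_t *p, uint16_t v)
{
    p[0] = (uint8_t)(v & 0xFFu);
    p[1] = (uint8_t)(v >> 8);
}

/* Write a 32-bit value, least significant byte first. */
static void put_u32_le(uint8_t *p, uint32_t v)
{
    put_u16_le(p, (uint16_t)(v & 0xFFFFu));
    put_u16_le(p + 2, (uint16_t)(v >> 16));
}

/* Pack the fields into exactly 12 bytes, independent of struct padding
 * and host endianness. Returns the number of bytes written. */
size_t packet_serialize(const SensorPacket_t *pkt, uint8_t out[PACKET_WIRE_SIZE])
{
    assert(pkt != NULL && out != NULL);
    put_u16_le(&out[0],  (uint16_t)pkt->temperature);  /* two's complement */
    put_u16_le(&out[2],  pkt->humidity);
    put_u32_le(&out[4],  pkt->pressure);
    put_u16_le(&out[8],  pkt->battery_mv);
    put_u16_le(&out[10], pkt->seq_num);
    return PACKET_WIRE_SIZE;
}
```

On the Cortex-M4 with this particular field order the result is byte-for-byte identical to the cast, so the explicit version costs a few instructions and buys a format that no longer changes if a field is added or reordered.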

Power Budget

Phase	Duration	Current	Charge per Cycle
Wake + Clock config	5 ms	10 mA	0.0139 μAh
I2C sensor read	50 ms	5 mA	0.0694 μAh
SPI radio TX	100 ms	120 mA	3.333 μAh
Stop Mode sleep	299.845 s	2 μA	0.1666 μAh
Total per cycle	300 s		3.583 μAh

I_avg = 3.583 μAh / (300/3600 h) = 3.583 / 0.0833 = 43 μA

Battery life = 3000 mAh / 0.043 mA ≈ 69,767 hours ≈ 7.96 years

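Result: With a 3000 mAh 18650 battery, this design would ideally last nearly 8 years on a single charge. In practice, effects the table leaves out reduce this: Li-ion self-discharge (often quoted at 1-2% per month rather than per year), LDO quiescent current, and capacity lost to aging and temperature. A realistic figure is closer to ~4-5 years or less and should be confirmed by measurement. That is still above the 2-year requirement.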
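
The arithmetic in the power budget is easy to get wrong by a factor of 1000 when done by hand, so it is worth scripting. The following is a minimal host-side sketch, not target code; `PowerPhase_t`, `average_current_ma`, and `battery_life_years` are hypothetical names, and the result is the ideal figure only (no self-discharge or regulator losses).

```c
#include <assert.h>
#include <stddef.h>

/* One phase of the duty cycle. */
typedef struct {
    double duration_s;  /* time spent in this phase, seconds */
    double current_ma;  /* current drawn in this phase, mA   */
} PowerPhase_t;

/* Time-weighted average current over one full cycle, in mA. */
double average_current_ma(const PowerPhase_t *phases, size_t count)
{
    assert(phases != NULL && count > 0);
    double charge_mas = 0.0;  /* accumulated charge, mA * s */
    double total_s    = 0.0;  /* total cycle time, s        */
    for (size_t i = 0; i < count; i++) {
        charge_mas += phases[i].duration_s * phases[i].current_ma;
        total_s    += phases[i].duration_s;
    }
    return charge_mas / total_s;
}

/* Ideal battery life in years, using 365-day years. */
double battery_life_years(double capacity_mah, double avg_current_ma)
{
    assert(avg_current_ma > 0.0);
    return capacity_mah / avg_current_ma / (24.0 * 365.0);
}
```

Fed with the four phases from the table (5 ms at 10 mA, 50 ms at 5 mA, 100 ms at 120 mA, 299.845 s at 0.002 mA), it reproduces the 43 μA average and a battery life just under 8 years. Changing one row, such as the TX duration, shows immediately how strongly the radio dominates.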

In the power budget, which phase dominates the energy consumption per cycle?

• Radio TX (120 mA for 100 ms = 3.33 µAh, 93% of total)
• Sleep mode (2 µA for 300 seconds)
• Sensor read (5 mA for 50 ms)
• Clock configuration (10 mA for 5 ms)

Chapter 9: Mastery & Connections

You now understand the complete embedded firmware stack: from the first byte of the vector table through clock configuration, DMA transfers, peripheral drivers, power optimization, fault handling, and full IoT system design. Let's consolidate with reference tables and a design challenge.

Peripheral Configuration Cheat Sheet

Peripheral	Enable Clock	Key Registers	Typical Config
GPIO	RCC->AHB1ENR	MODER, ODR, IDR, AFR	Set MODER for mode, AFR for alternate function
SPI	RCC->APB1/2ENR	CR1, CR2, DR, SR	Master, 8-bit, CPOL/CPHA, baud prescaler
I2C	RCC->APB1ENR	CR1, CR2, DR, SR1, SR2	400 kHz, 7-bit addr, interrupt mode
UART	RCC->APB1/2ENR	CR1, BRR, DR, SR	BRR = PCLK/baud, enable RXNEIE for IRQ
ADC	RCC->APB2ENR	CR1, CR2, SQR, DR	12-bit, single conversion, DMA enable
Timer	RCC->APB1/2ENR	CR1, PSC, ARR, CCR	PSC = (timer clock / tick rate) - 1, ARR = period in ticks - 1
DMA	RCC->AHB1ENR	CR, PAR, M0AR, NDTR	Channel select, direction, sizes, MINC, CIRC
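
The Timer and UART rows each hide a small calculation. The following is a minimal sketch of both, assuming 16x oversampling for the UART; `timer_psc`, `timer_arr`, and `uart_brr` are hypothetical helper names. Note that on the STM32F4 the timer clock can be twice the APB clock when the APB prescaler is not 1, so pass the actual timer clock rather than PCLK.

```c
#include <assert.h>
#include <stdint.h>

/* PSC register value that makes the counter tick at tick_hz.
 * The hardware divides the timer clock by (PSC + 1). */
uint32_t timer_psc(uint32_t timer_clk_hz, uint32_t tick_hz)
{
    assert(tick_hz > 0u && timer_clk_hz >= tick_hz);
    return timer_clk_hz / tick_hz - 1u;
}

/* ARR register value for an update event every period_ticks counter ticks.
 * The counter runs from 0 to ARR inclusive, hence the -1. */
uint32_t timer_arr(uint32_t period_ticks)
{
    assert(period_ticks > 0u);
    return period_ticks - 1u;
}

/* BRR register value for 16x oversampling: PCLK / baud, rounded to nearest. */
uint32_t uart_brr(uint32_t pclk_hz, uint32_t baud)
{
    assert(baud > 0u);
    return (pclk_hz + baud / 2u) / baud;
}
```

For example, an 84 MHz timer clock and a 1 MHz tick give PSC = 83, and a 1000-tick period (1 ms) gives ARR = 999.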

DMA Channel/Stream Assignment (STM32F4)

DMA	Stream	Channel	Peripheral
DMA2	Stream 0	Ch 0	ADC1
DMA2	Stream 3	Ch 3	SPI1_TX
DMA2	Stream 2	Ch 3	SPI1_RX
DMA1	Stream 5	Ch 4	USART2_RX
DMA1	Stream 6	Ch 4	USART2_TX
DMA1	Stream 0	Ch 1	I2C1_RX

Power Optimization Checklist

Before shipping any battery-powered firmware:
□ All unused GPIO pins set to analog mode (lowest leakage)
□ All unused peripheral clocks disabled
□ Debug pins (SWD) disabled in production build
□ Flash prefetch enabled during Run mode
□ Voltage regulator in low-power mode during Stop
□ LSE crystal for RTC, not LSI (a crystal is accurate to tens of ppm, the LSI RC oscillator only to a few percent, which adds up over long sleeps)
□ Radio in sleep mode when not transmitting
□ Sensors in forced/one-shot mode (not continuous)
□ Measure actual current with multimeter — never trust calculations alone

Design Challenge: Smart Door Lock

Your turn. Design the firmware architecture for a battery-powered smart door lock with these components:

• MCU: STM32L4 (ultra-low-power)
• BLE radio: nRF52832 module (SPI, 10 mA active, 2 μA sleep)
• Motor: DC motor with H-bridge (500 mA for 1 second to lock/unlock)
• Accelerometer: LIS3DH (I2C, wake-on-motion interrupt)
• Battery: 4x AA (6V, 2500 mAh after LDO to 3.3V)

Questions to answer:
1. What's the sleep mode? (accelerometer wake-on-motion as primary wake source)
2. What's the duty cycle? (mostly sleeping, BLE advertising only after motion detected)
3. How do you prevent unauthorized motor activation? (crypto challenge-response over BLE)
4. Estimated battery life? (Calculate: 99.99% sleep @ 5 μA + 10 unlocks/day @ 500 mA for 1s)

Comparison: Bare-Metal vs RTOS

Aspect	Bare-Metal (this lesson)	RTOS (FreeRTOS, Zephyr)
Complexity	Low (super-loop)	Medium (tasks, queues, mutexes)
RAM overhead	0 bytes	~2-8 KB (kernel + task stacks)
Timing	Deterministic (you control every cycle)	Preemptive (scheduler decides)
Concurrency	ISR + main loop	Multiple tasks + ISR
Best for	Simple sensors, tight power budgets	Complex systems (USB + BLE + display + ...)
Debug difficulty	Low	High (race conditions, priority inversion)

Where to Go Next

You've mastered bare-metal embedded C. The natural next steps:

• RTOS — FreeRTOS or Zephyr for complex multi-task systems

• Hardware design — Schematic + PCB layout to build your own boards

• Wireless protocols — BLE, LoRa, Zigbee, Thread/Matter for IoT

• Motor control — PWM, PID loops, FOC for brushless motors

"What I cannot create, I do not understand." — Richard Feynman

You now have enough knowledge to create a complete embedded system from scratch: write the startup code, configure clocks, set up DMA transfers, drive peripherals, manage power, handle faults, and design for years of battery life. Go build something.

For a battery-powered IoT sensor that reads data once every 10 minutes, which approach minimizes power consumption?

• Run at full speed continuously and poll the timer
• Use Sleep mode with a SysTick interrupt every 1 ms
• Enter Stop mode and wake on RTC alarm every 10 minutes
• Use Standby mode and wake on external pin change
