The embedded software stack, hardware peripherals, optimization, and IoT design — from register bits to shipping firmware.
On your laptop, you write Python. You call print() and text appears. You call open("file.txt") and the OS finds a disk, locates the file system, manages memory, schedules your process among hundreds of others. You never think about how hardware works because the operating system is a thick blanket between you and the metal.
On a microcontroller, there is no OS. There is no print(). There is no file system. There is no memory manager. There is no scheduler. You ARE the operating system. Your C code talks directly to hardware through memory-mapped registers — specific addresses in memory that, when written to, physically change the behavior of silicon.
But raw register-bashing everywhere creates unmaintainable spaghetti. So embedded engineers organize code into layers:
Why bother with layers? Portability. If you swap from an STM32F4 to an STM32L4, you only rewrite the HAL. Drivers, middleware, and application code stay the same. Without layers, changing MCU means rewriting everything.
Here is what a raw register write looks like vs. a HAL call:
```c
// Raw register: turn on LED on PA5 (STM32F4)
*(volatile uint32_t*)0x40020014 |= (1 << 5);  // GPIOA->ODR bit 5

// HAL equivalent: portable, readable
HAL_GPIO_WritePin(GPIOA, GPIO_PIN_5, GPIO_PIN_SET);
```
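Both have the same effect: bit 5 of the GPIOA output data register becomes 1, driving pin PA5 high and lighting the LED. (ST's HAL gets there by writing the bit set/reset register, BSRR, instead of a read-modify-write on ODR.) The HAL call, however, compiles unchanged on any STM32 family the HAL supports.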
Notice how each layer only talks to the one directly below it. The application never writes to a register directly. The HAL never makes application-level decisions. This discipline is what makes firmware maintainable as projects grow to 50,000+ lines.
| Layer | Example Function | Touches Hardware? |
|---|---|---|
| Application | read_temperature() | No |
| Middleware | filter_samples(buf, len) | No |
| Driver | i2c_read(addr, reg, data, len) | Via HAL |
| HAL | HAL_I2C_Mem_Read(...) | Yes (registers) |
| Hardware | I2C peripheral silicon | IS hardware |
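A minimal sketch of how these layers might be wired together so that everything above the HAL can run on a PC against a fake. The function names read_temperature(), filter_samples() and i2c_read() come from the table; hal_ops_t, hal_bind(), fake_i2c_mem_read(), the 0x48 address and the raw-count return value are hypothetical choices made for illustration, not ST's HAL.

```c
#include <stdint.h>
#include <stddef.h>

/* HAL boundary (hypothetical interface): on real hardware this is the only
 * layer that would touch peripheral registers. */
typedef struct {
    int (*i2c_mem_read)(uint8_t addr, uint8_t reg, uint8_t *data, size_t len);
} hal_ops_t;

static const hal_ops_t *hal;  /* bound once at startup */

void hal_bind(const hal_ops_t *ops) { hal = ops; }

/* Driver layer: knows the bus transaction, reaches hardware only via the HAL. */
int i2c_read(uint8_t addr, uint8_t reg, uint8_t *data, size_t len) {
    if (hal == NULL) return -1;
    return hal->i2c_mem_read(addr, reg, data, len);
}

/* Middleware layer: pure computation (here a plain average). */
int32_t filter_samples(const int16_t *buf, size_t len) {
    int32_t sum = 0;
    for (size_t i = 0; i < len; i++) sum += buf[i];
    return len ? sum / (int32_t)len : 0;
}

/* Application layer: returns the averaged raw reading, INT32_MIN on bus error. */
int32_t read_temperature(void) {
    int16_t samples[4];
    for (size_t i = 0; i < 4; i++) {
        uint8_t raw[2];
        if (i2c_read(0x48, 0x00, raw, sizeof raw) != 0) return INT32_MIN;
        samples[i] = (int16_t)((raw[0] << 8) | raw[1]);  /* big-endian pair */
    }
    return filter_samples(samples, 4);
}

/* Fake HAL for host testing: every read returns the bytes 0x19 0x00. */
int fake_i2c_mem_read(uint8_t addr, uint8_t reg, uint8_t *data, size_t len) {
    (void)addr; (void)reg;
    for (size_t i = 0; i < len; i++) data[i] = (i == 0) ? 0x19 : 0x00;
    return 0;
}
```

Swapping MCUs then means providing a different hal_ops_t; the three upper functions do not change.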
You write int main(void) { ... } and assume it runs. But what happens before main? On your laptop, the OS loads your binary, sets up the stack, initializes libc, and jumps to main. On a microcontroller, there is no OS to do this. A small piece of assembly code — the startup code — does it instead.
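When you press reset (or power on), the Cortex-M CPU does exactly two things:
1. It loads the initial stack pointer from the first word of the vector table.
2. It loads the program counter from the second word, the address of the reset handler, and starts executing there.

The vector table is a table of addresses stored at the very beginning of Flash memory. The first entry is the initial stack pointer. The second is the reset handler address. The remaining entries are exception and interrupt handler addresses (we'll use those later).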
```c
// Vector table (simplified): lives at address 0x08000000 in Flash
uint32_t vectors[] __attribute__((section(".isr_vector"))) = {
    (uint32_t)&_estack,           // 0x00: Initial Stack Pointer (top of SRAM)
    (uint32_t)Reset_Handler,      // 0x04: Reset handler address
    (uint32_t)NMI_Handler,        // 0x08: Non-maskable interrupt
    (uint32_t)HardFault_Handler,  // 0x0C: Hard fault
    // ... more interrupt vectors ...
};
```
The Reset_Handler is startup assembly code that performs these steps in order:
1. Copy the .data section from Flash to SRAM (initialized global variables)
2. Zero the .bss section in SRAM (uninitialized globals = 0)
3. Call SystemInit() to configure the clock system
4. Call __libc_init_array() to run C++ constructors (if any)
5. Call main()

Why copy .data from Flash? Because global variables like int counter = 42; need their initial value (42) stored somewhere permanent (Flash), but they live in SRAM at runtime so they can be modified. The startup code copies these initial values from Flash to SRAM.
Why zero .bss? The C standard guarantees that uninitialized globals start at zero. The startup code enforces this by memset-ing the .bss region to 0x00.
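The two memory-initialization steps can also be sketched in plain C, which makes them easy to try on a PC. This is only an illustration: the function names are hypothetical, and on a real target the pointer arguments would come from linker-script symbols marking the section boundaries rather than from ordinary arrays.

```c
#include <stdint.h>

/* Copy the initial values of .data from their load address (Flash) to
 * their run address (SRAM). dst_end points one past the last word. */
void startup_copy_data(const uint32_t *src, uint32_t *dst, uint32_t *dst_end) {
    while (dst < dst_end) {
        *dst++ = *src++;
    }
}

/* Clear .bss so that globals without an initializer start at zero. */
void startup_zero_bss(uint32_t *start, uint32_t *end) {
    while (start < end) {
        *start++ = 0;
    }
}
```

Both loops work a word at a time, which is why linker scripts usually align the section boundaries to 4 bytes.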
```arm-asm
/* Simplified Reset_Handler (ARM Cortex-M) */
Reset_Handler:
    /* Copy .data from Flash to SRAM */
    ldr   r0, =_sdata       @ destination start (SRAM)
    ldr   r1, =_edata       @ destination end
    ldr   r2, =_sidata      @ source (Flash)
copy_loop:
    cmp   r0, r1
    bge   zero_bss
    ldr   r3, [r2], #4      @ load word from Flash, advance
    str   r3, [r0], #4      @ store word to SRAM, advance
    b     copy_loop

zero_bss:
    /* Zero .bss in SRAM */
    ldr   r0, =_sbss        @ bss start
    ldr   r1, =_ebss        @ bss end
    movs  r2, #0
bss_loop:
    cmp   r0, r1
    bge   call_main
    str   r2, [r0], #4      @ store zero, advance
    b     bss_loop

call_main:
    bl    SystemInit        @ configure clocks
    bl    main              @ jump to your code!
    b     .                 @ infinite loop if main returns
```
Note: the b . instruction (branch to self) sits right after bl main as a safety net in case main() ever returns. In practice, your main() should contain while(1) { ... } and never return.
Here is the memory layout at boot:
| Memory Region | Address Range (STM32F4) | Contents |
|---|---|---|
| Flash | 0x0800 0000 – 0x080F FFFF | Code + .data init values + vector table |
| SRAM | 0x2000 0000 – 0x2001 FFFF | .data (copied) + .bss (zeroed) + heap + stack |
| Peripherals | 0x4000 0000 – 0x5FFF FFFF | Memory-mapped registers |
| Cortex-M Core | 0xE000 0000 – 0xE00F FFFF | NVIC, SysTick, debug registers |
Every digital circuit needs a clock — a periodic signal that tells transistors when to evaluate their inputs. The MCU's clock determines how fast instructions execute, how fast peripherals run, and how much power is consumed. A faster clock = faster execution but more power. Embedded systems carefully configure clocks to balance speed and battery life.
A typical STM32 has multiple clock sources:
| Source | Frequency | Accuracy | Use Case |
|---|---|---|---|
| HSI (High-Speed Internal) | 8-16 MHz | ±1-2% | Fast boot, no external parts |
| HSE (High-Speed External) | 4-26 MHz crystal | ±20 ppm | Precise timing, USB, radio |
| LSI (Low-Speed Internal) | 32 kHz | ±5% | Watchdog timer |
| LSE (Low-Speed External) | 32.768 kHz crystal | ±20 ppm | RTC (real-time clock) |
| PLL (Phase-Locked Loop) | Up to 480 MHz | Derived from HSI/HSE | Maximum CPU speed |
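The PLL is the key. It takes a reference clock (HSI or HSE), divides it by PLL_M, multiplies it by PLL_N, then divides by PLL_P to produce SYSCLK. The formula:

$$f_{SYSCLK} = \frac{f_{in}}{PLL\_M} \times \frac{PLL\_N}{PLL\_P}$$

For the 168 MHz configuration in the code below (M = 8, N = 336, P = 2), this works out with an 8 MHz HSE crystal: 8 MHz / 8 × 336 / 2 = 168 MHz.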
The clock tree distributes SYSCLK to all subsystems via prescalers (dividers):
```c
// STM32F4 clock configuration for 168 MHz
void SystemClock_Config(void) {
    // Enable HSE
    RCC->CR |= RCC_CR_HSEON;
    while (!(RCC->CR & RCC_CR_HSERDY));  // Wait for HSE ready

    // Configure PLL: source=HSE, M=8, N=336, P=2, Q=7
    RCC->PLLCFGR = RCC_PLLCFGR_PLLSRC_HSE
                 | (8   << RCC_PLLCFGR_PLLM_Pos)   // M = 8
                 | (336 << RCC_PLLCFGR_PLLN_Pos)   // N = 336
                 | (0   << RCC_PLLCFGR_PLLP_Pos)   // P = 2 (0 = /2)
                 | (7   << RCC_PLLCFGR_PLLQ_Pos);  // Q = 7 (for USB)

    // Enable PLL
    RCC->CR |= RCC_CR_PLLON;
    while (!(RCC->CR & RCC_CR_PLLRDY));  // Wait for PLL lock

    // Configure Flash latency (5 wait states for 168 MHz)
    FLASH->ACR = FLASH_ACR_LATENCY_5WS | FLASH_ACR_PRFTEN | FLASH_ACR_ICEN;

    // Set bus prescalers: AHB=/1, APB1=/4, APB2=/2
    RCC->CFGR = RCC_CFGR_HPRE_DIV1
              | RCC_CFGR_PPRE1_DIV4
              | RCC_CFGR_PPRE2_DIV2
              | RCC_CFGR_SW_PLL;  // Switch SYSCLK to PLL
    while ((RCC->CFGR & RCC_CFGR_SWS) != RCC_CFGR_SWS_PLL);  // Wait for switch
}
```
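To sanity-check a set of PLL and prescaler values before flashing them, the arithmetic can be run on a PC. The sketch below assumes the usual STM32F4 relationship (reference divided by M, multiplied by N, divided by P); clock_tree_t and clock_tree_compute() are hypothetical names, and the function deliberately does not check the datasheet limits on VCO or bus frequencies.

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint32_t sysclk_hz;
    uint32_t ahb_hz;
    uint32_t apb1_hz;
    uint32_t apb2_hz;
} clock_tree_t;

/* SYSCLK = f_in / M * N / P, then each bus clock is divided down from it.
 * Returns false if any divider is zero. */
bool clock_tree_compute(uint32_t f_in_hz, uint32_t m, uint32_t n, uint32_t p,
                        uint32_t ahb_div, uint32_t apb1_div, uint32_t apb2_div,
                        clock_tree_t *out) {
    if (m == 0 || p == 0 || ahb_div == 0 || apb1_div == 0 || apb2_div == 0) {
        return false;
    }
    uint64_t vco_hz = (uint64_t)f_in_hz * n / m;  /* 64-bit: f_in * N overflows 32 bits */
    out->sysclk_hz = (uint32_t)(vco_hz / p);
    out->ahb_hz    = out->sysclk_hz / ahb_div;
    out->apb1_hz   = out->ahb_hz / apb1_div;
    out->apb2_hz   = out->ahb_hz / apb2_div;
    return true;
}
```

With an 8 MHz reference and M=8, N=336, P=2, AHB=/1, APB1=/4, APB2=/2, this gives SYSCLK = 168 MHz, APB1 = 42 MHz and APB2 = 84 MHz.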
Imagine you're receiving 1024 bytes over SPI from a sensor. Without DMA, the CPU must execute a loop: read one byte from the SPI data register, store it in a buffer, repeat 1024 times. The CPU is fully occupied doing glorified copying — it can't process data, update displays, or handle other events during this time.
DMA (Direct Memory Access) is a hardware peripheral that copies data between memory and peripherals without CPU involvement. You tell the DMA controller: "Copy 1024 bytes from address X to address Y, then interrupt me when done." The CPU is free to do other work while the DMA engine handles the transfer in the background.
DMA has several modes:
| Mode | Description | Use Case |
|---|---|---|
| Normal | Transfer N items, stop, interrupt | One-shot sensor reads |
| Circular | Transfer N items, auto-restart from beginning | Continuous ADC sampling |
| Double-buffer | Alternate between two buffers automatically | Audio streaming (process buf A while filling buf B) |
| Transfer Direction | Source | Destination | Example |
|---|---|---|---|
| Peripheral → Memory | ADC_DR, SPI_DR | SRAM buffer | ADC samples to buffer |
| Memory → Peripheral | SRAM buffer | SPI_DR, DAC_DHR | Display framebuffer to SPI |
| Memory → Memory | SRAM region A | SRAM region B | Fast memcpy in hardware |
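The double-buffer idea (process one buffer while the other fills) can be modelled in software before any DMA registers are involved. In this sketch pingpong_push() stands in for a single DMA transfer and the buffer-full branch stands in for the transfer-complete interrupt; every name here is hypothetical and the "processing" is just a sum.

```c
#include <stdint.h>
#include <stddef.h>

#define HALF_LEN 4  /* samples per buffer, kept tiny for illustration */

typedef struct {
    uint16_t buf[2][HALF_LEN];
    int      filling;       /* index (0 or 1) of the buffer being written */
    size_t   pos;           /* next write position inside that buffer */
    uint32_t last_sum;      /* result of processing the most recent full buffer */
    unsigned buffers_done;  /* how many full buffers have been processed */
} pingpong_t;

/* Consumer: on real hardware this would run while the other buffer fills. */
static void pingpong_process(pingpong_t *pp, const uint16_t *full) {
    uint32_t sum = 0;
    for (size_t i = 0; i < HALF_LEN; i++) sum += full[i];
    pp->last_sum = sum;
    pp->buffers_done++;
}

/* Producer: models one DMA beat writing a sample into the active buffer. */
void pingpong_push(pingpong_t *pp, uint16_t sample) {
    pp->buf[pp->filling][pp->pos++] = sample;
    if (pp->pos == HALF_LEN) {      /* "transfer complete" for this buffer */
        int full = pp->filling;
        pp->filling ^= 1;           /* hardware switches to the other buffer */
        pp->pos = 0;
        pingpong_process(pp, pp->buf[full]);
    }
}
```

The constraint this makes visible: processing one buffer must finish before the other buffer fills, or data is overwritten.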
```c
// Configure DMA2 Stream0 to transfer 1024 ADC samples to SRAM buffer
uint16_t adc_buffer[1024];

void DMA_ADC_Init(void) {
    // Enable DMA2 clock
    RCC->AHB1ENR |= RCC_AHB1ENR_DMA2EN;

    // Disable stream before configuring
    DMA2_Stream0->CR &= ~DMA_SxCR_EN;
    while (DMA2_Stream0->CR & DMA_SxCR_EN);  // Wait until disabled

    // Configure stream
    DMA2_Stream0->CR = (0 << DMA_SxCR_CHSEL_Pos)  // Channel 0 (ADC1)
                     | DMA_SxCR_MSIZE_0            // Memory size: 16-bit
                     | DMA_SxCR_PSIZE_0            // Peripheral size: 16-bit
                     | DMA_SxCR_MINC               // Memory address increment
                     | DMA_SxCR_CIRC               // Circular mode
                     | DMA_SxCR_TCIE;              // Transfer complete interrupt

    // Set addresses and count
    DMA2_Stream0->PAR  = (uint32_t)&ADC1->DR;     // Source: ADC data reg
    DMA2_Stream0->M0AR = (uint32_t)adc_buffer;    // Dest: our buffer
    DMA2_Stream0->NDTR = 1024;                    // Number of transfers

    // Enable stream
    DMA2_Stream0->CR |= DMA_SxCR_EN;
}
```
The performance difference is dramatic. For 1024 16-bit ADC samples at 84 MHz APB2:
| Method | CPU Cycles Used | CPU Availability |
|---|---|---|
| Polling (no DMA) | ~10,240 cycles | 0% during transfer |
| Interrupt per sample | ~30 cycles × 1024 = 30,720 | Intermittent (context switch overhead) |
| DMA | ~50 cycles (setup + ISR) | 99.5% during transfer |
Now we combine everything: DMA, interrupts, and driver architecture to build real peripheral interfaces. Every production driver follows the same pattern: init → configure → start → ISR → callback. Let's build three complete drivers.
SPI (Serial Peripheral Interface) is a high-speed (up to 50 MHz) full-duplex bus with 4 wires: MOSI (data out), MISO (data in), SCK (clock), CS (chip select). Perfect for displays because it's fast and you're usually only sending data (display doesn't talk back much).
```c
// SPI1 driver for SSD1306 OLED (128x64, monochrome)
// Pins: PA5=SCK, PA7=MOSI, PA4=CS, PA3=DC (data/command), PA2=RST
#define SSD1306_WIDTH   128
#define SSD1306_HEIGHT  64
#define SSD1306_BUFSIZE (SSD1306_WIDTH * SSD1306_HEIGHT / 8)  // 1024 bytes

static uint8_t framebuffer[SSD1306_BUFSIZE];
static volatile uint8_t dma_busy = 0;

void SPI1_Init(void) {
    // Enable clocks
    RCC->AHB1ENR |= RCC_AHB1ENR_GPIOAEN;
    RCC->APB2ENR |= RCC_APB2ENR_SPI1EN;
    RCC->AHB1ENR |= RCC_AHB1ENR_DMA2EN;

    // Configure PA5, PA7 as AF5 (SPI1), PA4/PA3/PA2 as output
    GPIOA->MODER |= (2 << 10) | (2 << 14)           // PA5, PA7 = alternate function
                  | (1 << 8) | (1 << 6) | (1 << 4); // PA4,3,2 = output
    GPIOA->AFR[0] |= (5 << 20) | (5 << 28);         // AF5 for PA5, PA7

    // SPI config: master, 8-bit, CPOL=0, CPHA=0, baud=/4 (21 MHz)
    SPI1->CR1 = SPI_CR1_MSTR | SPI_CR1_SSM | SPI_CR1_SSI
              | (1 << SPI_CR1_BR_Pos);  // Baud = APB2/4 = 84/4 = 21 MHz
    SPI1->CR2 = SPI_CR2_TXDMAEN;        // Enable DMA for TX
    SPI1->CR1 |= SPI_CR1_SPE;           // Enable SPI

    NVIC_EnableIRQ(DMA2_Stream3_IRQn);  // Without this the completion ISR never runs
}

void SSD1306_Flush(void) {
    while (dma_busy);              // Wait for previous transfer
    dma_busy = 1;
    GPIOA->BSRR = (1 << 3);        // DC pin HIGH = data mode
    GPIOA->BSRR = (1 << 20);       // CS pin LOW (active)

    // Configure DMA2 Stream3 Ch3 for SPI1_TX
    DMA2_Stream3->CR &= ~DMA_SxCR_EN;
    while (DMA2_Stream3->CR & DMA_SxCR_EN);  // Wait until the stream is really off
    DMA2->LIFCR = DMA_LIFCR_CTCIF3 | DMA_LIFCR_CHTIF3 | DMA_LIFCR_CTEIF3
                | DMA_LIFCR_CDMEIF3 | DMA_LIFCR_CFEIF3;  // Clear stale flags
    DMA2_Stream3->CR = (3 << DMA_SxCR_CHSEL_Pos)
                     | DMA_SxCR_MINC
                     | DMA_SxCR_DIR_0        // Mem-to-peripheral
                     | DMA_SxCR_TCIE;        // Transfer complete IRQ
    DMA2_Stream3->PAR  = (uint32_t)&SPI1->DR;
    DMA2_Stream3->M0AR = (uint32_t)framebuffer;
    DMA2_Stream3->NDTR = SSD1306_BUFSIZE;
    DMA2_Stream3->CR |= DMA_SxCR_EN;  // GO!
}

// DMA transfer complete ISR
void DMA2_Stream3_IRQHandler(void) {
    DMA2->LIFCR = DMA_LIFCR_CTCIF3;    // Clear interrupt flag
    // DMA has handed over the last byte, but SPI may still be shifting it out
    while (!(SPI1->SR & SPI_SR_TXE));
    while (SPI1->SR & SPI_SR_BSY);
    GPIOA->BSRR = (1 << 4);            // CS HIGH (deselect)
    dma_busy = 0;                      // Signal completion
}
```
I2C (Inter-Integrated Circuit) is a 2-wire bus (SDA + SCL) that supports multiple devices on one bus, each with a unique 7-bit address. Slower than SPI (100-400 kHz typically) but uses fewer pins. Perfect for sensors.
```c
// I2C1 interrupt-driven read from BME280 sensor (address 0x76)
// Simplified: assumes 2 <= len <= 8 and that I2C1_EV_IRQn is enabled in the NVIC
#define BME280_ADDR (0x76 << 1)  // 7-bit addr shifted left for R/W bit

static volatile uint8_t i2c_buf[8];
static volatile uint8_t i2c_idx = 0;
static volatile uint8_t i2c_len = 0;   // Number of bytes requested
static volatile uint8_t i2c_done = 0;

void I2C1_Read_IT(uint8_t reg, uint8_t len) {
    i2c_idx = 0;
    i2c_len = len;
    i2c_done = 0;

    // Send register address (write phase, polled)
    I2C1->CR1 |= I2C_CR1_START;               // Generate START
    while (!(I2C1->SR1 & I2C_SR1_SB));        // Wait for START sent
    I2C1->DR = BME280_ADDR | 0;               // Address + Write
    while (!(I2C1->SR1 & I2C_SR1_ADDR));      // Wait for ACK
    (void)I2C1->SR2;                          // Clear ADDR flag
    I2C1->DR = reg;                           // Send register address
    while (!(I2C1->SR1 & I2C_SR1_BTF));       // Wait for byte transferred

    // Repeated START for read phase (interrupt-driven from here on)
    I2C1->CR1 |= I2C_CR1_START | I2C_CR1_ACK;
    I2C1->CR2 |= I2C_CR2_ITBUFEN | I2C_CR2_ITEVTEN;  // Enable interrupts
}

void I2C1_EV_IRQHandler(void) {
    uint32_t sr1 = I2C1->SR1;
    if (sr1 & I2C_SR1_SB) {
        I2C1->DR = BME280_ADDR | 1;           // Repeated START sent: Address + Read
    } else if (sr1 & I2C_SR1_ADDR) {
        (void)I2C1->SR2;                      // Clear ADDR flag, reception begins
    } else if (sr1 & I2C_SR1_RXNE) {
        i2c_buf[i2c_idx++] = I2C1->DR;
        if (i2c_idx == i2c_len - 1) {         // One byte left: NACK it, then STOP
            I2C1->CR1 &= ~I2C_CR1_ACK;
            I2C1->CR1 |= I2C_CR1_STOP;
        }
        if (i2c_idx >= i2c_len) {             // e.g. 6 bytes for BME280 temp+pressure
            I2C1->CR2 &= ~(I2C_CR2_ITBUFEN | I2C_CR2_ITEVTEN);
            i2c_done = 1;
        }
    }
}
```
UART (Universal Asynchronous Receiver/Transmitter) is the classic serial port. No clock wire — both sides agree on baud rate (e.g., 9600 or 115200 bits/sec). GPS modules output NMEA sentences like $GPGGA,123519,4807.038,N,01131.000,E,1,08,0.9,545.4,M,...
```c
// Ring buffer UART RX for GPS NMEA at 9600 baud
#define RX_BUF_SIZE 256  // Must be power of 2

static volatile uint8_t  rx_buf[RX_BUF_SIZE];
static volatile uint16_t rx_head = 0;  // ISR writes here (free-running counter)
static volatile uint16_t rx_tail = 0;  // App reads from here (free-running counter)

void USART2_IRQHandler(void) {
    if (USART2->SR & USART_SR_RXNE) {
        uint8_t c = USART2->DR;  // Reading DR clears RXNE
        if ((uint16_t)(rx_head - rx_tail) < RX_BUF_SIZE) {  // Drop the byte if full
            rx_buf[rx_head & (RX_BUF_SIZE - 1)] = c;
            rx_head++;
        }
    }
}

uint16_t UART_Available(void) {
    return (uint16_t)(rx_head - rx_tail);  // 0..RX_BUF_SIZE
}

uint8_t UART_ReadByte(void) {
    while (rx_head == rx_tail);  // Block until data available
    uint8_t c = rx_buf[rx_tail & (RX_BUF_SIZE - 1)];
    rx_tail++;
    return c;
}
```
Why a power of 2? Masking with (RX_BUF_SIZE - 1) replaces expensive modulo with a fast bitwise AND. For size 256: idx & 0xFF wraps around automatically. This matters at 115200 baud where the ISR fires every 87 microseconds.
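Once bytes are coming out of the ring buffer, each NMEA sentence should be validated before it is parsed. NMEA sentences end in *HH, where HH is the hexadecimal XOR of every character between the $ and the *. A minimal host-testable sketch follows; nmea_checksum_ok() and hex_value() are hypothetical helper names, and the function expects a NUL-terminated sentence with any trailing CR/LF already stripped.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

/* Convert one hex digit to its value, or -1 if it is not a hex digit. */
static int hex_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* True if s has the form "$<payload>*HH" and HH equals the XOR of the
 * payload characters. */
bool nmea_checksum_ok(const char *s) {
    if (s == NULL || s[0] != '$') return false;

    uint8_t sum = 0;
    size_t i = 1;
    while (s[i] != '\0' && s[i] != '*') {
        sum ^= (uint8_t)s[i++];
    }
    if (s[i] != '*') return false;          /* no checksum field present */

    int hi = hex_value(s[i + 1]);
    if (hi < 0) return false;               /* also catches end of string */
    int lo = hex_value(s[i + 2]);
    if (lo < 0) return false;

    return sum == (uint8_t)((hi << 4) | lo);
}
```

For example, the toy sentence "$AB*03" passes because 'A' XOR 'B' is 0x41 XOR 0x42 = 0x03.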
A battery-powered embedded device that runs at full speed all the time is a dead device in hours. A 240 mAh coin cell at 50 mA lasts 4.8 hours. But the same coin cell at 2 μA lasts 13.7 years. Power management isn't optimization — it's the difference between a viable product and an expensive paperweight.
Cortex-M MCUs provide progressively deeper sleep modes, each trading more functionality for less power:
| Mode | What's Off | What's On | Wake Source | Typical Current |
|---|---|---|---|---|
| Run | Nothing | Everything | N/A (already running) | 30-100 mA |
| Sleep | CPU core | All peripherals, SRAM, clocks | Any interrupt | 5-15 mA |
| Stop | CPU, HSE, PLL, most peripherals | SRAM content, LSI, RTC, wake pins | EXTI line, RTC alarm | 10-30 μA |
| Standby | Everything (SRAM lost!) | Backup domain, wake pin logic | WKUP pin, RTC, IWDG | 1-3 μA |
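```c
// Enter Stop mode, wake on the RTC wakeup timer after wake_seconds
// Assumes the RTC is running and backup-domain write access is already enabled
void Enter_Stop_Mode(uint32_t wake_seconds) {
    // Configure RTC wakeup timer
    RTC->WPR = 0xCA; RTC->WPR = 0x53;        // Unlock RTC write protection
    RTC->CR &= ~RTC_CR_WUTE;                 // Disable wakeup timer
    while (!(RTC->ISR & RTC_ISR_WUTWF));     // Wait for access
    RTC->CR = (RTC->CR & ~RTC_CR_WUCKSEL)
            | RTC_CR_WUCKSEL_2;              // Wakeup clock = ck_spre (1 Hz)
    RTC->WUTR = wake_seconds - 1;            // Set countdown (1 Hz clock)
    RTC->ISR &= ~RTC_ISR_WUTF;               // Clear any stale wakeup flag
    RTC->CR |= RTC_CR_WUTE | RTC_CR_WUTIE;   // Enable timer + interrupt
    RTC->WPR = 0xFF;                         // Re-lock write protection

    // Configure EXTI line 22 (RTC wakeup) for rising edge
    EXTI->PR    = (1 << 22);                 // Clear pending bit
    EXTI->IMR  |= (1 << 22);
    EXTI->RTSR |= (1 << 22);
    NVIC_EnableIRQ(RTC_WKUP_IRQn);           // Handler must clear WUTF and EXTI->PR

    // Enter Stop mode
    SCB->SCR |= SCB_SCR_SLEEPDEEP_Msk;       // Deep sleep (not regular sleep)
    PWR->CR &= ~PWR_CR_PDDS;                 // Stop mode, not Standby
    PWR->CR |= PWR_CR_LPDS;                  // Low-power voltage regulator in Stop
    __WFI();                                 // Wait For Interrupt (CPU stops here)

    // === Execution resumes here after wakeup ===
    SCB->SCR &= ~SCB_SCR_SLEEPDEEP_Msk;      // Clear deep sleep bit
    SystemClock_Config();                    // Reconfigure clocks (PLL was off)
}
```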
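Battery life calculation for a duty-cycled IoT sensor:

$$I_{avg} = \frac{I_{active}\,t_{active} + I_{sleep}\,t_{sleep}}{t_{active} + t_{sleep}}, \qquad t_{life} = \frac{C_{battery}}{I_{avg}}$$

Example: Active 2 seconds at 50 mA, sleep 298 seconds at 2 μA (Stop mode):

$$I_{avg} = \frac{50\ \text{mA} \times 2\ \text{s} + 0.002\ \text{mA} \times 298\ \text{s}}{300\ \text{s}} \approx 0.335\ \text{mA}$$

With the 240 mAh coin cell from the start of this section, that is 240 mAh / 0.335 mA ≈ 716 hours, or about 30 days. The 2 active seconds dominate: they account for more than 99% of the charge used.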
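The duty-cycle arithmetic is easy to get wrong by a factor of 1000 when mixing mA and μA, so it can help to put it in a small function. This is a sketch with hypothetical function names; currents are in μA, times in seconds, capacity in mAh, and battery self-discharge is ignored.

```c
/* Time-weighted average current over one active/sleep cycle, in microamps. */
double avg_current_ua(double t_active_s, double i_active_ua,
                      double t_sleep_s, double i_sleep_ua) {
    return (t_active_s * i_active_ua + t_sleep_s * i_sleep_ua)
         / (t_active_s + t_sleep_s);
}

/* Ideal battery life in hours for a given capacity and average current. */
double battery_life_hours(double capacity_mah, double i_avg_ua) {
    return capacity_mah * 1000.0 / i_avg_ua;  /* mAh -> uAh, then divide by uA */
}
```

For the example in the text, avg_current_ua(2, 50000, 298, 2) is about 335 μA, and a 240 mAh cell then lasts roughly 716 hours.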
On a desktop, you optimize for developer time. On an MCU with 256 KB Flash, 64 KB RAM, and no FPU, you optimize for everything: cycles, bytes, and watts. A function that takes 100 cycles instead of 10 doesn't just run slower — it drains 10x more battery during execution. Optimization here is a survival skill.
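Cortex-M0/M0+ and Cortex-M3 cores have NO floating-point unit (an FPU only becomes an option from the Cortex-M4 up). On those cores a single float multiply compiles to a software library call of roughly 30 cycles. The same operation in fixed-point (integers that represent fractions) takes only a few cycles on a core with a hardware long multiply, such as the Cortex-M3.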
```c
// Q16.16 fixed-point: upper 16 bits = integer, lower 16 bits = fraction
typedef int32_t fixed_t;
#define FIXED_SHIFT 16
#define FLOAT_TO_FIXED(f) ((fixed_t)((f) * (1 << FIXED_SHIFT)))
#define FIXED_TO_FLOAT(x) ((float)(x) / (1 << FIXED_SHIFT))
#define FIXED_MUL(a, b)   ((fixed_t)(((int64_t)(a) * (b)) >> FIXED_SHIFT))

// Example: compute sin(x) using fixed-point Taylor series
// sin(x) ≈ x - x³/6 + x⁵/120 (for small x in radians)
fixed_t fixed_sin(fixed_t x) {
    fixed_t x2 = FIXED_MUL(x, x);
    fixed_t x3 = FIXED_MUL(x2, x);
    fixed_t x5 = FIXED_MUL(x3, x2);
    // Three multiplies and two constant divides: a small fraction of the
    // ~200 cycles a software float sin() costs
    return x - x3 / 6 + x5 / 120;
}
```
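A worked check of the Q16.16 arithmetic, written as functions so it can be run on a PC. The multiply mirrors FIXED_MUL from the lesson; fixed_div() and fixed_from_ratio() are hypothetical additions for illustration. The sketch relies on arithmetic right shift of negative values, which GCC and Clang provide but the C standard leaves implementation-defined.

```c
#include <stdint.h>

typedef int32_t fixed_t;                        /* Q16.16 */
#define FIXED_SHIFT 16
#define FIXED_ONE   ((int64_t)1 << FIXED_SHIFT) /* 1.0 in Q16.16 = 65536 */

/* Multiply: the 64-bit product has 32 fraction bits, shift back to 16. */
static inline fixed_t fixed_mul(fixed_t a, fixed_t b) {
    return (fixed_t)(((int64_t)a * b) >> FIXED_SHIFT);
}

/* Divide: pre-scale the numerator so the quotient keeps 16 fraction bits. */
static inline fixed_t fixed_div(fixed_t a, fixed_t b) {
    return (fixed_t)(((int64_t)a * FIXED_ONE) / b);
}

/* Build a Q16.16 value from an integer ratio without touching floats. */
static inline fixed_t fixed_from_ratio(int32_t num, int32_t den) {
    return (fixed_t)(((int64_t)num * FIXED_ONE) / den);
}
```

As a concrete case, 1.5 is stored as 98304 and 2.25 as 147456; their fixed-point product is 221184, which is exactly 3.375 × 65536.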
Trade Flash for cycles. Pre-compute results and store them in a const array (lives in Flash, costs 0 RAM):
```c
// 256-entry sine table: sin(x) for x = 0..255 mapped to 0..2π
// Values scaled to Q1.15 (multiply result by 1/32768 to get float)
static const int16_t sin_table[256] = {
        0,   804,  1608,  2410,  3212,  4011,  4808,  5602,
     6393,  7179,  7962,  8739,  9512, 10278, 11039, 11793,
    // ... 256 entries total, generated offline ...
    32767, 32757, 32728, /* ... */
};

// Lookup: 1 cycle (table access) vs 200 cycles (float sin)
int16_t fast_sin(uint8_t angle) {
    return sin_table[angle];
}
// Cost: 512 bytes Flash. Savings: ~199 cycles per call.
```
```c
// Naive: branch per iteration (pipeline stall on Cortex-M)
for (int i = 0; i < 16; i++)
    dst[i] = src[i] * gain;

// Unrolled x4: fewer branches, better pipeline utilization
for (int i = 0; i < 16; i += 4) {
    dst[i]   = src[i]   * gain;
    dst[i+1] = src[i+1] * gain;
    dst[i+2] = src[i+2] * gain;
    dst[i+3] = src[i+3] * gain;
}

// Bit manipulation: count set bits (population count)
// Naive: loop through 32 bits = 32 iterations
uint32_t popcount_naive(uint32_t x) {
    uint32_t count = 0;
    while (x) { count += x & 1; x >>= 1; }
    return count;
}

// Optimized: Kernighan's trick, which only loops once per SET bit
uint32_t popcount_fast(uint32_t x) {
    uint32_t count = 0;
    while (x) { x &= x - 1; count++; }  // Clears lowest set bit each iter
    return count;
}
```
| Flag | Optimizes For | Effect |
|---|---|---|
| -O0 | Debuggability | No optimization, 1:1 source mapping |
| -O2 | Speed | Aggressive: inlining, unrolling, scheduling |
| -Os | Size (Flash) | Like -O2 but skips optimizations that increase size |
| -Og | Debug + some speed | Moderate optimization, good debugging |
| -flto | Cross-file optimization | Link-time optimization, removes unused code |

-Os is the best default. It produces code that's 20-40% smaller than -O2 with only 5-10% speed penalty. When your MCU has 64 KB Flash total, those saved bytes matter. Use -O2 only for specific hot functions via __attribute__((optimize("O2"))).
Embedded systems must work for years without human intervention. But bugs happen: infinite loops, null pointer dereferences, stack overflows, corrupted state. On your laptop, the OS kills the process. On an MCU with no OS, a bug means the device hangs forever — unless you've planned for failure.
The Independent Watchdog (IWDG) is a hardware timer that counts down. Your code must periodically "kick" (reset) it before it reaches zero. If your code crashes or hangs, it can't kick the watchdog, the timer expires, and the hardware forces a full system reset. It's a dead man's switch.
```c
// Configure IWDG for ~4 second timeout
// IWDG runs on LSI (32 kHz), independent of system clock
void IWDG_Init(void) {
    IWDG->KR  = 0x5555;  // Enable register access
    IWDG->PR  = 6;       // Prescaler /256 → 32000/256 = 125 Hz
    IWDG->RLR = 500;     // Reload = 500 → timeout = 500/125 = 4 seconds
    IWDG->KR  = 0xCCCC;  // Start watchdog (CANNOT be stopped once started!)
}

// Call this in your main loop. If you don't call it within 4s, reset!
void IWDG_Kick(void) {
    IWDG->KR = 0xAAAA;   // Reload counter (kick the dog)
}

// Typical usage pattern:
int main(void) {
    SystemInit();
    IWDG_Init();
    while (1) {
        read_sensors();   // If this hangs → watchdog fires → reset
        process_data();
        transmit();
        IWDG_Kick();      // "I'm still alive!"
        // NOTE: the IWDG keeps counting in Stop mode. A 300 s sleep only works
        // if the watchdog is frozen in Stop (an option-byte setting on some
        // STM32 families); otherwise wake up and kick within the 4 s timeout.
        Enter_Stop_Mode(300);
    }
}
```
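The timeout arithmetic in the comments above can be wrapped in two small helpers to explore other prescaler and reload combinations. This sketch uses hypothetical function names and the same simplified model as the lesson (timeout = reload × prescaler / LSI frequency); the reload register is 12 bits wide, so 4095 is the largest usable value.

```c
#include <stdint.h>

/* Watchdog timeout in milliseconds for a given prescaler divider and reload. */
uint32_t iwdg_timeout_ms(uint32_t prescaler_div, uint32_t reload, uint32_t lsi_hz) {
    return (uint32_t)(((uint64_t)reload * prescaler_div * 1000u) / lsi_hz);
}

/* Smallest reload value giving at least timeout_ms, or 0 if the request
 * does not fit in the 12-bit reload register at this prescaler. */
uint32_t iwdg_reload_for(uint32_t timeout_ms, uint32_t prescaler_div, uint32_t lsi_hz) {
    uint64_t denom = (uint64_t)prescaler_div * 1000u;
    uint64_t ticks = ((uint64_t)timeout_ms * lsi_hz + denom - 1) / denom;  /* round up */
    return (ticks <= 0x0FFFu) ? (uint32_t)ticks : 0u;
}
```

With a 32 kHz LSI and the /256 prescaler, the longest possible timeout is 4095 × 256 / 32000 ≈ 32.8 seconds, so a 300-second sleep can never be covered by a single kick.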
When the CPU hits an illegal operation (a bad pointer dereference, a bus error, an undefined instruction, a stack that runs off the end of SRAM, or, when the corresponding trap is enabled in SCB->CCR, an unaligned access or divide by zero), it triggers a HardFault exception. The default handler in most startup files is an infinite loop (device hangs). A good HardFault handler logs the fault information for debugging:
```c
// HardFault handler that captures useful debug info
// The real HardFault_Handler must be a small assembly wrapper (not shown) that
// passes the active stack pointer to this function as `frame`
typedef struct {
    uint32_t r0, r1, r2, r3, r12, lr, pc, psr;
} StackFrame_t;

void HardFault_Handler_C(StackFrame_t *frame) {
    // frame->pc = the instruction that caused the fault
    // frame->lr = the return address (who called the faulting function)
    volatile uint32_t cfsr  = SCB->CFSR;   // Configurable Fault Status Register
    volatile uint32_t hfsr  = SCB->HFSR;   // HardFault Status Register
    volatile uint32_t mmfar = SCB->MMFAR;  // MemManage Fault Address
    volatile uint32_t bfar  = SCB->BFAR;   // Bus Fault Address

    // Decode fault type
    if (cfsr & 0x0001)     { /* IACCVIOL: instruction access violation */ }
    if (cfsr & 0x0002)     { /* DACCVIOL: data access violation (null ptr?) */ }
    if (cfsr & 0x0800)     { /* UNSTKERR: bus fault while unstacking (corrupted stack?) */ }
    if (cfsr & 0x02000000) { /* DIVBYZERO: divide by zero */ }

    // Log to backup SRAM (survives reset) for post-mortem debugging
    // (backup SRAM clock and backup-domain write access must be enabled at startup)
    *(volatile uint32_t*)0x40024000 = frame->pc;  // Faulting PC
    *(volatile uint32_t*)0x40024004 = cfsr;       // Fault type

    NVIC_SystemReset();  // Reset and hope for the best
}
```
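When reading a logged CFSR value after the fact, a table-driven decoder is less error-prone than a chain of hex masks. The sketch below lists a subset of the ARMv7-M CFSR bits by their architectural names; cfsr_bit_t and cfsr_decode() are hypothetical names, and the function only collects names so it can run on a PC as well as on the target.

```c
#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint32_t    mask;
    const char *name;
} cfsr_bit_t;

/* Subset of CFSR bits: MemManage (bits 0-7), BusFault (8-15), UsageFault (16-31). */
static const cfsr_bit_t cfsr_bits[] = {
    { 1u << 0,  "IACCVIOL"    },
    { 1u << 1,  "DACCVIOL"    },
    { 1u << 8,  "IBUSERR"     },
    { 1u << 9,  "PRECISERR"   },
    { 1u << 10, "IMPRECISERR" },
    { 1u << 11, "UNSTKERR"    },
    { 1u << 12, "STKERR"      },
    { 1u << 16, "UNDEFINSTR"  },
    { 1u << 17, "INVSTATE"    },
    { 1u << 24, "UNALIGNED"   },
    { 1u << 25, "DIVBYZERO"   },
};

/* Store the names of all set bits (up to max) in out[]; return how many. */
size_t cfsr_decode(uint32_t cfsr, const char *out[], size_t max) {
    size_t n = 0;
    for (size_t i = 0; i < sizeof cfsr_bits / sizeof cfsr_bits[0] && n < max; i++) {
        if (cfsr & cfsr_bits[i].mask) {
            out[n++] = cfsr_bits[i].name;
        }
    }
    return n;
}
```

A logged value of 0x02000000, for instance, decodes to the single name DIVBYZERO.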
Let's design a complete IoT product: an environmental monitoring node that measures temperature, humidity, and pressure, then transmits data wirelessly every 5 minutes. It must run for 2+ years on a single battery. This is a real product architecture used in agriculture, building management, and industrial monitoring.
| Component | Interface | MCU Pins | Speed |
|---|---|---|---|
| BME280 | I2C1 | PB6 (SCL), PB7 (SDA) | 400 kHz |
| SX1276 | SPI1 | PA5 (SCK), PA6 (MISO), PA7 (MOSI), PA4 (NSS) | 10 MHz |
| SX1276 DIO0 | EXTI | PC4 (TX done interrupt) | N/A |
| Status LED | GPIO | PA0 (active low) | N/A |
```c
// Main firmware loop: complete IoT sensor node
typedef struct {
    int16_t  temperature;  // 0.01 °C resolution (2345 = 23.45°C)
    uint16_t humidity;     // 0.01 %RH resolution
    uint32_t pressure;     // Pa (101325 = 1013.25 hPa)
    uint16_t battery_mv;   // Battery voltage in mV
    uint16_t seq_num;      // Packet sequence number
} SensorPacket_t;          // 12 bytes total

extern volatile uint8_t sx1276_tx_done;  // Set by the DIO0 EXTI handler

int main(void) {
    HAL_Init();
    SystemClock_Config();  // 80 MHz from HSE+PLL
    GPIO_Init();
    I2C1_Init();           // 400 kHz for BME280
    SPI1_Init();           // 10 MHz for SX1276
    RTC_Init();            // LSE 32.768 kHz crystal
    IWDG_Init();           // 8 second timeout; must be frozen in Stop mode
                           // (option bytes) to survive the 300 s sleep below
    BME280_Init();         // Configure oversampling, filter
    SX1276_Init();         // Configure LoRa: SF7, BW125, CR4/5

    uint16_t seq = 0;
    while (1) {
        // === SENSE ===
        SensorPacket_t pkt;
        BME280_TriggerMeasurement();  // Start forced conversion
        HAL_Delay(50);                // Wait for measurement
        BME280_Read(&pkt.temperature, &pkt.humidity, &pkt.pressure);
        pkt.battery_mv = ADC_ReadBattery();
        pkt.seq_num = seq++;

        // === TRANSMIT ===
        SX1276_Transmit((uint8_t*)&pkt, sizeof(pkt));
        while (!sx1276_tx_done);      // Wait for DIO0 interrupt
        sx1276_tx_done = 0;

        // === SLEEP ===
        SX1276_Sleep();               // Put radio in sleep (1 µA)
        IWDG_Kick();                  // Kick before sleeping
        Enter_Stop_Mode(300);         // Sleep 5 minutes

        // === WAKE (execution resumes here) ===
        SystemClock_Config();         // Restore 80 MHz
        IWDG_Kick();                  // Kick immediately after wake
    }
}
```
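Sending the struct with a raw cast works as long as the receiver shares the same byte order and struct layout. If the gateway is a different architecture, an explicit encoder removes that assumption. The sketch below is one possible approach, not part of the lesson's firmware: sensor_packet_encode() and the little-endian wire format are hypothetical choices.

```c
#include <stdint.h>
#include <stddef.h>

typedef struct {
    int16_t  temperature;  /* 0.01 degC units */
    uint16_t humidity;     /* 0.01 %RH units */
    uint32_t pressure;     /* Pa */
    uint16_t battery_mv;   /* mV */
    uint16_t seq_num;      /* packet sequence number */
} SensorPacket_t;

#define SENSOR_PACKET_WIRE_SIZE 12u

static void put_u16_le(uint8_t *out, uint16_t v) {
    out[0] = (uint8_t)(v & 0xFFu);
    out[1] = (uint8_t)(v >> 8);
}

static void put_u32_le(uint8_t *out, uint32_t v) {
    put_u16_le(out, (uint16_t)(v & 0xFFFFu));
    put_u16_le(out + 2, (uint16_t)(v >> 16));
}

/* Serialize field by field, least significant byte first, so the result
 * does not depend on the compiler's struct padding or the CPU's byte order.
 * Returns the number of bytes written. */
size_t sensor_packet_encode(const SensorPacket_t *p,
                            uint8_t out[SENSOR_PACKET_WIRE_SIZE]) {
    put_u16_le(&out[0],  (uint16_t)p->temperature);
    put_u16_le(&out[2],  p->humidity);
    put_u32_le(&out[4],  p->pressure);
    put_u16_le(&out[8],  p->battery_mv);
    put_u16_le(&out[10], p->seq_num);
    return SENSOR_PACKET_WIRE_SIZE;
}
```

The encoded buffer, rather than the struct itself, would then be handed to the radio driver.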
| Phase | Duration | Current | Charge per Cycle |
|---|---|---|---|
| Wake + Clock config | 5 ms | 10 mA | 0.0139 μAh |
| I2C sensor read | 50 ms | 5 mA | 0.0694 μAh |
| SPI radio TX | 100 ms | 120 mA | 3.333 μAh |
| Stop Mode sleep | 299.845 s | 2 μA | 0.1666 μAh |
| Total per cycle | 300 s | ≈43 μA (average) | 3.583 μAh |
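The power budget can be recomputed from the per-phase rows, which makes it easy to see how a change in one phase moves the total. In this sketch phase_t and the function names are hypothetical; durations are in seconds and currents in μA. Self-discharge and regulator losses are ignored, so real battery life will be shorter.

```c
#include <stddef.h>

typedef struct {
    double duration_s;
    double current_ua;
} phase_t;

/* Total charge drawn over one cycle, in microamp-hours. */
double cycle_charge_uah(const phase_t *phases, size_t n) {
    double uah = 0.0;
    for (size_t i = 0; i < n; i++) {
        uah += phases[i].current_ua * phases[i].duration_s / 3600.0;
    }
    return uah;
}

/* Total duration of one cycle, in seconds. */
double cycle_duration_s(const phase_t *phases, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) s += phases[i].duration_s;
    return s;
}

/* Average current over the cycle, in microamps. */
double cycle_avg_current_ua(const phase_t *phases, size_t n) {
    return cycle_charge_uah(phases, n) * 3600.0 / cycle_duration_s(phases, n);
}

/* Ideal battery life in years for a given capacity. */
double battery_life_years(double capacity_mah, double avg_ua) {
    return capacity_mah * 1000.0 / avg_ua / (24.0 * 365.0);
}
```

Feeding in the four rows of the table gives about 3.583 μAh per cycle and an average of about 43 μA. As an illustration only, a hypothetical 1000 mAh battery would then last roughly 2.65 years; note that the 100 ms radio transmission accounts for over 90% of the charge.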
You now understand the complete embedded firmware stack: from the first byte of the vector table through clock configuration, DMA transfers, peripheral drivers, power optimization, fault handling, and full IoT system design. Let's consolidate with reference tables and a design challenge.
| Peripheral | Enable Clock | Key Registers | Typical Config |
|---|---|---|---|
| GPIO | RCC->AHB1ENR | MODER, ODR, IDR, AFR | Set MODER for mode, AFR for alternate function |
| SPI | RCC->APB1/2ENR | CR1, CR2, DR, SR | Master, 8-bit, CPOL/CPHA, baud prescaler |
| I2C | RCC->APB1ENR | CR1, CR2, DR, SR1, SR2 | 400 kHz, 7-bit addr, interrupt mode |
| UART | RCC->APB1/2ENR | CR1, BRR, DR, SR | BRR = PCLK/baud, enable RXNEIE for IRQ |
| ADC | RCC->APB2ENR | CR1, CR2, SQR, DR | 12-bit, single conversion, DMA enable |
| Timer | RCC->APB1/2ENR | CR1, PSC, ARR, CCR | PSC=PCLK/desired-1, ARR=period-1 |
| DMA | RCC->AHB1ENR | CR, PAR, M0AR, NDTR | Channel select, direction, sizes, MINC, CIRC |
| DMA | Stream | Channel | Peripheral |
|---|---|---|---|
| DMA2 | Stream 0 | Ch 0 | ADC1 |
| DMA2 | Stream 3 | Ch 3 | SPI1_TX |
| DMA2 | Stream 2 | Ch 3 | SPI1_RX |
| DMA1 | Stream 5 | Ch 4 | USART2_RX |
| DMA1 | Stream 6 | Ch 4 | USART2_TX |
| DMA1 | Stream 0 | Ch 1 | I2C1_RX |
| Aspect | Bare-Metal (this lesson) | RTOS (FreeRTOS, Zephyr) |
|---|---|---|
| Complexity | Low (super-loop) | Medium (tasks, queues, mutexes) |
| RAM overhead | 0 bytes | ~2-8 KB (kernel + task stacks) |
| Timing | Deterministic (you control every cycle) | Preemptive (scheduler decides) |
| Concurrency | ISR + main loop | Multiple tasks + ISR |
| Best for | Simple sensors, tight power budgets | Complex systems (USB + BLE + display + ...) |
| Debug difficulty | Low | High (race conditions, priority inversion) |
You've mastered bare-metal embedded C. The natural next steps:
• RTOS — FreeRTOS or Zephyr for complex multi-task systems
• Hardware design — Schematic + PCB layout to build your own boards
• Wireless protocols — BLE, LoRa, Zigbee, Thread/Matter for IoT
• Motor control — PWM, PID loops, FOC for brushless motors