Encoding & Evolution — DDIA Chapter 5

Chapter 0: The Problem

Your service stores user profiles as JSON. Version 1 looks like this:

json
{
  "userId": 42,
  "name": "Alice",
  "email": "alice@example.com"
}

Today you ship version 2 with a new field:

json
{
  "userId": 42,
  "name": "Alice",
  "email": "alice@example.com",
  "phoneNumber": "+1-555-0123"
}

Seems harmless. But here is the nightmare scenario: you have 50 servers, and you deploy the new code via a rolling deployment — one server at a time, over 30 minutes. For those 30 minutes, some servers run v1 (old code, no phone_number) and some run v2 (new code, expects phone_number). All of them read and write to the same database.

Now consider the chaos:

Writer	Reader	What happens
v2 server writes phone_number	v1 server reads it	v1 sees an unknown field. Does it crash? Ignore it? Silently drop it on the next write?
v1 server writes (no phone_number)	v2 server reads it	v2 expects phone_number. Does it crash? Use a default? Show blank in the UI?
v2 server reads, then writes back	v1 server reads the result	If v1 re-writes the record, does it strip phone_number? Data loss.

And it gets worse. This isn't just about databases. The same version mismatch happens with:

REST APIs — old mobile clients calling new servers (or vice versa)
Message queues — a message written by v2 consumed by a v1 worker hours later
Batch jobs — a Spark job reading data written years ago in a format that no longer exists

The core question. How do you change the shape of your data over time without breaking systems that are still using the old shape? This problem is called schema evolution, and solving it properly is one of the most underrated skills in distributed systems engineering. The encoding format you choose determines whether evolution is painful or painless.

Watch the Rolling Deployment

The simulation below shows a rolling deployment in progress. Six servers are being upgraded from v1 to v2 one at a time. All of them read and write to the same database. Watch how old and new code coexist, and notice the window of danger when both versions are active simultaneously.

[Interactive simulation: "Rolling Deployment: Old and New Code Coexist"]

During the upgrade window, every database write is a potential compatibility issue. A v2 server writes a record with phone_number. Moments later a v1 server reads it, ignores the unknown field, then writes the record back without it, erasing phone_number for good. Nothing crashes and nothing is logged, which is what makes this kind of silent data loss so hard to catch in production.
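A minimal sketch of that failure, assuming a hypothetical v1 service that maps each record onto a fixed class (`UserV1` and `v1_update_email` are illustrative names, not taken from any real codebase):

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class UserV1:
    """The v1 code's view of a record: it only knows three fields."""
    userId: int
    name: str
    email: str

def v1_update_email(stored: str, new_email: str) -> str:
    """Read-modify-write as naive v1 code might do it (hypothetical)."""
    raw = json.loads(stored)
    # Only the fields v1 knows about survive the trip through the dataclass.
    user = UserV1(userId=raw["userId"], name=raw["name"], email=raw["email"])
    user.email = new_email
    return json.dumps(asdict(user))

# A v2 server wrote this record, including the new field.
written_by_v2 = json.dumps({
    "userId": 42, "name": "Alice",
    "email": "alice@example.com", "phoneNumber": "+1-555-0123",
})

after_v1 = json.loads(v1_update_email(written_by_v2, "alice@new.example"))
print("phoneNumber" in after_v1)  # False: the field is gone and no error was raised
```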

Two Kinds of Compatibility

To survive rolling deployments, your data format must support two properties:

Forward compatibility: old code can read data written by new code. The v1 server encounters phone_number. It doesn't know what phone_number is, but it doesn't crash — it ignores the unknown field and preserves it when writing back.

Backward compatibility: new code can read data written by old code. The v2 server reads a record without phone_number. It doesn't crash — it fills in a sensible default (null, or an empty string) and continues.

Some encoding formats make both of these easy. Some make them impossible. That is the subject of this lesson.
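With plain JSON, both properties have to be implemented by hand. The sketch below shows one way to do it, assuming records are handled as dictionaries; `V2_DEFAULTS`, `update_email_preserving`, and `v2_read` are hypothetical names chosen for illustration.

```python
import json

V2_DEFAULTS = {"phoneNumber": None}  # assumed default for the field v2 added

def update_email_preserving(stored: str, new_email: str) -> str:
    """Forward-compatible write: change one key, carry every other key through."""
    record = json.loads(stored)   # keep the whole dict, known fields or not
    record["email"] = new_email
    return json.dumps(record)

def v2_read(stored: str) -> dict:
    """Backward-compatible read: fill defaults for fields old writers omit."""
    return {**V2_DEFAULTS, **json.loads(stored)}

new_record = json.dumps({"userId": 42, "name": "Alice",
                         "email": "a@example.com", "phoneNumber": "+1-555-0123"})
old_record = json.dumps({"userId": 42, "name": "Alice", "email": "a@example.com"})

kept = json.loads(update_email_preserving(new_record, "b@example.com"))
print(kept["phoneNumber"])                 # the unknown field survives the write
print(v2_read(old_record)["phoneNumber"])  # the missing field reads as the default
```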

Concept check: During a rolling deployment, a v2 server writes a user record with a new "phone_number" field. A v1 server later reads that record, updates the user's email, and writes it back. What is the most dangerous outcome if the encoding format does not support forward compatibility?

The v1 server crashes when it encounters the unknown field The v1 server ignores the field but logs a warning The v1 server silently drops phone_number when it writes the record back, causing permanent data loss that no one notices until a user complains

Chapter 1: In-Memory vs On-Wire

Before we compare JSON vs Protobuf vs Avro, we need to understand why encoding exists at all. The answer comes down to two fundamentally different representations of data.

Two Worlds

In memory, data lives as objects, structs, hash maps, arrays, and trees. These structures are optimized for CPU access: pointers let you jump to any field in O(1), data can reference other data via shared pointers, and the layout is designed for the memory hierarchy (cache lines, page alignment). A Python dict, a Java HashMap, a Go struct — these are all in-memory representations.

On the wire (or on disk), data must become a flat sequence of bytes. You cannot send a pointer over the network — the receiving process has a completely different memory space. The byte sequence must be self-contained: it carries all the information needed to reconstruct the original data, with no external references.

The translation from in-memory to bytes is called encoding (also called serialization or marshalling). The reverse — bytes back to in-memory objects — is decoding (also called deserialization, parsing, or unmarshalling).

Why not just dump the raw memory? You might think: "just memcpy the struct and send it." This fails for many reasons: (1) pointer values are meaningless to the receiver, (2) different languages lay out objects differently, (3) different CPU architectures have different endianness (byte order), (4) the receiver might be running a different version of the code with a different struct definition. Encoding exists precisely to abstract away these machine-level details and produce a portable, version-independent representation.
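Reason (3) is easy to see with Python's standard `struct` module. The sketch packs the same 32-bit integer in both byte orders, then decodes one with the wrong assumption:

```python
import struct

value = 1337
little = struct.pack("<I", value)  # little-endian unsigned 32-bit
big = struct.pack(">I", value)     # big-endian unsigned 32-bit

print(little.hex())  # 39050000
print(big.hex())     # 00000539

# Decoding with the wrong byte order yields a different number, not an error.
misread = struct.unpack(">I", little)[0]
print(misread == value)  # False
```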

Language-Specific Formats: A Cautionary Tale

Many programming languages offer built-in serialization: Java has java.io.Serializable, Python has pickle, Ruby has Marshal. These are convenient — you can serialize any in-memory object in one line of code. But they have severe problems:

Problem	Why it matters
Language lock-in	A Python pickle can only be read by Python. If your analytics team uses Java or your mobile app uses Kotlin, they cannot read the data.
Security	Deserializing arbitrary bytes can execute code. Python's pickle.loads() will happily run `os.system("rm -rf /")` if the attacker crafted the bytes. This is not hypothetical — pickle deserialization attacks are a standard exploit class.
Versioning	If you add a field to a class and try to deserialize old data, you often get cryptic errors. There is no built-in mechanism for forward or backward compatibility.
Performance	These formats are often slower and produce larger outputs than purpose-built alternatives like Protobuf.

The rule. Never use language-specific serialization for anything that crosses a process boundary — not for APIs, not for databases, not for message queues. Not even for "temporary" caches that "only our Python service reads." Systems evolve. Languages change. Use a language-independent format.
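The security problem can be shown without doing anything destructive. In this sketch, `Payload` is a hypothetical stand-in for attacker-controlled data; its `__reduce__` method tells pickle which callable to invoke at load time. A harmless built-in is used here, where an attacker would choose something like `os.system`.

```python
import pickle

class Payload:
    """Stand-in for attacker-crafted data (hypothetical)."""
    def __reduce__(self):
        # pickle stores this callable and its arguments, then calls it on load.
        return (len, ("abc",))

crafted_bytes = pickle.dumps(Payload())
result = pickle.loads(crafted_bytes)
print(result)  # 3: loads() called len("abc") instead of rebuilding a Payload
```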

The Encoding Pipeline

The simulation below shows an in-memory object being encoded into bytes, transmitted, and decoded back into an in-memory object. Notice the information that must be preserved (field names, types, values) and the information that is lost (memory addresses, method definitions, private internal state).

[Interactive simulation: "The Encoding Pipeline: Object → Bytes → Object", stepping through Encode, Transmit, and Decode]

What Gets Preserved, What Gets Lost

Information	In Memory	On Wire
Field names and values	Yes	Yes (must be preserved)
Data types (int, string, bool)	Yes (implicit from language)	Depends on format
Memory addresses (pointers)	Yes	No (meaningless to receiver)
Methods / class behavior	Yes	No (receiver has its own code)
Schema / version info	Implicit	Depends on format

The encoding format you choose determines which rows end in "Yes" and which end in "Depends." JSON preserves field names as strings but loses precise types (is 42 an int32 or an int64?). Protobuf preserves field tags (numbers, not names) and wire types, but requires a schema to recover names and precise types. Avro puts neither names nor tags in the bytes; field names and types live in the schema, so the reader needs the writer's schema at read time. Each choice has consequences for compatibility, performance, and usability.

Debug scenario: A team serializes ML model weights using Python pickle and stores them in S3. Six months later, the ML platform team migrates from Python 3.9 to 3.11, and the research team switches from a custom Tensor class to PyTorch tensors. Old pickled models no longer load. What is the root cause, and what format should they have used?

S3 corrupted the bytes during storage
Pickle encodes the class definition path (e.g., mylib.Tensor). When the class was renamed or moved, deserialization fails because pickle tries to import the exact original class. They should use a language-independent format with an explicit schema like SafeTensors, ONNX, or Protobuf
Python 3.11 changed the pickle wire format

Chapter 2: JSON, XML, CSV

If language-specific formats are a trap, what should you use? The first answer most developers reach for is a human-readable text format: JSON, XML, or CSV. These formats are widely supported, easy to debug (you can open them in a text editor), and language-independent. But they each have subtle problems that bite you at scale.

JSON: The Lingua Franca

JSON (JavaScript Object Notation) is the dominant format for web APIs. Its strengths are real: every language has a JSON library, it is human-readable, and it supports nested objects and arrays. But JSON has three landmines that every systems engineer must know about.

Landmine 1: Number ambiguity. JSON does not distinguish between integers and floating-point numbers. The value 42 might be parsed as a 32-bit int, a 64-bit int, or a 64-bit float — it depends entirely on the parser implementation. This matters enormously for large integers.

The 2⁵³ problem. JavaScript's Number type is an IEEE 754 double-precision float with a 53-bit significand. Integers up to 2⁵³ (9,007,199,254,740,992) are represented exactly; above that, not every integer is representable, so a value may be silently rounded to a neighbor when parsed by a JavaScript client. If your database assigns 64-bit integer IDs, a JavaScript front-end can silently corrupt them. Twitter famously hit this bug: their tweet IDs exceeded 2⁵³, and JavaScript clients started returning wrong IDs. The fix? Send large integers as strings in JSON. Twitter's API returns both "id": 12345678901234567 and "id_str": "12345678901234567".
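The rounding can be reproduced in Python by forcing an integer through a double, which is what a double-based JSON parser effectively does. (Python's own `json` module parses integers exactly, so the float conversion here is a deliberate simulation.)

```python
import json

limit = 2 ** 53
big_id = limit + 1            # 9007199254740993, not representable as a double

as_double = float(big_id)     # what a double-based parser would hold
print(int(as_double))         # 9007199254740992
print(int(as_double) == big_id)  # False: the ID changed silently

# The workaround described above: ship the ID as a string alongside the number.
payload = json.dumps({"id": big_id, "id_str": str(big_id)})
print(json.loads(payload)["id_str"])  # 9007199254740993
```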

Landmine 2: No binary support. JSON can only represent text. If you need to embed binary data (images, encrypted blobs, compressed data), you must encode it as Base64, which turns every 3 bytes into 4 characters and so inflates the data by ~33%. A 1 MB image becomes ~1.33 MB in the JSON payload.
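The overhead is easy to measure with the standard library. The blob below is just zero bytes standing in for real binary data:

```python
import base64

blob = bytes(3000)                   # 3,000 bytes of placeholder binary data
encoded = base64.b64encode(blob)     # 4 output characters per 3 input bytes

print(len(encoded))                  # 4000
print(len(encoded) / len(blob))      # about 1.33
```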

Landmine 3: No schema. JSON is schemaless by default. A field can be a string in one record and a number in another. There is no enforcement mechanism built into the format. JSON Schema exists, but it is a separate validation layer that each producer and consumer must choose to run; there is nothing in the JSON bytes that says "this field must be an integer."

XML: Powerful but Verbose

XML (Extensible Markup Language) was the dominant data format before JSON took over. It supports schemas (XSD), namespaces, attributes vs. elements, and XSLT transformations. But it is verbose, since every value is wrapped in opening and closing tags, and the schema language (XSD) is notoriously complex. A simple user profile in XML is often 1.5-3x the size of the equivalent JSON, depending on tag names and whitespace.

CSV: Fragile at Best

CSV (Comma-Separated Values) is barely a format at all. There is no official standard (RFC 4180 is a best-effort attempt). There is no schema. There is no type information. If a value contains a comma, you quote it. If a value contains a quote, you escape it. If a value contains a newline... it depends on the parser. CSV is fine for exporting spreadsheets. It is not a serious option for inter-service communication.
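The quoting rules are where hand-rolled CSV code tends to go wrong. This sketch writes one awkward row with Python's `csv` module, then compares a naive `split(",")` against a real parser; the row contents are made up for illustration.

```python
import csv
import io

row = [42, 'Alice "Al" Smith', "12 Main St, Apt 4\nSpringfield"]

buf = io.StringIO()
csv.writer(buf).writerow(row)
text = buf.getvalue()

# Naive parsing: take the first line and split on commas.
naive = text.split("\n")[0].split(",")
# Proper parsing: the csv module understands quotes and embedded newlines.
parsed = next(csv.reader(io.StringIO(text)))

print(len(naive))   # 4 fragments for a 3-field row
print(parsed)       # three fields, but 42 comes back as the string '42'
```

Even the correct parser returns every value as a string, which is the missing type information mentioned above.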

Size Comparison

The simulation below encodes the same data in JSON, XML, and CSV. Watch the byte counts. Also notice where JSON silently corrupts a large integer.

[Interactive simulation: "The Same Data in Three Formats", with a toggle between a normal integer ID (42) and an ID above 2^53]

Worked Example: Encoding a User Profile

json
// JSON: 56 bytes
{"userId":42,"name":"Alice","email":"alice@example.com"}

xml
<!-- XML: 92 bytes, counting the indentation and newlines shown -->
<user>
  <userId>42</userId>
  <name>Alice</name>
  <email>alice@example.com</email>
</user>

csv
# CSV: 26 bytes (but no schema, no types, no nesting)
42,Alice,alice@example.com

CSV is smallest, but it cannot represent nested data, has no type information, and breaks if any field contains a comma. XML is largest because of the tag overhead. JSON is the sweet spot for human-readability, but still suffers from the type ambiguity and size overhead we discussed.
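Byte counts like these are worth measuring rather than estimating. A small sketch that counts the UTF-8 bytes of the three encodings exactly as written in the worked example (compact JSON, XML with its indentation and newlines, CSV without a header row):

```python
import json

user = {"userId": 42, "name": "Alice", "email": "alice@example.com"}

as_json = json.dumps(user, separators=(",", ":"))  # no spaces after , or :
as_xml = (
    "<user>\n"
    "  <userId>42</userId>\n"
    "  <name>Alice</name>\n"
    "  <email>alice@example.com</email>\n"
    "</user>"
)
as_csv = "42,Alice,alice@example.com"

sizes = {
    label: len(text.encode("utf-8"))
    for label, text in [("JSON", as_json), ("XML", as_xml), ("CSV", as_csv)]
}
print(sizes)
```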

When JSON is the right choice. JSON is still the best choice for public-facing APIs where human readability matters, where clients are diverse (browsers, mobile apps, third-party integrations), and where bandwidth is not the bottleneck. The problems with JSON only matter at scale — millions of messages per second, precise type requirements, or tight bandwidth constraints. For everything else, JSON is fine.

Design question: You are building an API that returns financial transaction IDs. These IDs are 64-bit integers assigned by a Postgres BIGSERIAL column. Your frontend is a React app. A colleague says "just return them as JSON numbers." Why is this dangerous, and what should you do instead?

JSON numbers are always 32-bit, so they overflow
JSON is too slow for financial data
JavaScript parses JSON numbers as IEEE 754 doubles, which have only 53 bits of integer precision. Any transaction ID above 2^53 (about 9 quadrillion) may silently round to a nearby representable value, producing wrong IDs. Return the ID as a string: "transactionId": "9007199254740993" and parse it with BigInt on the client

Chapter 3: Binary Encoding

JSON is human-readable but wastes bytes. Every field name is repeated in full text for every single record. The key "favoriteNumber" costs 16 bytes (14 characters plus two quotes), and you pay that cost in every message, every record, millions of times per day. Binary encoding formats solve this by replacing text with compact binary representations.

MessagePack: Binary JSON

MessagePack is the simplest binary format. It is essentially JSON with a binary wire format: field names are still stored as strings, but values are encoded more compactly (integers use the smallest fixed-width form that fits, strings carry explicit lengths instead of quote delimiters). A JSON object like {"userName":"Martin","favoriteNumber":1337} shrinks from 43 bytes to 35 bytes.

MessagePack is a quick win — smaller than JSON, faster to parse — but it still carries full field names. The real savings come from formats that eliminate field names entirely.
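To see where MessagePack's bytes go, here is a hand-rolled encoder for a tiny subset of the format: maps with fewer than 16 entries, strings under 32 bytes, and unsigned integers below 65,536. It is a sketch for counting bytes, not a replacement for a real MessagePack library, and `pack` is a name chosen for this example.

```python
import struct

def pack(obj) -> bytes:
    """Encode a small subset of MessagePack (sketch only)."""
    if isinstance(obj, dict) and len(obj) < 16:
        out = bytes([0x80 | len(obj)])               # fixmap: 1000xxxx
        for key, value in obj.items():
            out += pack(key) + pack(value)
        return out
    if isinstance(obj, str):
        data = obj.encode("utf-8")
        if len(data) < 32:
            return bytes([0xA0 | len(data)]) + data  # fixstr: 101xxxxx
    if isinstance(obj, int) and not isinstance(obj, bool) and obj >= 0:
        if obj < 0x80:
            return bytes([obj])                      # positive fixint
        if obj < 0x100:
            return b"\xcc" + bytes([obj])            # uint 8
        if obj < 0x10000:
            return b"\xcd" + struct.pack(">H", obj)  # uint 16, big-endian
    raise NotImplementedError(f"not covered by this sketch: {obj!r}")

encoded = pack({"userName": "Martin", "favoriteNumber": 1337})
print(encoded.hex())
print(len(encoded))  # 35
```

Of those 35 bytes, 24 are the two field names and their length prefixes, which is the overhead tag-based formats remove.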

Protocol Buffers and Thrift: Field Tags Replace Names

Protocol Buffers (Protobuf, created at Google) and Apache Thrift (created at Facebook) take a different approach. Instead of encoding field names, they assign each field a numeric tag in a schema definition. The schema is shared between sender and receiver ahead of time. The wire format only carries the tag numbers, not the names.

protobuf
// Protocol Buffers schema (.proto file)
message Person {
  string user_name = 1;      // tag 1
  int64 favorite_number = 2;  // tag 2
}

thrift
// Thrift schema (.thrift file)
struct Person {
  1: string userName,
  2: i64 favoriteNumber,
}

When this message is encoded, the bytes contain (tag=1, type=string, value="Martin") and (tag=2, type=varint, value=1337). The tag is a small integer (1-2 bytes), not a 14-character string. This saves enormous amounts of space when you have millions of records.

Varint Encoding: How Protobuf Packs Integers

Protobuf does not use fixed-width integers. The number 1 should not cost 8 bytes just because the schema says int64. Instead, Protobuf uses varint encoding: each byte uses 7 bits for data and 1 bit (the MSB, most significant bit) as a continuation flag. If the MSB is 1, more bytes follow. If it is 0, this is the last byte.

// Encoding the integer 300 as a varint:
300 in binary: 100101100
Split into 7-bit groups (from right): 0101100 | 0000010
Add MSB continuation flags:
First byte: 1|0101100 = 0xAC (MSB=1: more bytes follow)
Second byte: 0|0000010 = 0x02 (MSB=0: last byte)
Wire bytes: [0xAC, 0x02] = 2 bytes

// Compare: a fixed 64-bit integer would be 8 bytes
// Varint for 1: just [0x01] = 1 byte
// Varint for 300: [0xAC, 0x02] = 2 bytes
// Varint for 1,000,000: [0xC0, 0x84, 0x3D] = 3 bytes

For signed integers, Protobuf offers the sint32 and sint64 types, which use ZigZag encoding to map signed values to unsigned varints: 0 → 0, -1 → 1, 1 → 2, -2 → 3, 2 → 4, and so on. This keeps small negative numbers small. Plain int32 and int64 fields do not use ZigZag, so a negative value such as -1 is sign-extended to 64 bits and encodes as a 10-byte varint.

zigzag(n) = (n << 1) ^ (n >> 63) // for 64-bit signed
// -1 → (0xFFFFFFFFFFFFFFFE ^ 0xFFFFFFFFFFFFFFFF) = 1
// 1 → (0x0000000000000002 ^ 0x0000000000000000) = 2

Thrift: Two Wire Formats

Thrift offers two encoding formats with different trade-offs:

Format	Field header	Integer encoding	Our example
BinaryProtocol	type (1 byte) + tag (2 bytes) = 3 bytes	Fixed-width (i64 = 8 bytes always)	25 bytes
CompactProtocol	type+tag in 1-2 bytes (delta encoding)	Varint + ZigZag	12 bytes

CompactProtocol packs the field type and tag delta into a single byte when the tag delta is small (which it usually is, since tags are sequential). This makes it almost as compact as Protobuf.
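Both layouts can be assembled by hand to check the byte counts. The sketch below assumes the usual Thrift type codes (11 for string and 10 for i64 in BinaryProtocol; 8 for binary/string and 6 for i64 in CompactProtocol) and a one-byte stop marker at the end of the struct. It builds the bytes directly rather than calling the Thrift library.

```python
import struct

def varint(n: int) -> bytes:
    """Unsigned LEB128-style varint, as used by CompactProtocol."""
    out = bytearray()
    while n > 0x7F:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)

def zigzag(n: int) -> int:
    return (n << 1) ^ (n >> 63)

name = "Martin".encode("utf-8")

binary = (
    struct.pack(">bh", 11, 1) + struct.pack(">i", len(name)) + name  # field 1: string
    + struct.pack(">bh", 10, 2) + struct.pack(">q", 1337)            # field 2: i64
    + b"\x00"                                                         # stop marker
)
compact = (
    bytes([(1 << 4) | 8]) + varint(len(name)) + name  # tag delta 1, type 8
    + bytes([(1 << 4) | 6]) + varint(zigzag(1337))    # tag delta 1, type 6
    + b"\x00"                                          # stop marker
)

print(len(binary), len(compact))  # 25 12
```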

Interactive Binary Encoder

The simulation below encodes the same record in four formats. Watch how field names disappear in Protobuf and Thrift, replaced by tiny tag numbers. Compare the final byte counts.

[Interactive simulation: "Binary Encoding Comparison" for the record {"userName": "Martin", "favoriteNumber": 1337}]

Worked Example: Byte-by-Byte Protobuf Encoding

Let us encode {"userName": "Martin", "favoriteNumber": 1337} in Protobuf, byte by byte:

// Field 1: user_name = "Martin" (tag=1, wire type=2 (length-delimited))
Byte 1: 0x0A = (1 << 3) | 2 = tag 1, wire type 2
Byte 2: 0x06 = length 6 (string is 6 bytes)
Bytes 3-8: 0x4D 0x61 0x72 0x74 0x69 0x6E = "Martin" in UTF-8

// Field 2: favorite_number = 1337 (tag=2, wire type=0 (varint))
Byte 9: 0x10 = (2 << 3) | 0 = tag 2, wire type 0
Bytes 10-11: 0xB9 0x0A = varint encoding of 1337

// Total: 11 bytes. Compare: the JSON was 43 bytes.

Eleven bytes, against 43 for the JSON version: roughly a 4x reduction for a single tiny record. At a million records per second, that is about 32 MB/s of saved bandwidth. At a billion records in a database, that is about 32 GB of saved storage.
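The byte-by-byte walk-through can be reproduced in a few lines. This sketch hand-encodes the two fields following the wire layout described above; it does not use the protobuf library, handles only non-negative integers, and `encode_person` and `field_key` are names invented for the example.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a varint."""
    out = bytearray()
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)

def field_key(tag: int, wire_type: int) -> bytes:
    """The key that precedes every field: (tag << 3) | wire_type."""
    return encode_varint((tag << 3) | wire_type)

def encode_person(user_name: str, favorite_number: int) -> bytes:
    name = user_name.encode("utf-8")
    return (
        field_key(1, 2) + encode_varint(len(name)) + name   # length-delimited
        + field_key(2, 0) + encode_varint(favorite_number)  # varint
    )

encoded = encode_person("Martin", 1337)
print(encoded.hex())  # 0a064d617274696e10b90a
print(len(encoded))   # 11
```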

python
# Varint encoder in Python — interview coding drill
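def encode_varint(value):
    """Encode a non-negative integer as a Protobuf varint."""
    if value < 0:
        # Without this guard a negative input would produce a wrong byte silently.
        raise ValueError("expected a non-negative integer; ZigZag-encode signed values first")
    result = []
    while value > 0x7F:
        result.append((value & 0x7F) | 0x80)  # 7 data bits + MSB=1
        value >>= 7
    result.append(value & 0x7F)  # last byte: MSB=0
    return bytes(result)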

def zigzag_encode(value):
    """ZigZag: map signed int to unsigned for efficient varint."""
    return (value << 1) ^ (value >> 63)

# Test
print(encode_varint(1).hex())       # "01"      (1 byte)
print(encode_varint(300).hex())     # "ac02"    (2 bytes)
print(encode_varint(1337).hex())    # "b90a"    (2 bytes)
print(zigzag_encode(-1))             # 1
print(zigzag_encode(1))              # 2
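The reverse direction is a natural follow-up to the drill. A sketch of the matching decoders, assuming well-formed input (no bounds or overflow checks):

```python
def decode_varint(data: bytes, pos: int = 0):
    """Return (value, next_position) for the varint starting at pos."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift   # low 7 bits carry data
        if not (byte & 0x80):              # MSB clear: this was the last byte
            return result, pos
        shift += 7

def zigzag_decode(n: int) -> int:
    """Invert ZigZag: 0 -> 0, 1 -> -1, 2 -> 1, 3 -> -2, ..."""
    return (n >> 1) ^ -(n & 1)

print(decode_varint(bytes.fromhex("ac02")))  # (300, 2)
print(decode_varint(bytes.fromhex("b90a")))  # (1337, 2)
print(zigzag_decode(1))                      # -1
print(zigzag_decode(2))                      # 1
```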

Coding drill: In Protobuf varint encoding, how many bytes does the integer 150 require? Work it out by hand.

1 byte (150 fits in 8 bits)
2 bytes. 150 in binary is 10010110, which is 8 bits, more than 7. Split into 7-bit groups: 0010110 | 0000001. Add MSB flags: 10010110 (0x96) and 00000001 (0x01). Total: 2 bytes [0x96, 0x01]
4 bytes (one byte per decimal digit, plus a terminator)

Chapter 4: Schema Evolution

We have established that Protobuf and Thrift are smaller and faster than JSON. But the real reason to use them is schema evolution. Because these formats use numeric field tags instead of string names, and because they have explicit rules for handling unknown fields, they make it possible to evolve your data format safely over time.

The Five Rules of Protobuf/Thrift Evolution

These rules determine what changes are safe and what changes will break your system. Memorize them for interviews.

Rule 1: You can add new fields with new tag numbers. If you add a field with tag 3 to a schema that only had tags 1 and 2, old code will simply skip tag 3 when it encounters it in the data. The field is unknown, and the decoder skips it using the wire type to determine how many bytes to consume. New code, when reading old data, will see that tag 3 is missing and use the default value.

Rule 2: You can rename fields freely. Field names only exist in the schema definition, not in the binary wire format. The name user_name is purely for programmer convenience. You can rename it to display_name in your .proto file without changing a single byte of encoded data, as long as the tag number stays the same. (Protobuf's JSON mapping does carry field names, so this freedom applies to the binary encoding only.)

Rule 3: You cannot change a field's tag number. The tag IS the field's identity. Moving favorite_number from tag 2 to tag 5 means every existing encoded record still says "tag 2 = 1337," but your new code looks for tag 5 and finds nothing. The value is still in the bytes, but the new code can no longer see it.

Rule 4: Never add a required field after initial deployment. If you add a required field, old code will produce records without it. When new code reads those records, it will reject them because the required field is missing. This breaks backward compatibility. In Protobuf 3, all fields are optional by default — Google learned this lesson the hard way.

Rule 5: You can change types if the wire format is compatible. Changing int32 to int64 is safe — both use varint encoding, and the bits are compatible. But reading a 64-bit value with 32-bit code may truncate (the high bits are silently discarded). Changing int32 to string is not safe — the wire types are different and the decoder will misinterpret the bytes.

Forward vs backward in Protobuf. Forward compatibility (old code reads new data) works because unknown tags are skipped. Backward compatibility (new code reads old data) works because missing fields get defaults. Both properties hold only if you follow the five rules. Break any rule, and one or both properties fail.
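Both mechanisms fit in a short hand-written decoder. The sketch below plays the role of old (v1) code that knows only tags 1 and 2: it uses the wire type to step over anything else, and starts from defaults so missing fields need no special handling. It discards unknown fields, where a real Protobuf runtime may keep them; `decode_person_v1` is a hypothetical name.

```python
def read_varint(data: bytes, pos: int):
    """Return (value, next_position) for the varint starting at pos."""
    result, shift = 0, 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):
            return result, pos
        shift += 7

def decode_person_v1(data: bytes) -> dict:
    person = {"user_name": "", "favorite_number": 0}  # defaults for missing fields
    pos = 0
    while pos < len(data):
        key, pos = read_varint(data, pos)
        tag, wire_type = key >> 3, key & 0x07
        # The wire type alone says how many bytes this field occupies.
        if wire_type == 0:                       # varint
            value, pos = read_varint(data, pos)
        elif wire_type == 1:                     # fixed 64-bit
            value, pos = data[pos:pos + 8], pos + 8
        elif wire_type == 2:                     # length-delimited
            length, pos = read_varint(data, pos)
            value, pos = data[pos:pos + length], pos + length
        elif wire_type == 5:                     # fixed 32-bit
            value, pos = data[pos:pos + 4], pos + 4
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        if tag == 1 and wire_type == 2:
            person["user_name"] = value.decode("utf-8")
        elif tag == 2 and wire_type == 0:
            person["favorite_number"] = value
        # Any other tag has been consumed and is simply ignored.
    return person

# Data from newer code: tags 1 and 2 plus an unknown tag 3 (a string).
v2_bytes = bytes.fromhex("0a064d617274696e10b90a") + b"\x1a\x05a@b.c"
print(decode_person_v1(v2_bytes))
# Data missing tag 2 entirely: the default fills the gap.
print(decode_person_v1(b"\x0a\x06Martin"))
```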

Schema Evolution Visualizer

The simulation below shows two schema versions side by side. Try different evolution operations and see which ones are safe (green) and which ones break compatibility (red). This is exactly the kind of reasoning you must do in a system design interview when discussing API versioning.

[Interactive simulation: "Schema Evolution: Safe vs Breaking Changes". Original schema: 2 fields (tag 1: string, tag 2: int64); green = safe, red = breaks compatibility]

Evolution Rules Summary

Change	Forward compatible?	Backward compatible?	Safe?
Add optional field (new tag)	Yes (old code skips unknown tag)	Yes (new code uses default)	Yes
Add required field	Yes	No (old data missing required field)	No
Remove optional field	Yes (old code finds the field missing and uses the default)	Yes (new code skips the now-unknown tag in old data)	Yes (but never reuse the tag number)
Rename field	Yes (wire uses tags not names)	Yes	Yes
Change tag number	No (old code still looks for the old tag)	No (old data still carries the old tag)	No
int32 → int64	Yes (but may truncate)	Yes	Careful
int32 → string	No (wire type mismatch)	No	No
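The "may truncate" caveat on int32 → int64 is worth seeing with numbers. This sketch models what 32-bit reader code does with a value written as int64: keep the low 32 bits and reinterpret them as signed. `as_int32` is a helper written for this example.

```python
def as_int32(value: int) -> int:
    """Keep the low 32 bits and reinterpret them as a signed int32."""
    low = value & 0xFFFFFFFF
    return low - (1 << 32) if low & 0x80000000 else low

print(as_int32(1337))        # 1337: small values survive
print(as_int32(2**32 + 5))   # 5: the high bits are silently discarded
print(as_int32(2**31))       # -2147483648: the sign flips
```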

The "never reuse tag numbers" rule. When you remove a field, you must mark its tag number as reserved in the schema. If a future developer accidentally reuses that tag for a new field with a different type, old data in the database (which still uses that tag with the old type) will be misinterpreted. Protobuf has a reserved keyword for exactly this purpose: reserved 3, 4; reserved "old_field_name";

Debug scenario: You add a new required string phone_number = 3; field to your Protobuf schema and deploy the new server. Immediately, 30% of requests start failing with deserialization errors. Why, and what is the fix?

The phone_number field is too large for the wire format
Existing records in the database and in-flight messages from old clients do not have tag 3 (phone_number). Because the field is required, the new server's decoder rejects every record that was written before the schema change. The fix: make it optional with a default value. In Protobuf 3, all fields are implicitly optional, which is why Google removed the required keyword entirely
Tag number 3 conflicts with an internal Protobuf reserved number

Chapter 5: Apache Avro

Protobuf and Thrift solve schema evolution with field tags. Apache Avro takes a radically different approach: no tags at all. Fields are matched by name, and the encoded data contains no field identifiers whatsoever — not tags, not names, nothing. Instead, the writer's schema and reader's schema are resolved at read time.

How Avro Encoding Works

An Avro schema looks like this:

json
{
  "type": "record",
  "name": "Person",
  "fields": [
    {"name": "userName", "type": "string"},
    {"name": "favoriteNumber", "type": ["null", "long"], "default": null}
  ]
}

When Avro encodes this record, it writes the fields in schema order with no field identifiers:

// Avro encoding of {userName: "Martin", favoriteNumber: 1337}
Byte 1: 0x0C = varint length of string "Martin" (6, ZigZag-encoded as 12)
Bytes 2-7: 0x4D 0x61 0x72 0x74 0x69 0x6E = "Martin" in UTF-8
Byte 8: 0x02 = union index 1 (the "long" branch of ["null","long"])
Bytes 9-10: 0xF2 0x14 = ZigZag varint of 1337

// Total: 10 bytes. Even smaller than Protobuf's 11 bytes.
// Because: no field tags, no wire type indicators.

But wait — if there are no tags and no names, how does the reader know which bytes correspond to which field? The answer: the reader must have the writer's schema. Without it, the bytes are meaningless — just a stream of values with no field identity.
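The ten bytes above can be reproduced without the Avro library. This is a sketch of the binary layout for this one schema only (a string followed by a ["null", "long"] union); `encode_person` and `zigzag_varint` are names made up for the example.

```python
def zigzag_varint(n: int) -> bytes:
    """Avro encodes every int/long as a ZigZag value in varint form."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)

def encode_person(user_name: str, favorite_number) -> bytes:
    name = user_name.encode("utf-8")
    out = zigzag_varint(len(name)) + name       # string: length, then UTF-8 bytes
    if favorite_number is None:
        out += zigzag_varint(0)                 # union branch 0 (null): no payload
    else:
        out += zigzag_varint(1)                 # union branch 1 (long)
        out += zigzag_varint(favorite_number)
    return out

encoded = encode_person("Martin", 1337)
print(encoded.hex())  # 0c4d617274696e02f214
print(len(encoded))   # 10
```

Nothing in those bytes says which value belongs to which field; only the schema's field order does.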

Schema Resolution: Writer vs Reader

This is Avro's key innovation. The writer and reader can have different schemas, and Avro resolves them at read time by matching fields by name:

Writer Schema (v1)

userName (string), favoriteNumber (long)

↓ bytes encoded in v1 field order

Reader Schema (v2)

userName (string), favoriteNumber (long), email (string, default="")

↓

Resolution

userName: matched by name → read from bytes. favoriteNumber: matched by name → read from bytes. email: not in writer's schema → use default "".

And in reverse:

Writer Schema (v2)

userName, favoriteNumber, email

↓ bytes encoded in v2 field order

Reader Schema (v1)

userName, favoriteNumber

↓

Resolution

userName: matched → read. favoriteNumber: matched → read. email: in writer but not reader → skip those bytes.

The catch. Every field that you might add or remove must have a default value in the schema. If a reader encounters a missing field with no default, Avro throws an error. This is the Avro equivalent of "never add a required field" in Protobuf — but enforced structurally through the type system rather than through a convention.
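The resolution rules can be sketched in plain Python. To keep it short, the example works on already-decoded values in writer order rather than raw bytes, and represents the reader schema as (name, default) pairs; `resolve` and `NO_DEFAULT` are illustrative names, not part of the Avro API.

```python
NO_DEFAULT = object()  # sentinel: the reader schema declares no default

def resolve(writer_fields, values, reader_fields):
    """Match fields by name, fill defaults, skip what the reader does not want."""
    written = dict(zip(writer_fields, values))
    record = {}
    for name, default in reader_fields:
        if name in written:
            record[name] = written[name]          # matched by name
        elif default is not NO_DEFAULT:
            record[name] = default                # missing in data: use default
        else:
            raise ValueError(f"field {name!r} is missing and has no default")
    return record                                 # writer-only fields are dropped

v1_fields = ["userName", "favoriteNumber"]
v2_fields = ["userName", "favoriteNumber", "email"]
reader_v1 = [("userName", NO_DEFAULT), ("favoriteNumber", None)]
reader_v2 = reader_v1 + [("email", "")]

# New reader, old data: email comes from the default.
print(resolve(v1_fields, ["Martin", 1337], reader_v2))
# Old reader, new data: email is skipped.
print(resolve(v2_fields, ["Martin", 1337, "m@example.com"], reader_v1))
```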

How Does the Reader Get the Writer's Schema?

Three strategies, each suited to a different dataflow pattern:

Strategy	How it works	Best for
Avro container file	The writer's schema is embedded at the beginning of the file. All records in the file use that schema. One schema per file.	Hadoop, data lake files, batch processing
Connection negotiation	Writer and reader exchange schemas during connection setup (Avro RPC protocol).	RPC between long-lived services
Schema registry	Writer stores its schema in a central registry and embeds the schema ID in each message. Reader looks up the schema by ID.	Kafka + Avro (the standard pattern)

Avro Schema Resolution Visualizer

The simulation below shows writer and reader schemas side by side. Watch how Avro matches fields by name, fills in defaults for missing fields, and skips fields the reader does not know about.

[Interactive simulation: "Avro Schema Resolution". Writer schema v1: userName, favoriteNumber]

Why Avro? Dynamic Schema Generation

Avro's killer feature is that schemas can be generated programmatically. Consider a database table:

sql
CREATE TABLE users (
  user_id  BIGINT,
  name     VARCHAR(255),
  email    VARCHAR(255)
);

You can automatically generate an Avro schema from this DDL. If someone runs ALTER TABLE users ADD COLUMN phone TEXT;, a new Avro schema is generated automatically. The old and new schemas coexist through Avro's resolution mechanism. No one has to manually assign tag numbers. No one can accidentally reuse a tag. The database schema is the serialization schema.
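A sketch of what such a generator might look like. The type mapping and the choice to make every column a nullable union with a null default are assumptions made for this example; a real exporter would need to cover far more SQL types.

```python
import json

# Assumed mapping from a few SQL types to Avro primitive types.
SQL_TO_AVRO = {
    "BIGINT": "long", "INTEGER": "int", "BOOLEAN": "boolean",
    "VARCHAR": "string", "TEXT": "string",
}

def avro_schema_for(table: str, columns) -> dict:
    """Build an Avro record schema from (column_name, sql_type) pairs."""
    fields = []
    for name, sql_type in columns:
        base = sql_type.split("(")[0].strip().upper()   # VARCHAR(255) -> VARCHAR
        fields.append({
            "name": name,
            "type": ["null", SQL_TO_AVRO[base]],        # nullable union
            "default": None,                            # so the field can come and go
        })
    return {"type": "record", "name": table, "fields": fields}

before = avro_schema_for("users", [
    ("user_id", "BIGINT"), ("name", "VARCHAR(255)"), ("email", "VARCHAR(255)"),
])
# After ALTER TABLE users ADD COLUMN phone TEXT, rerun with the new column list.
after = avro_schema_for("users", [
    ("user_id", "BIGINT"), ("name", "VARCHAR(255)"),
    ("email", "VARCHAR(255)"), ("phone", "TEXT"),
])
print(json.dumps(after["fields"][-1]))
```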

This is why Avro dominates in the Hadoop ecosystem: data pipelines that export database tables to HDFS can evolve their schemas automatically as the source database evolves.

Avro vs Protobuf: When to Use Which

Property	Avro	Protobuf
Field identity	Names (resolved at read time)	Numeric tags (embedded in bytes)
Schema in bytes?	No — needs writer schema from elsewhere	No — but tags are self-identifying
Dynamic schemas	Excellent (generate from DB DDL)	Poor (must assign tag numbers manually)
Smallest encoding	Yes (no tags, no names in bytes)	Close (1-2 bytes per field for tag+type)
Random access	No (must read in schema order)	Partial (unwanted fields can be skipped by wire type, but the message is still scanned front to back)
Code generation	Optional (reflection works)	Primary workflow (protoc generates stubs)
Best for	Kafka, Hadoop, data pipelines	gRPC, mobile, microservices

python
# Writing and reading Avro with schema evolution
# Requires the third-party "avro" package (pip install avro)
import avro.schema, avro.io
import io

# Writer schema v1
writer_schema = avro.schema.parse("""
{
  "type": "record", "name": "User",
  "fields": [
    {"name": "userName", "type": "string"},
    {"name": "favoriteNumber", "type": ["null", "long"], "default": null}
  ]
}
""")

# Encode a record
buf = io.BytesIO()
encoder = avro.io.BinaryEncoder(buf)
writer = avro.io.DatumWriter(writer_schema)
writer.write({"userName": "Martin", "favoriteNumber": 1337}, encoder)
raw_bytes = buf.getvalue()
print(f"Encoded: {raw_bytes.hex()} ({len(raw_bytes)} bytes)")
# Encoded: 0c4d617274696e02f214 (10 bytes)

# Reader schema v2 — has an extra field with default
reader_schema = avro.schema.parse("""
{
  "type": "record", "name": "User",
  "fields": [
    {"name": "userName", "type": "string"},
    {"name": "favoriteNumber", "type": ["null", "long"], "default": null},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}
""")

# Decode v1 data with v2 reader schema — email gets default
buf = io.BytesIO(raw_bytes)
decoder = avro.io.BinaryDecoder(buf)
reader = avro.io.DatumReader(writer_schema, reader_schema)
record = reader.read(decoder)
print(record)
# {'userName': 'Martin', 'favoriteNumber': 1337, 'email': None}
# email filled with default — backward compatibility works!

Design question: Your data pipeline exports a Postgres table to Avro files in S3 nightly. A developer adds a new NOT NULL column to the table without a default. What breaks, and why?

The S3 upload fails because the file is too large
Nothing breaks, because Avro handles it automatically
Downstream consumers using the old reader schema will encounter a field in the writer schema (the new column) that does not exist in their reader schema. Avro simply skips it, so that part is fine. But the real problem is that consumers using the NEW reader schema to read OLD files will find the column missing AND there is no default (it is NOT NULL with no default). Avro schema resolution will fail because it cannot fill in a missing field without a default. The fix: always add columns with defaults in both the database and the Avro schema

Chapter 6: Schema Registries

We have seen that Avro needs the writer's schema at read time, and that Protobuf/Thrift need the .proto/.thrift files to decode messages. But how do you distribute these schemas across a distributed system with dozens of services? You cannot email .proto files around. You need a schema registry.

What a Schema Registry Does

A schema registry is a centralized service that stores versioned schemas and enforces compatibility rules. The most widely deployed is the Confluent Schema Registry, designed to work with Apache Kafka. Here is the workflow:

1. Producer writes schema

Producer has schema v2 with a new "email" field. Before sending data, it registers the schema with the registry. The registry checks that v2 is compatible with v1 (using the configured compatibility mode). If compatible, the registry assigns schema ID 47.

↓

2. Producer sends data

Each Kafka message is prefixed with a 5-byte header: a magic byte (0x00) followed by the 4-byte schema ID (47). The rest of the message is the Avro-encoded payload — raw bytes, no schema embedded.

↓

3. Consumer reads data

Consumer extracts schema ID 47 from the message header. It looks up schema 47 from the registry (with caching). Now it has the writer's schema and can perform Avro resolution against its own reader schema.

The 5-byte header adds negligible overhead (5 bytes per message) while still letting every consumer find the exact writer schema, which is what schema evolution depends on. Compare this to shipping the entire schema with the data. Avro object container files do that once per file, which is cheap when a file holds many records. A Kafka message, however, is a single small record: embedding the schema in each of millions of messages would often make the schema larger than the payload it describes.
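
To make the framing concrete, here is a minimal sketch of splitting a message into schema ID and payload using only the standard library. It assumes the layout described above (one magic byte, then a 4-byte big-endian schema ID); `split_confluent_frame` is a hypothetical helper name, and real clients do this inside their deserializers.

```python
import struct

MAGIC_BYTE = 0

def split_confluent_frame(message: bytes):
    """Return (schema_id, avro_payload) from a framed message."""
    if len(message) < 5:
        raise ValueError("message shorter than the 5-byte header")
    # ">bI" = big-endian, 1 signed byte (magic) + 4-byte unsigned int (schema ID)
    magic, schema_id = struct.unpack(">bI", message[:5])
    if magic != MAGIC_BYTE:
        raise ValueError(f"unexpected magic byte {magic}")
    return schema_id, message[5:]

# Frame the 10-byte Avro payload from the earlier example with schema ID 47.
payload = bytes.fromhex("0c4d617274696e02f214")
framed = struct.pack(">bI", MAGIC_BYTE, 47) + payload

schema_id, body = split_confluent_frame(framed)
print(schema_id, len(framed), body == payload)  # 47 15 True
```

The consumer would then look up `schema_id` in the registry (or its local cache) and hand `body` to the Avro decoder together with the writer schema it got back.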

Compatibility Modes

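The registry does not just store schemas: it enforces rules about which schema changes are allowed. There are four basic compatibility modes (BACKWARD, FORWARD, and FULL also have _TRANSITIVE variants that check a new schema against all earlier versions rather than only the latest):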

Mode	Rule	What it allows	Use case
BACKWARD	New schema can read old data	Add fields with defaults, remove fields	Consumer upgrades first, then producer
FORWARD	Old schema can read new data	Add fields, remove fields that had defaults	Producer upgrades first, then consumer
FULL	Both backward AND forward	Add/remove optional fields with defaults	Rolling deployments (safest)
NONE	No checks	Anything	Development only (dangerous in prod)

FULL is the safest choice for rolling deployments. (The Confluent Schema Registry itself defaults to BACKWARD, so FULL has to be configured explicitly.) In a microservices environment with rolling deployments, you need both forward and backward compatibility simultaneously. Old consumers must read new data (forward), and new consumers must read old data (backward). FULL compatibility mode enforces both. The only safe schema changes under FULL are: add a field with a default, or remove a field that had a default.
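
The rules behind these modes reduce to one question: can a given reader schema decode data written with a given writer schema? The sketch below models that for field additions and removals only. It is an illustration, not the registry's actual checker: `can_read` and `is_compatible` are hypothetical names, each schema is reduced to a dict of field name to "has a default?", and type changes are ignored.

```python
def can_read(reader, writer):
    """True if `reader` can decode data written with `writer`.

    A reader needs a default for every field the writer does not supply.
    Fields the writer supplies but the reader lacks are simply skipped.
    """
    return all(has_default
               for name, has_default in reader.items()
               if name not in writer)

def is_compatible(new, old, mode):
    backward = can_read(reader=new, writer=old)  # new code reads old data
    forward = can_read(reader=old, writer=new)   # old code reads new data
    return {
        "BACKWARD": backward,
        "FORWARD": forward,
        "FULL": backward and forward,
        "NONE": True,
    }[mode]

v1 = {"userId": False, "name": False}
v2_optional = {**v1, "email": True}   # adds a field with a default
v2_required = {**v1, "age": False}    # adds a field without a default

for mode in ("BACKWARD", "FORWARD", "FULL"):
    print(mode, is_compatible(v2_optional, v1, mode), is_compatible(v2_required, v1, mode))
# BACKWARD True False
# FORWARD True True
# FULL True False
```

Adding a field without a default passes FORWARD (old readers skip it) but fails BACKWARD and therefore FULL, because new readers have nothing to fill in when they meet old data.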

Schema Registry Workflow

The simulation below animates the complete workflow: producer registers schema, registry checks compatibility, producer sends data with schema ID, consumer looks up schema, consumer decodes. Watch what happens when a compatibility violation is attempted.

Schema Registry in Practice

python
# Producing Avro messages with Confluent Schema Registry
from confluent_kafka import SerializingProducer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer

# Connect to registry
registry = SchemaRegistryClient({'url': 'http://schema-registry:8081'})

# Define schema (or load from file)
schema_str = """
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "userId", "type": "long"},
    {"name": "name", "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}
"""

# Create serializer (auto-registers schema with registry)
avro_serializer = AvroSerializer(registry, schema_str)

producer = SerializingProducer({
    'bootstrap.servers': 'kafka:9092',
    'value.serializer': avro_serializer,
})

# Produce a message — schema ID is prepended automatically
producer.produce(
    topic='users',
    value={'userId': 42, 'name': 'Alice', 'email': 'alice@example.com'}
)
producer.flush()

System design: Your team uses Kafka with the Confluent Schema Registry set to FULL compatibility. A developer submits a schema change that adds a new field age of type int with no default value. What happens?

The schema registry rejects the change with a compatibility error. Under FULL mode, new schemas must be both forward AND backward compatible. A new field without a default breaks backward compatibility: new code reading old data will find the field missing and have no default to fill in. The fix: change the type to ["null", "int"] with "default": null, or "int" with "default": 0
The schema is accepted and old consumers crash
The schema is accepted and the age field is silently ignored

Chapter 7: Modes of Dataflow

Schema evolution does not exist in a vacuum. The way data flows between processes determines which compatibility properties matter and which encoding formats are practical. There are three fundamental modes of dataflow, each with different implications for encoding and evolution.

Mode 1: Via Databases

When you write to a database, you are encoding data for your future self — or for a colleague who will read it months or years from now. The writer encodes, and a potentially very different reader decodes. This means you need both forward compatibility (old application reading newly-written data) and backward compatibility (new application reading old data that was written years ago).

The most dangerous scenario: a v1 application reads a record written by v2 (which has extra fields), modifies an unrelated field, and writes it back. If the application does not preserve unknown fields, the v2 fields are silently erased. This is called the read-modify-write hazard, and it is one of the most common sources of silent data loss during rolling upgrades.

Mode 2: Via Service Calls (REST/RPC)

In service-to-service communication, one process (the client) makes a request and another (the server) responds. The client and server may be running different versions of the code.

Protocol	Encoding	Schema enforcement	Evolution
REST + JSON	JSON text	None (OpenAPI is documentation)	Ad-hoc versioning (URL path, headers)
gRPC	Protobuf binary	Strict (.proto file required)	Protobuf evolution rules
Thrift RPC	Thrift binary	Strict (.thrift file required)	Thrift evolution rules
GraphQL	JSON	Schema (SDL)	Add fields freely, deprecate old ones

gRPC (Protobuf over HTTP/2) has become a common default for internal service-to-service communication. Its messages are typically smaller and faster to parse than REST+JSON, it has built-in schema evolution via Protobuf, it supports streaming (server-side, client-side, and bidirectional), and it generates client/server code automatically from .proto files.

REST for public, gRPC for private. REST + JSON is still the right choice for public APIs consumed by third-party developers (human readability, curl-ability, browser compatibility). gRPC is the right choice for internal microservice communication (performance, schema enforcement, code generation). Many companies run both: a gRPC mesh internally, with a REST gateway at the edge.

Mode 3: Via Message Queues

In asynchronous messaging (Kafka, RabbitMQ, SQS), the producer and consumer are fully decoupled. The producer may have been deployed weeks before the consumer. Messages may sit in the queue for hours or days. The consumer may be running code that was written before the new fields even existed.

This is where schema registries shine. The producer registers its schema (with compatibility checks), embeds the schema ID in each message, and the consumer resolves schemas at read time. Without a registry, you are flying blind — every consumer must somehow know which schema the producer was using, and version mismatches cause silent data corruption.

The Big Simulation: All Three Modes

The simulation below shows data flowing through all three modes simultaneously. You can modify the schema (add a field) and watch how the change propagates. Notice where compatibility breaks happen and where the system handles evolution gracefully.

The Read-Modify-Write Hazard (Detailed)

This is so important that it deserves a concrete, step-by-step walkthrough:

// Step 1: v2 server writes a record
DB row = {userId: 42, name: "Alice", email: "alice@ex.com", phone: "+1-555-0123"}

// Step 2: v1 server reads the record
v1 code sees: {userId: 42, name: "Alice", email: "alice@ex.com"}
v1 code does NOT see phone (unknown field).

// Step 3: v1 server modifies email and writes back
v1 code writes: {userId: 42, name: "Alice", email: "newemail@ex.com"}

// Step 4: phone is GONE. Silently erased.
// No error, no log, no alert. Just data loss.

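The fix depends on the encoding format. Protobuf runtimes preserve unknown fields (proto2 always has; proto3 has since release 3.5): when v1 code parses a record with unknown tags, it keeps the raw bytes alongside the message and re-emits them on write. Avro has no equivalent: fields that are absent from the reader schema are discarded during schema resolution, so a service that reads and rewrites Avro records must decode with the writer's schema or be upgraded before the new field is written. This "round-trip safety" is not automatic in all implementations (copying a message into a hand-written model object and back loses the unknown fields, for example), so verify that both your serialization library and your own mapping code support it.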
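
The hazard and one way around it can be modelled with plain dicts. This is a sketch of the idea rather than any library's behaviour: `KNOWN_V1_FIELDS`, `lossy_update`, and `preserving_update` are hypothetical names, and a dict stands in for the stored record.

```python
KNOWN_V1_FIELDS = {"userId", "name", "email"}

def lossy_update(stored, new_email):
    """v1 behaviour: decode only the known fields, then write the model back."""
    model = {k: v for k, v in stored.items() if k in KNOWN_V1_FIELDS}
    model["email"] = new_email
    return model

def preserving_update(stored, new_email):
    """Carry fields v1 does not understand through the round-trip untouched."""
    model = {k: v for k, v in stored.items() if k in KNOWN_V1_FIELDS}
    unknown = {k: v for k, v in stored.items() if k not in KNOWN_V1_FIELDS}
    model["email"] = new_email
    return {**model, **unknown}

row = {"userId": 42, "name": "Alice", "email": "alice@ex.com", "phone": "+1-555-0123"}

print("phone" in lossy_update(row, "newemail@ex.com"))       # False
print("phone" in preserving_update(row, "newemail@ex.com"))  # True
```

The second function is the dict equivalent of a side buffer for unknown fields: the old code never interprets `phone`, it only promises to hand it back.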

System design: You are building a microservices platform with 50 services. Internal communication uses gRPC. Event streaming uses Kafka + Avro. External APIs use REST + JSON. A PM asks "why not just use JSON everywhere?" Give the technical argument.

JSON is slower to parse
JSON does not support nested objects
Three reasons: (1) Schema enforcement: gRPC/Protobuf and Kafka/Avro schemas are enforced at the wire level, catching type errors at serialization time rather than at runtime in application code. JSON has no built-in schema enforcement. (2) Size and speed: binary formats are typically 4-10x smaller and faster to parse, which matters at 50-service scale with millions of internal RPCs per second. (3) Evolution rules: Protobuf's tag-based evolution and Avro's schema resolution give well-defined, mechanically checkable rules for backward/forward compatibility. JSON evolution is ad-hoc and error-prone. JSON is kept for external APIs because third-party developers need human readability and curl-ability

Chapter 8: Dataflow in Practice

Theory is clean. Practice is messy. Let us look at how real systems combine encoding formats, handle versioning, and debug the inevitable breakages.

The Modern Stack

Use case	Format	Schema management	Why
Service-to-service (internal)	gRPC (Protobuf)	.proto files in monorepo + buf.build	Strict types, code generation, streaming, small payloads
Event streaming	Kafka + Avro	Confluent Schema Registry (FULL mode)	Decoupled producers/consumers, automatic schema evolution
Mobile clients	Protobuf	.proto files bundled with app	Small payloads over cellular, strict types, backward compat for old app versions
Public REST APIs	JSON	OpenAPI spec (documentation, not enforcement)	Human readable, universal client support, curl-able
Data lake / batch	Avro or Parquet	Schema embedded in the file (Avro header, Parquet footer)	Self-describing files, columnar for analytics (Parquet)
Configuration	JSON, YAML, TOML	None (human-edited)	Human readability trumps all other concerns

Design Challenge: Schema Evolution Strategy for 50 Services

Imagine you are the staff engineer responsible for data contracts across a platform of 50 microservices. Here is the system you would design:

1. Schema Repository

All .proto files live in a monorepo (or a dedicated schema repo). Every service that needs a schema imports it from here. buf.build (or protoc with a custom plugin) runs lint + breaking-change detection in CI. A PR that makes a breaking change, such as changing a field's type or deleting a field without reserving its tag, is blocked automatically.

↓

2. Code Generation Pipeline

CI generates client/server stubs in every language the platform uses (Go, Python, Java, TypeScript). Generated code is published as versioned packages. Services pin to a specific schema version and upgrade on their own schedule.

↓

3. Compatibility Gate

The schema registry (for Kafka/Avro) and buf breaking (for gRPC/Protobuf) reject incompatible changes before they reach production. This is the single most important automation — it prevents the 3 AM page.

↓

4. Runtime Validation

Each service has middleware that validates incoming messages against the expected schema. Malformed messages are dead-lettered (sent to an error queue) rather than crashing the service. Monitoring dashboards track deserialization error rates per service per topic.
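
Step 4 can be sketched as a small consume loop. Everything here is illustrative: the names (`REQUIRED_FIELDS`, `validate`, `consume`) are hypothetical, JSON stands in for the real wire format so the example stays self-contained, and plain lists and dicts stand in for the dead-letter queue and the metrics backend.

```python
import json

REQUIRED_FIELDS = {"userId": int, "name": str}  # assumed expected schema

def validate(raw: bytes) -> dict:
    """Decode one message and check it against the expected fields."""
    record = json.loads(raw)
    if not isinstance(record, dict):
        raise ValueError("top-level value is not an object")
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(record.get(field), expected_type):
            raise ValueError(f"field {field!r} missing or not {expected_type.__name__}")
    return record

def consume(messages, handler, dead_letters, error_counts, topic):
    """Process valid messages; park invalid ones instead of crashing."""
    for raw in messages:
        try:
            record = validate(raw)
        except ValueError as err:  # json.JSONDecodeError is a ValueError subclass
            dead_letters.append((raw, str(err)))
            error_counts[topic] = error_counts.get(topic, 0) + 1
            continue
        handler(record)

processed, dead, errors = [], [], {}
batch = [b'{"userId": 42, "name": "Alice"}', b'{"userId": "42"}', b"not json"]
consume(batch, processed.append, dead, errors, topic="users")
print(len(processed), len(dead), errors)  # 1 2 {'users': 2}
```

Keeping the raw bytes and the error text together in the dead-letter entry is what makes later replay possible once the consumer has been fixed.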

Debug Scenarios

These are the failure modes you will encounter in production (and in interviews):

Scenario 1: "Deserialization failures after deploying a new service version."

Root cause: a required field was added. Old messages in the Kafka topic (or old records in the database) do not have this field. The new code's deserializer throws because the required field is missing. Fix: make the field optional with a default. Redeploy. The error rate drops to zero because the deserializer now fills in the default for old records.

Scenario 2: "Silent data corruption — users report missing phone numbers."

Root cause: a v1 service reads records written by v2, modifies an unrelated field, and writes back without preserving unknown fields. phone_number (tag 3) is silently dropped on every round-trip through v1. Fix: upgrade all services to a Protobuf library version that preserves unknown fields. Add a monitoring check that detects field disappearance (compare field counts before and after writes).

Scenario 3: "Kafka consumer lag spike after producer schema change."

Root cause: the producer registered a new schema that is technically compatible, but the consumer's deserialization code has a bug: it tries to cast the new field to an incompatible type in application code (e.g., the schema says the field is a string, but the consumer's application code tries to parse it as an integer). The schema registry did not catch this because it only checks wire-format compatibility, not application-level semantics. Fix: add integration tests that deserialize sample messages with every consumer. Run these tests in the producer's CI pipeline before the schema change is merged.

gRPC vs REST: A Concrete Comparison

Let us compare the same API call in both formats to make the trade-offs tangible.

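protobuf
// gRPC: Define the service in a .proto file
syntax = "proto3";

service UserService {
  rpc GetUser(GetUserRequest) returns (User);
  rpc ListUsers(ListUsersRequest) returns (stream User);  // server streaming!
}

message GetUserRequest {
  int64 user_id = 1;
}

// Request type for ListUsers; the client below passes limit=10000
message ListUsersRequest {
  int32 limit = 1;
}

message User {
  int64 user_id = 1;
  string name = 2;
  string email = 3;
  optional string phone = 4;  // explicit presence: unset is distinguishable from ""
}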

python
# gRPC client (auto-generated from .proto)
import grpc
import user_pb2, user_pb2_grpc

channel = grpc.insecure_channel('user-service:50051')
stub = user_pb2_grpc.UserServiceStub(channel)

# Type-safe, IDE-autocompleted, binary-encoded
response = stub.GetUser(user_pb2.GetUserRequest(user_id=42))
print(response.name)   # "Alice"
print(response.phone)  # "" (default, field 4 missing in old data)

# Server streaming: get 10K users without buffering all in memory
for user in stub.ListUsers(user_pb2.ListUsersRequest(limit=10000)):
    process(user)  # each user arrives as it's ready

python
# REST equivalent — compare the developer experience
import requests

# No type safety, no auto-completion, text-encoded
response = requests.get('http://user-service:8080/api/v1/users/42')
data = response.json()  # dict, not a typed object
print(data['name'])    # KeyError if field missing
print(data.get('phone', ''))  # manual default handling

# No streaming: must buffer all 10K users in one JSON array
response = requests.get('http://user-service:8080/api/v1/users?limit=10000')
users = response.json()  # entire 10K-user array in memory at once

Dimension	gRPC + Protobuf	REST + JSON
Request size (GetUser)	2-byte message (tag 0x08 + varint 42) plus 5-byte gRPC length prefix, before HTTP/2 headers	~40-byte URL and no body, before HTTP headers
Response size (1 user)	~30 bytes	~120 bytes
Type safety	Compile-time (generated stubs)	Runtime (manual validation)
Streaming	Native (server, client, bidirectional)	Workarounds (SSE, chunked, WebSocket)
Schema evolution	Protobuf tag-based rules	Ad-hoc URL/header versioning
Human readability	No (binary, need grpcurl)	Yes (curl, browser, any HTTP tool)
Browser support	Limited (grpc-web proxy needed)	Native (fetch, XMLHttpRequest)

Anti-Patterns

Anti-pattern 1: JSON for high-throughput internal communication. A team uses JSON for Kafka messages between internal services. At 100K messages/sec, they are spending 40% of CPU on JSON parsing and using 3x more network bandwidth than necessary. Switch to Avro or Protobuf.

Anti-pattern 2: Unversioned APIs. A public API returns JSON with no version indicator. When the team changes a field type from integer to string, every client breaks. The fix is URL versioning (/v1/users, /v2/users) or header versioning (Accept: application/vnd.myapi.v2+json).

Anti-pattern 3: Using the schema registry in NONE compatibility mode in production. "We'll be careful" is not an engineering strategy. FULL mode exists to enforce what "being careful" means. Every schema change that passes FULL is wire-compatible with the version it was checked against; application-level semantics, as in Scenario 3, still need tests.
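
For Anti-pattern 2, header versioning can be sketched in a few lines. This is a minimal illustration under stated assumptions: the media type `application/vnd.myapi.vN+json` is the one from the example above, `requested_version` and `render_user` are hypothetical helpers, and real content negotiation (q-values, multiple media types) is ignored.

```python
import re

_VERSION_RE = re.compile(r"application/vnd\.myapi\.v(\d+)\+json")

def requested_version(accept_header, default=1):
    """Extract the API version from an Accept header, falling back to v1."""
    match = _VERSION_RE.search(accept_header or "")
    return int(match.group(1)) if match else default

def render_user(user, version):
    """Shape the response for the version the client asked for."""
    if version >= 2:
        # Assumed v2 change: userId became a string. v1 clients keep the integer.
        return {**user, "userId": str(user["userId"])}
    return user

user = {"userId": 42, "name": "Alice"}
print(render_user(user, requested_version("application/vnd.myapi.v2+json")))
print(render_user(user, requested_version("application/json")))
```

Clients that send no version keep receiving the v1 shape, so the type change only reaches clients that opted in.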

Format Comparison

The simulation below shows the same 1000-message workload processed by different format combinations. Compare throughput, payload size, and schema safety.

Debug scenario: After a Kafka producer deploys a new Avro schema, consumers start logging "Could not find schema ID 147 in registry" errors at a rate of ~5%. The remaining 95% of messages decode fine. What is the most likely cause?

The producer is running two instances: most already have the new schema (ID 147 registered), but some instances are in the process of registering and the registration is eventually consistent. OR: the consumer's schema registry client cache has a TTL and some lookups hit a replica that has not yet replicated ID 147. The fix: ensure the producer registers the schema BEFORE sending any messages (not lazily on first send), and configure the consumer's registry client to retry on cache miss
The schema registry is running out of storage
Avro cannot handle more than 146 schema versions
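
The "retry on cache miss" part of the fix might look like the sketch below. It is not the Confluent client's API: `CachingSchemaClient` is a hypothetical wrapper around any `fetch(schema_id)` callable that raises `KeyError` for an unknown ID, and the backoff defaults to zero so the example runs instantly.

```python
import time

class CachingSchemaClient:
    """Cache schemas by ID and retry lookups that fail transiently."""

    def __init__(self, fetch, retries=3, backoff_s=0.0):
        self._fetch = fetch          # callable: schema_id -> schema, KeyError if unknown
        self._retries = retries
        self._backoff_s = backoff_s
        self._cache = {}

    def get(self, schema_id):
        if schema_id in self._cache:
            return self._cache[schema_id]
        for attempt in range(self._retries):
            try:
                schema = self._fetch(schema_id)
            except KeyError:
                # Exponential backoff before asking the registry again.
                time.sleep(self._backoff_s * (2 ** attempt))
                continue
            self._cache[schema_id] = schema
            return schema
        raise KeyError(f"schema {schema_id} not found after {self._retries} attempts")

calls = []

def flaky_fetch(schema_id):
    """Simulates a replica that has not seen the schema on the first request."""
    calls.append(schema_id)
    if len(calls) < 2:
        raise KeyError(schema_id)
    return {"type": "record", "name": "User"}

client = CachingSchemaClient(flaky_fetch)
client.get(147)
client.get(147)
print(len(calls))  # 2: one failed lookup, one success, then a cache hit
```

Schema IDs are immutable once assigned, which is why successful lookups can be cached indefinitely while failures are never cached.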

Chapter 9: Interview Arsenal

This chapter distills everything into the cheat sheets, drills, and patterns you need for system design and coding interviews.

Format Cheat Sheet

Format	Wire format	Schema?	Field identity	Evolution mechanism	Typical size (our example)
JSON	Text	No (optional JSON Schema)	String names	Ad-hoc (versioned URLs)	~45 bytes
XML	Text	Optional (XSD)	Tag names	XSD versioning	~95 bytes
MessagePack	Binary	No	String names	Same as JSON	~33 bytes
Thrift Binary	Binary	Required (.thrift)	Numeric tags	Tag-based skip	~35 bytes
Thrift Compact	Binary	Required (.thrift)	Numeric tags	Tag-based skip	~27 bytes
Protobuf	Binary	Required (.proto)	Numeric tags	Tag-based skip + defaults	~11 bytes
Avro	Binary	Required (JSON/IDL)	Name resolution	Writer/reader schema match	~10 bytes

System Design Questions

Q: "Design a schema evolution strategy for a microservices platform."

Answer framework: (1) Schema repo with CI-enforced compatibility checks (buf breaking for Protobuf, schema registry with FULL mode for Avro/Kafka). (2) Code generation pipeline publishing versioned stubs. (3) Runtime validation middleware with dead-letter queues. (4) Monitoring: deserialization error rate per service per topic. (5) Policy: all new fields optional with defaults, never reuse tag numbers, never change types.

Q: "Compare REST vs gRPC for internal services."

Answer: gRPC wins on all internal metrics: (1) Payload size: Protobuf binary is 4-10x smaller than JSON. (2) Parse speed: binary parsing is ~10x faster. (3) Schema enforcement: .proto files catch type errors at compile time, not runtime. (4) Code generation: stubs in any language for free. (5) Streaming: gRPC supports server/client/bidirectional streaming natively. (6) HTTP/2: multiplexed connections, header compression. REST wins for external APIs: human readable, universally supported, curl-testable, no build step for consumers.

Q: "A Kafka consumer is deserializing messages 10x slower than expected. Diagnose."

Answer framework: (1) Check if the consumer is hitting the schema registry on every message instead of caching. (2) Check if the consumer is using reflection-based deserialization instead of generated code. (3) Profile the deserialization: is it CPU-bound (parsing) or memory-bound (allocations)? (4) Check if the messages contain deeply nested Avro unions that require extensive resolution. (5) Compare against a baseline: deserialize 1M messages of the same schema and measure throughput.

Coding Drills

Drill 1: Write a Protobuf schema with evolution.

protobuf
// v1: initial schema
syntax = "proto3";

message UserEvent {
  int64 user_id = 1;
  string action = 2;     // "login", "purchase", etc.
  int64 timestamp = 3;    // Unix millis
}

// v2: add optional metadata field
message UserEvent {
  int64 user_id = 1;
  string action = 2;
  int64 timestamp = 3;
  string ip_address = 4;  // NEW: optional (proto3 default)
  map<string, string> metadata = 5;  // NEW: flexible key-value
}

// v3: deprecate a field, add another
message UserEvent {
  int64 user_id = 1;
  string action = 2;
  int64 timestamp = 3;
  reserved 4;             // ip_address removed, tag 4 reserved forever
  reserved "ip_address";  // name also reserved
  map<string, string> metadata = 5;
  string session_id = 6; // NEW field gets NEW tag number
}

Drill 2: Implement varint encoder/decoder.

python
def encode_varint(n):
    """Encode unsigned integer as Protobuf varint bytes."""
    buf = []
    while n > 0x7F:
        buf.append((n & 0x7F) | 0x80)
        n >>= 7
    buf.append(n)
    return bytes(buf)

def decode_varint(data):
    """Decode varint from bytes. Returns (value, bytes_consumed)."""
    result = 0
    shift = 0
    for i, byte in enumerate(data):
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):
            return result, i + 1
        shift += 7
    raise ValueError("Varint not terminated")

def zigzag_encode(n):
    return (n << 1) ^ (n >> 63)

def zigzag_decode(n):
    return (n >> 1) ^ -(n & 1)

# Test round-trips
for v in [0, 1, 127, 128, 300, 1337, 1000000]:
    encoded = encode_varint(v)
    decoded, length = decode_varint(encoded)
    assert decoded == v, f"Failed for {v}"
    print(f"{v:>10} -> {encoded.hex():12} ({length} bytes)")

# Output:
#          0 -> 00           (1 bytes)
#          1 -> 01           (1 bytes)
#        127 -> 7f           (1 bytes)
#        128 -> 8001         (2 bytes)
#        300 -> ac02         (2 bytes)
#       1337 -> b90a         (2 bytes)
#    1000000 -> c0843d       (3 bytes)
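
The round-trip test above exercises only unsigned values, so the zigzag helpers are never called. As a hedged extension, the sketch below combines the two steps the way Protobuf's sint64 type does (zigzag first, then varint). The functions are repeated so the block runs on its own; `encode_sint64` and `decode_sint64` are names chosen here for illustration.

```python
def encode_varint(n):
    """Encode a non-negative integer as varint bytes."""
    if n < 0:
        raise ValueError("varint input must be non-negative; zigzag-encode first")
    out = bytearray()
    while n > 0x7F:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)

def decode_varint(data):
    """Decode a varint. Returns (value, bytes_consumed)."""
    result = shift = 0
    for i, byte in enumerate(data):
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):
            return result, i + 1
        shift += 7
    raise ValueError("Varint not terminated")

def zigzag_encode(n):
    """Map a signed 64-bit value to unsigned: 0, -1, 1, -2 -> 0, 1, 2, 3."""
    return (n << 1) ^ (n >> 63)

def zigzag_decode(n):
    return (n >> 1) ^ -(n & 1)

def encode_sint64(n):
    return encode_varint(zigzag_encode(n))

def decode_sint64(data):
    value, consumed = decode_varint(data)
    return zigzag_decode(value), consumed

# Small magnitudes stay small; the int64 extremes need the full 10 bytes.
for v in [0, -1, 1, -2, -64, 64, -(2**63), 2**63 - 1]:
    encoded = encode_sint64(v)
    decoded, length = decode_sint64(encoded)
    assert decoded == v, f"Failed for {v}"
    print(f"{v:>20} -> {length} bytes")
```

Without zigzag, a negative int64 would have its sign bit set and always occupy 10 bytes; with it, -1 fits in one byte.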

Drill 3: Decode a Protobuf Message by Hand

This is the kind of question a senior interviewer at Google or Meta might ask. You are given raw Protobuf bytes and a schema. Decode the message.

// Given schema:
message Order {
  int64 order_id = 1;
  string product = 2;
  int32 quantity = 3;
}

// Given bytes (hex): 08 96 01 12 05 50 69 7A 7A 61 18 02

// Step 1: Parse first byte 0x08
0x08 = 0000 1000 → field_number = 08 >> 3 = 1, wire_type = 08 & 7 = 0 (varint)
// Field 1 (order_id), wire type varint

// Step 2: Parse varint 0x96 0x01
0x96 = 1|0010110 → MSB=1, data=0010110 (more bytes follow)
0x01 = 0|0000001 → MSB=0, data=0000001 (last byte)
Value = 0000001_0010110 = 150 (decimal)
// order_id = 150

// Step 3: Parse byte 0x12
0x12 = 0001 0010 → field_number = 18 >> 3 = 2, wire_type = 18 & 7 = 2 (length-delimited)
// Field 2 (product), wire type length-delimited

// Step 4: Parse length 0x05 = 5 bytes, then read 5 bytes
50 69 7A 7A 61 = "Pizza" in UTF-8
// product = "Pizza"

// Step 5: Parse byte 0x18
0x18 = 0001 1000 → field_number = 24 >> 3 = 3, wire_type = 24 & 7 = 0 (varint)
// Field 3 (quantity), wire type varint

// Step 6: Parse varint 0x02 = 2
// quantity = 2

// Result: {order_id: 150, product: "Pizza", quantity: 2}
// Total: 12 bytes. JSON equivalent: ~50 bytes.
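
The manual steps above translate almost line for line into code. The following sketch handles only the two wire types that appear in this message (0 = varint, 2 = length-delimited); `read_varint`, `decode_fields`, and the `ORDER_SCHEMA` mapping are hypothetical names for illustration, not part of any Protobuf library.

```python
def read_varint(data, pos):
    """Read one varint starting at pos. Returns (value, next_pos)."""
    result = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7

def decode_fields(data):
    """Return {field_number: raw_value} for varint and length-delimited fields."""
    fields, pos = {}, 0
    while pos < len(data):
        key, pos = read_varint(data, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:                      # varint
            value, pos = read_varint(data, pos)
        elif wire_type == 2:                    # length-delimited
            length, pos = read_varint(data, pos)
            value = data[pos:pos + length]
            pos += length
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        fields[field_number] = value
    return fields

# Field number -> (name, converter), taken from the Order schema above.
ORDER_SCHEMA = {
    1: ("order_id", int),
    2: ("product", lambda b: b.decode("utf-8")),
    3: ("quantity", int),
}

raw = bytes.fromhex("08 96 01 12 05 50 69 7A 7A 61 18 02")
order = {ORDER_SCHEMA[n][0]: ORDER_SCHEMA[n][1](v) for n, v in decode_fields(raw).items()}
print(order)  # {'order_id': 150, 'product': 'Pizza', 'quantity': 2}
```

Note that the bytes alone only yield field numbers and raw values; the names and the interpretation of field 2 as a string come entirely from the schema.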

The Five Interview Dimensions Applied

Dimension	Example question	What a staff answer includes
Concept	"Explain forward vs backward compatibility"	Concrete rolling deployment scenario, not just definitions. "During a 30-minute deploy, server 3 is v2 but server 5 is still v1..."
Design	"Design a data contract system for 50 services"	Schema repo + CI + registry + dead-letter queues + monitoring. Time horizons: this quarter, next generation, hardware transition.
Code	"Implement a varint encoder"	Clean code + edge cases (value=0, max int64) + test cases + complexity analysis (O(log n) bytes)
Debug	"Deserialization errors after deploy"	Systematic: check schema change → check compatibility mode → check if required field added → check consumer cache → check schema registry replication
Frontier	"What's next after Protobuf?"	Cap'n Proto (zero-copy), FlatBuffers (random access without parsing), Protobuf Editions (Google's next-gen), Apache Arrow (columnar in-memory format)

Staff-level design: You're building a platform where mobile apps (updated monthly), backend services (deployed daily), and batch pipelines (run weekly on historical data) all share the same user event schema. Which encoding format and evolution strategy would you choose, and why?

JSON everywhere: it's the simplest and most universal
Protobuf for mobile-to-backend (small payloads, strict types, code generation for iOS/Android). Kafka + Avro + Schema Registry (FULL mode) for the event stream that feeds batch pipelines (self-describing for historical reads, automatic schema evolution). Schema repo in CI with buf breaking for Protobuf and compatibility checks for Avro. Mobile apps pin to a schema version and are backward-compatible for 6 months. Batch pipelines use Avro container files with embedded writer schemas so they can read data from any era. All new fields are optional with defaults. Tag numbers are never reused.
gRPC everywhere, including mobile and batch

Chapter 10: Connections

Encoding and evolution do not exist in isolation. They interact deeply with every other layer of a data-intensive system. Here is where this chapter connects to the rest of the DDIA landscape.

Where Encoding Fits

Related topic	Connection	Lesson
Data Models (Ch 3)	The data model determines what you serialize. Relational tables map naturally to Avro schemas. Document models (JSON) map to... JSON. Graph models need custom serialization for edges and vertices.	Data Models & Query Languages
Storage & Retrieval (Ch 4)	Storage engines use encoding internally. LSM-trees write SSTables in a binary format. B-trees store fixed-size pages. Column stores use specialized compression (run-length, bitmap, dictionary) that is itself a form of encoding.	Storage & Retrieval
Replication (Ch 6)	Replication logs must be encoded in a format that supports schema evolution. A follower running an older version must be able to read the leader's replication stream (forward compatibility). Statement-based replication encodes SQL text. Row-based replication encodes binary tuples.	Replication
Partitioning (Ch 7)	Partitioning keys must survive encoding round-trips. If your partition key is a 64-bit integer and your encoding truncates it to 32 bits, records will hash to the wrong partition.	Partitioning
Stream Processing (Ch 12)	Kafka + Avro + Schema Registry is the dominant pattern for stream processing. Every concept from this chapter — schema evolution, compatibility modes, schema resolution — is the foundation of a reliable streaming platform.	Stream Processing

The Encoding Decision Tree

When choosing a format for a new system, walk this tree:

Is the consumer a web browser or third-party developer?

Yes → JSON (with OpenAPI spec for documentation). Human readability and universal support trump all other concerns.

↓ No

Do you need the data to be self-describing (no shared schema)?

Yes → MessagePack (binary JSON) or Avro container files (binary with embedded schema). Both are language-independent and self-contained.

↓ No

Is the data flowing through Kafka or a message queue?

Yes → Avro + Schema Registry. The registry provides schema discovery and compatibility enforcement for decoupled producers/consumers.

↓ No

Is it service-to-service RPC?

Yes → gRPC (Protobuf). Code generation, streaming, strict types, compact binary encoding. The modern standard for internal microservice communication.
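
The tree is a chain of ordered checks, which a short function can capture. This is only a restatement of the four questions above: `choose_format` and its parameter names are hypothetical, and the final fallback string is an assumption for the case the tree does not cover.

```python
def choose_format(public_consumer, self_describing, via_message_queue, service_rpc):
    """Walk the decision tree top to bottom; the first 'yes' wins."""
    if public_consumer:
        return "JSON (with OpenAPI spec)"
    if self_describing:
        return "MessagePack or Avro container files"
    if via_message_queue:
        return "Avro + Schema Registry"
    if service_rpc:
        return "gRPC (Protobuf)"
    return "no recommendation from this tree"

# An internal Kafka topic: not public, schema is shared, flows through a queue.
print(choose_format(False, False, True, False))  # Avro + Schema Registry
```

Because the checks are ordered, a public API that also happens to sit behind a queue still resolves to JSON: the audience outranks the transport.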

What We Did Not Cover

This chapter focused on the fundamental concepts. More advanced topics include:

Cap'n Proto — Zero-copy serialization. The on-wire format IS the in-memory format. No encoding/decoding step at all. Trades portability for speed.
FlatBuffers (Google) — Random access to individual fields without parsing the entire message. Used heavily in games and mobile apps.
Apache Arrow — Columnar in-memory format for analytics. Not a serialization format per se, but a shared memory layout that eliminates serialization between analytics tools.
Protocol Buffer Editions — Google's successor to proto2/proto3 syntax, designed to unify the divergent feature sets.

The Feynman test. If someone wakes you at 3 AM and asks "why does your service use Protobuf internally and JSON externally?" you should be able to answer in one sentence: "Protobuf gives us schema-enforced types, 4x smaller payloads, and guaranteed forward/backward compatibility for rolling deploys, while JSON gives external developers human readability and universal client support." If you can say that, you have internalized this chapter.

"The key idea is that old and new code, old and new data, must coexist. Encoding formats are the contracts that make coexistence possible." — Martin Kleppmann, DDIA

Final synthesis: A startup uses JSON everywhere — REST APIs, Kafka messages, database storage (Postgres JSONB columns). They have 12 services and are growing to 50. What is the single highest-impact change they should make first?

Switch all services to XML for better schema support
Add a schema registry (Confluent) for their Kafka topics and migrate internal Kafka messages from JSON to Avro. This is the highest-impact single change because: (1) Kafka messages are the highest-volume data path, so the size/speed improvement is biggest here. (2) The schema registry enforces compatibility, preventing the silent data corruption that will become catastrophic at 50 services. (3) It is incremental: they can migrate one topic at a time without touching REST APIs or the database. REST JSON for external APIs and JSONB for the database are fine for now.
Rewrite everything in gRPC immediately