JSON, Protobuf, Avro, schema evolution — how data survives change.
Your service stores user profiles as JSON. Version 1 looks like this:
json { "userId": 42, "name": "Alice", "email": "alice@example.com" }
Today you ship version 2 with a new field:
json { "userId": 42, "name": "Alice", "email": "alice@example.com", "phoneNumber": "+1-555-0123" }
Seems harmless. But here is the nightmare scenario: you have 50 servers, and you deploy the new code via a rolling deployment — one server at a time, over 30 minutes. For those 30 minutes, some servers run v1 (old code, no phone_number) and some run v2 (new code, expects phone_number). All of them read and write to the same database.
Now consider the chaos:
| Writer | Reader | What happens |
|---|---|---|
| v2 server writes phone_number | v1 server reads it | v1 sees an unknown field. Does it crash? Ignore it? Silently drop it on the next write? |
| v1 server writes (no phone_number) | v2 server reads it | v2 expects phone_number. Does it crash? Use a default? Show blank in the UI? |
| v2 server reads, then writes back | v1 server reads the result | If v1 re-writes the record, does it strip phone_number? Data loss. |
And it gets worse. This isn't just about databases. The same version mismatch happens with:
The simulation below shows a rolling deployment in progress. Six servers are being upgraded from v1 to v2 one at a time. All of them read and write to the same database. Watch how old and new code coexist, and notice the window of danger when both versions are active simultaneously.
Watch servers upgrade one by one. Observe read/write conflicts between versions.
During the upgrade window, every database write is a potential compatibility issue. A v2 server writes a record with phone_number. Moments later a v1 server reads it, ignores the unknown field, then writes it back — erasing phone_number forever. This silent data loss is one of the most common bugs in production distributed systems.
To survive rolling deployments, your data format must support two properties:
Forward compatibility: old code can read data written by new code. The v1 server encounters phone_number. It doesn't know what phone_number is, but it doesn't crash — it ignores the unknown field and preserves it when writing back.
Backward compatibility: new code can read data written by old code. The v2 server reads a record without phone_number. It doesn't crash — it fills in a sensible default (null, or an empty string) and continues.
Some encoding formats make both of these easy. Some make them impossible. That is the subject of this lesson.
Before we compare JSON vs Protobuf vs Avro, we need to understand why encoding exists at all. The answer comes down to two fundamentally different representations of data.
In memory, data lives as objects, structs, hash maps, arrays, and trees. These structures are optimized for CPU access: pointers let you jump to any field in O(1), data can reference other data via shared pointers, and the layout is designed for the memory hierarchy (cache lines, page alignment). A Python dict, a Java HashMap, a Go struct — these are all in-memory representations.
On the wire (or on disk), data must become a flat sequence of bytes. You cannot send a pointer over the network — the receiving process has a completely different memory space. The byte sequence must be self-contained: it carries all the information needed to reconstruct the original data, with no external references.
The translation from in-memory to bytes is called encoding (also called serialization or marshalling). The reverse — bytes back to in-memory objects — is decoding (also called deserialization, parsing, or unmarshalling).
Many programming languages offer built-in serialization: Java has java.io.Serializable, Python has pickle, Ruby has Marshal. These are convenient — you can serialize any in-memory object in one line of code. But they have severe problems:
| Problem | Why it matters |
|---|---|
| Language lock-in | A Python pickle can only be read by Python. If your analytics team uses Java or your mobile app uses Kotlin, they cannot read the data. |
| Security | Deserializing arbitrary bytes can execute code. Python's pickle.loads() will happily run os.system("rm -rf /") if the attacker crafted the bytes. This is not hypothetical — pickle deserialization attacks are a standard exploit class. |
| Versioning | If you add a field to a class and try to deserialize old data, you often get cryptic errors. There is no built-in mechanism for forward or backward compatibility. |
| Performance | These formats are often slower and produce larger outputs than purpose-built alternatives like Protobuf. |
The simulation below shows an in-memory object being encoded into bytes, transmitted, and decoded back into an in-memory object. Notice the information that must be preserved (field names, types, values) and the information that is lost (memory addresses, method definitions, private internal state).
Click Encode to serialize the object, then Transmit to send, then Decode to reconstruct.
| Information | In Memory | On Wire |
|---|---|---|
| Field names and values | Yes | Yes (must be preserved) |
| Data types (int, string, bool) | Yes (implicit from language) | Depends on format |
| Memory addresses (pointers) | Yes | No (meaningless to receiver) |
| Methods / class behavior | Yes | No (receiver has its own code) |
| Schema / version info | Implicit | Depends on format |
The encoding format you choose determines which columns say "Yes" and which say "Depends." JSON preserves field names as strings but loses precise types (is 42 an int32 or an int64?). Protobuf preserves field tags (numbers, not names) and precise types, but requires a schema to decode. Avro preserves field names and types but requires the writer's schema at read time. Each choice has consequences for compatibility, performance, and usability.
If language-specific formats are a trap, what should you use? The first answer most developers reach for is a human-readable text format: JSON, XML, or CSV. These formats are widely supported, easy to debug (you can open them in a text editor), and language-independent. But they each have subtle problems that bite you at scale.
JSON (JavaScript Object Notation) is the dominant format for web APIs. Its strengths are real: every language has a JSON library, it is human-readable, and it supports nested objects and arrays. But JSON has three landmines that every systems engineer must know about.
Landmine 1: Number ambiguity. JSON does not distinguish between integers and floating-point numbers. The value 42 might be parsed as a 32-bit int, a 64-bit int, or a 64-bit float — it depends entirely on the parser implementation. This matters enormously for large integers.
Number type is an IEEE 754 double-precision float with 53 bits of mantissa. Any integer larger than 253 (9,007,199,254,740,992) loses precision when parsed by a JavaScript client. If your database assigns 64-bit integer IDs, a JavaScript front-end will silently corrupt them. Twitter famously hit this bug: their tweet IDs exceeded 253, and JavaScript clients started returning wrong IDs. The fix? Send large integers as strings in JSON. Twitter's API returns both "id": 12345678901234567 and "id_str": "12345678901234567".Landmine 2: No binary support. JSON can only represent text. If you need to embed binary data (images, encrypted blobs, compressed data), you must encode it as Base64 — which inflates the data by ~33%. A 1 MB image becomes ~1.37 MB in the JSON payload.
Landmine 3: No schema. JSON is schemaless by default. A field can be a string in one record and a number in another. There is no enforcement mechanism built into the format. JSON Schema exists but it is a documentation tool, not a wire-format enforcement tool — there is nothing in the JSON bytes that says "this field must be an integer."
XML (Extensible Markup Language) was the dominant data format before JSON took over. It supports schemas (XSD), namespaces, attributes vs. elements, and XSLT transformations. But it is verbose — every value is wrapped in opening and closing tags — and the schema language (XSD) is notoriously complex. A simple user profile in XML is roughly 2-3x the size of the equivalent JSON.
CSV (Comma-Separated Values) is barely a format at all. There is no official standard (RFC 4180 is a best-effort attempt). There is no schema. There is no type information. If a value contains a comma, you quote it. If a value contains a quote, you escape it. If a value contains a newline... it depends on the parser. CSV is fine for exporting spreadsheets. It is not a serious option for inter-service communication.
The simulation below encodes the same data in JSON, XML, and CSV. Watch the byte counts. Also notice where JSON silently corrupts a large integer.
Compare byte sizes. Watch the integer precision problem in JSON.
json // JSON: 67 bytes {"userId":42,"name":"Alice","email":"alice@example.com"}
xml <!-- XML: 138 bytes --> <user> <userId>42</userId> <name>Alice</name> <email>alice@example.com</email> </user>
csv # CSV: 40 bytes (but no schema, no types, no nesting) 42,Alice,alice@example.com
CSV is smallest, but it cannot represent nested data, has no type information, and breaks if any field contains a comma. XML is largest because of the tag overhead. JSON is the sweet spot for human-readability, but still suffers from the type ambiguity and size overhead we discussed.
JSON is human-readable but wastes bytes. Every field name is repeated in full text for every single record. The string "favoriteNumber" is 16 bytes — and you pay that cost in every message, every record, millions of times per day. Binary encoding formats solve this by replacing text with compact binary representations.
MessagePack is the simplest binary format. It is essentially JSON with a binary wire format: field names are still stored as strings, but values are encoded more compactly (integers use variable-length encoding, strings carry explicit lengths instead of quote delimiters). A JSON object like {"userName":"Martin","favoriteNumber":1337} shrinks from ~45 bytes to ~33 bytes.
MessagePack is a quick win — smaller than JSON, faster to parse — but it still carries full field names. The real savings come from formats that eliminate field names entirely.
Protocol Buffers (Protobuf, created at Google) and Apache Thrift (created at Facebook) take a different approach. Instead of encoding field names, they assign each field a numeric tag in a schema definition. The schema is shared between sender and receiver ahead of time. The wire format only carries the tag numbers, not the names.
protobuf // Protocol Buffers schema (.proto file) message Person { string user_name = 1; // tag 1 int64 favorite_number = 2; // tag 2 }
thrift // Thrift schema (.thrift file) struct Person { 1: string userName, 2: i64 favoriteNumber, }
When this message is encoded, the bytes contain (tag=1, type=string, value="Martin") and (tag=2, type=varint, value=1337). The tag is a small integer (1-2 bytes), not a 14-character string. This saves enormous amounts of space when you have millions of records.
Protobuf does not use fixed-width integers. The number 1 should not cost 8 bytes just because the schema says int64. Instead, Protobuf uses varint encoding: each byte uses 7 bits for data and 1 bit (the MSB, most significant bit) as a continuation flag. If the MSB is 1, more bytes follow. If it is 0, this is the last byte.
For signed integers, Protobuf uses ZigZag encoding to map signed values to unsigned varints: 0 → 0, -1 → 1, 1 → 2, -2 → 3, 2 → 4, and so on. This keeps small negative numbers small (without ZigZag, -1 as a 64-bit two's complement would encode as a 10-byte varint).
Thrift offers two encoding formats with different trade-offs:
| Format | Field header | Integer encoding | Our example |
|---|---|---|---|
| BinaryProtocol | type (1 byte) + tag (2 bytes) = 3 bytes | Fixed-width (i64 = 8 bytes always) | ~35 bytes |
| CompactProtocol | type+tag in 1-2 bytes (delta encoding) | Varint + ZigZag | ~27 bytes |
CompactProtocol packs the field type and tag delta into a single byte when the tag delta is small (which it usually is, since tags are sequential). This makes it almost as compact as Protobuf.
The simulation below encodes the same record in four formats. Watch how field names disappear in Protobuf and Thrift, replaced by tiny tag numbers. Compare the final byte counts.
Same data: {"userName": "Martin", "favoriteNumber": 1337}. Compare byte sizes across formats.
Let us encode {"userName": "Martin", "favoriteNumber": 1337} in Protobuf, byte by byte:
Eleven bytes. The JSON version was roughly 45 bytes. That is a 4x reduction for a single tiny record. At a million records per second, that is 34 MB/s of saved bandwidth. At a billion records in a database, that is 34 GB of saved storage.
python # Varint encoder in Python — interview coding drill def encode_varint(value): """Encode a non-negative integer as a Protobuf varint.""" result = [] while value > 0x7F: result.append((value & 0x7F) | 0x80) # 7 data bits + MSB=1 value >>= 7 result.append(value & 0x7F) # last byte: MSB=0 return bytes(result) def zigzag_encode(value): """ZigZag: map signed int to unsigned for efficient varint.""" return (value << 1) ^ (value >> 63) # Test print(encode_varint(1).hex()) # "01" (1 byte) print(encode_varint(300).hex()) # "ac02" (2 bytes) print(encode_varint(1337).hex()) # "b90a" (2 bytes) print(zigzag_encode(-1)) # 1 print(zigzag_encode(1)) # 2
We have established that Protobuf and Thrift are smaller and faster than JSON. But the real reason to use them is schema evolution. Because these formats use numeric field tags instead of string names, and because they have explicit rules for handling unknown fields, they make it possible to evolve your data format safely over time.
These rules determine what changes are safe and what changes will break your system. Memorize them for interviews.
Rule 1: You can add new fields with new tag numbers. If you add a field with tag 3 to a schema that only had tags 1 and 2, old code will simply skip tag 3 when it encounters it in the data. The field is unknown, and the decoder skips it using the wire type to determine how many bytes to consume. New code, when reading old data, will see that tag 3 is missing and use the default value.
Rule 2: You cannot rename fields — but you don't need to. Field names only exist in the schema definition, not in the wire format. The name user_name is purely for programmer convenience. You can rename it to display_name in your .proto file without changing a single byte of encoded data, as long as the tag number stays the same.
Rule 3: You cannot change a field's tag number. The tag IS the field's identity. Changing tag 2 from favorite_number to tag 5 means every existing encoded record still says "tag 2 = 1337," but your new code looks for tag 5 and finds nothing. The data is lost.
Rule 4: Never add a required field after initial deployment. If you add a required field, old code will produce records without it. When new code reads those records, it will reject them because the required field is missing. This breaks backward compatibility. In Protobuf 3, all fields are optional by default — Google learned this lesson the hard way.
Rule 5: You can change types if the wire format is compatible. Changing int32 to int64 is safe — both use varint encoding, and the bits are compatible. But reading a 64-bit value with 32-bit code may truncate (the high bits are silently discarded). Changing int32 to string is not safe — the wire types are different and the decoder will misinterpret the bytes.
The simulation below shows two schema versions side by side. Try different evolution operations and see which ones are safe (green) and which ones break compatibility (red). This is exactly the kind of reasoning you must do in a system design interview when discussing API versioning.
Try different schema changes. Green = safe, Red = breaks compatibility.
| Change | Forward compatible? | Backward compatible? | Safe? |
|---|---|---|---|
| Add optional field (new tag) | Yes (old code skips unknown tag) | Yes (new code uses default) | Yes |
| Add required field | Yes | No (old data missing required field) | No |
| Remove optional field | Yes (new code uses default) | Yes (old code skips the missing tag) | Yes (but never reuse the tag number) |
| Rename field | Yes (wire uses tags not names) | Yes | Yes |
| Change tag number | No (old data uses old tag) | No (new code looks for new tag) | No |
| int32 → int64 | Yes (but may truncate) | Yes | Careful |
| int32 → string | No (wire type mismatch) | No | No |
reserved in the schema. If a future developer accidentally reuses that tag for a new field with a different type, old data in the database (which still uses that tag with the old type) will be misinterpreted. Protobuf has a reserved keyword for exactly this purpose: reserved 3, 4; reserved "old_field_name";required string phone_number = 3; field to your Protobuf schema and deploy the new server. Immediately, 30% of requests start failing with deserialization errors. Why, and what is the fix?Protobuf and Thrift solve schema evolution with field tags. Apache Avro takes a radically different approach: no tags at all. Fields are matched by name, and the encoded data contains no field identifiers whatsoever — not tags, not names, nothing. Instead, the writer's schema and reader's schema are resolved at read time.
An Avro schema looks like this:
json { "type": "record", "name": "Person", "fields": [ {"name": "userName", "type": "string"}, {"name": "favoriteNumber", "type": ["null", "long"], "default": null} ] }
When Avro encodes this record, it writes the fields in schema order with no field identifiers:
But wait — if there are no tags and no names, how does the reader know which bytes correspond to which field? The answer: the reader must have the writer's schema. Without it, the bytes are meaningless — just a stream of values with no field identity.
This is Avro's key innovation. The writer and reader can have different schemas, and Avro resolves them at read time by matching fields by name:
And in reverse:
Three strategies, each suited to a different dataflow pattern:
| Strategy | How it works | Best for |
|---|---|---|
| Avro container file | The writer's schema is embedded at the beginning of the file. All records in the file use that schema. One schema per file. | Hadoop, data lake files, batch processing |
| Connection negotiation | Writer and reader exchange schemas during connection setup (Avro RPC protocol). | RPC between long-lived services |
| Schema registry | Writer stores its schema in a central registry and embeds the schema ID in each message. Reader looks up the schema by ID. | Kafka + Avro (the standard pattern) |
The simulation below shows writer and reader schemas side by side. Watch how Avro matches fields by name, fills in defaults for missing fields, and skips fields the reader does not know about.
Writer and reader schemas differ. Watch field matching, defaults, and skipping.
Avro's killer feature is that schemas can be generated programmatically. Consider a database table:
sql CREATE TABLE users ( user_id BIGINT, name VARCHAR(255), email VARCHAR(255) );
You can automatically generate an Avro schema from this DDL. If someone runs ALTER TABLE users ADD COLUMN phone TEXT;, a new Avro schema is generated automatically. The old and new schemas coexist through Avro's resolution mechanism. No one has to manually assign tag numbers. No one can accidentally reuse a tag. The database schema is the serialization schema.
This is why Avro dominates in the Hadoop ecosystem: data pipelines that export database tables to HDFS can evolve their schemas automatically as the source database evolves.
| Property | Avro | Protobuf |
|---|---|---|
| Field identity | Names (resolved at read time) | Numeric tags (embedded in bytes) |
| Schema in bytes? | No — needs writer schema from elsewhere | No — but tags are self-identifying |
| Dynamic schemas | Excellent (generate from DB DDL) | Poor (must assign tag numbers manually) |
| Smallest encoding | Yes (no tags, no names in bytes) | Close (1-2 bytes per field for tag+type) |
| Random access | No (must read in schema order) | Yes (can skip to any field by tag) |
| Code generation | Optional (reflection works) | Primary workflow (protoc generates stubs) |
| Best for | Kafka, Hadoop, data pipelines | gRPC, mobile, microservices |
python # Writing and reading Avro with schema evolution import avro.schema, avro.io, avro.datafile import io # Writer schema v1 writer_schema = avro.schema.parse(""" { "type": "record", "name": "User", "fields": [ {"name": "userName", "type": "string"}, {"name": "favoriteNumber", "type": ["null", "long"], "default": null} ] } """) # Encode a record buf = io.BytesIO() encoder = avro.io.BinaryEncoder(buf) writer = avro.io.DatumWriter(writer_schema) writer.write({"userName": "Martin", "favoriteNumber": 1337}, encoder) raw_bytes = buf.getvalue() print(f"Encoded: {raw_bytes.hex()} ({len(raw_bytes)} bytes)") # Encoded: 0c4d617274696e02f214 (10 bytes) # Reader schema v2 — has an extra field with default reader_schema = avro.schema.parse(""" { "type": "record", "name": "User", "fields": [ {"name": "userName", "type": "string"}, {"name": "favoriteNumber", "type": ["null", "long"], "default": null}, {"name": "email", "type": ["null", "string"], "default": null} ] } """) # Decode v1 data with v2 reader schema — email gets default buf = io.BytesIO(raw_bytes) decoder = avro.io.BinaryDecoder(buf) reader = avro.io.DatumReader(writer_schema, reader_schema) record = reader.read(decoder) print(record) # {'userName': 'Martin', 'favoriteNumber': 1337, 'email': None} # email filled with default — backward compatibility works!
We have seen that Avro needs the writer's schema at read time, and that Protobuf/Thrift need the .proto/.thrift files to decode messages. But how do you distribute these schemas across a distributed system with dozens of services? You cannot email .proto files around. You need a schema registry.
A schema registry is a centralized service that stores versioned schemas and enforces compatibility rules. The most widely deployed is the Confluent Schema Registry, designed to work with Apache Kafka. Here is the workflow:
The 5-byte header is a brilliant design: it adds negligible overhead (5 bytes per message) but gives you full schema evolution support. Compare this to embedding the entire schema in every message (which Avro container files do for HDFS) — for a Kafka topic with millions of small messages, embedding the schema would be catastrophically wasteful.
The registry does not just store schemas — it enforces rules about which schema changes are allowed. There are four compatibility modes:
| Mode | Rule | What it allows | Use case |
|---|---|---|---|
| BACKWARD | New schema can read old data | Add fields with defaults, remove fields | Consumer upgrades first, then producer |
| FORWARD | Old schema can read new data | Remove fields, add optional fields | Producer upgrades first, then consumer |
| FULL | Both backward AND forward | Add/remove optional fields with defaults | Rolling deployments (safest) |
| NONE | No checks | Anything | Development only (dangerous in prod) |
The simulation below animates the complete workflow: producer registers schema, registry checks compatibility, producer sends data with schema ID, consumer looks up schema, consumer decodes. Watch what happens when a compatibility violation is attempted.
Watch the producer register a schema, send data, and the consumer decode it. Try a breaking change.
python # Producing Avro messages with Confluent Schema Registry from confluent_kafka import SerializingProducer from confluent_kafka.schema_registry import SchemaRegistryClient from confluent_kafka.schema_registry.avro import AvroSerializer # Connect to registry registry = SchemaRegistryClient({'url': 'http://schema-registry:8081'}) # Define schema (or load from file) schema_str = """ { "type": "record", "name": "User", "fields": [ {"name": "userId", "type": "long"}, {"name": "name", "type": "string"}, {"name": "email", "type": ["null", "string"], "default": null} ] } """ # Create serializer (auto-registers schema with registry) avro_serializer = AvroSerializer(registry, schema_str) producer = SerializingProducer({ 'bootstrap.servers': 'kafka:9092', 'value.serializer': avro_serializer, }) # Produce a message — schema ID is prepended automatically producer.produce( topic='users', value={'userId': 42, 'name': 'Alice', 'email': 'alice@example.com'} ) producer.flush()
age of type int with no default value. What happens?Schema evolution does not exist in a vacuum. The way data flows between processes determines which compatibility properties matter and which encoding formats are practical. There are three fundamental modes of dataflow, each with different implications for encoding and evolution.
When you write to a database, you are encoding data for your future self — or for a colleague who will read it months or years from now. The writer encodes, and a potentially very different reader decodes. This means you need both forward compatibility (old application reading newly-written data) and backward compatibility (new application reading old data that was written years ago).
The most dangerous scenario: a v1 application reads a record written by v2 (which has extra fields), modifies an unrelated field, and writes it back. If the application does not preserve unknown fields, the v2 fields are silently erased. This is called the read-modify-write hazard, and it is the single most common source of silent data corruption in production systems.
In service-to-service communication, one process (the client) makes a request and another (the server) responds. The client and server may be running different versions of the code.
| Protocol | Encoding | Schema enforcement | Evolution |
|---|---|---|---|
| REST + JSON | JSON text | None (OpenAPI is documentation) | Ad-hoc versioning (URL path, headers) |
| gRPC | Protobuf binary | Strict (.proto file required) | Protobuf evolution rules |
| Thrift RPC | Thrift binary | Strict (.thrift file required) | Thrift evolution rules |
| GraphQL | JSON | Schema (SDL) | Add fields freely, deprecate old ones |
gRPC (Protobuf over HTTP/2) has become the standard for internal service-to-service communication. It is smaller and faster than REST+JSON, has built-in schema evolution via Protobuf, supports streaming (server-side, client-side, and bidirectional), and generates client/server code automatically from .proto files.
In asynchronous messaging (Kafka, RabbitMQ, SQS), the producer and consumer are fully decoupled. The producer may have been deployed weeks before the consumer. Messages may sit in the queue for hours or days. The consumer may be running code that was written before the new fields even existed.
This is where schema registries shine. The producer registers its schema (with compatibility checks), embeds the schema ID in each message, and the consumer resolves schemas at read time. Without a registry, you are flying blind — every consumer must somehow know which schema the producer was using, and version mismatches cause silent data corruption.
The simulation below shows data flowing through all three modes simultaneously. You can modify the schema (add a field) and watch how the change propagates. Notice where compatibility breaks happen and where the system handles evolution gracefully.
Add a field to the schema and watch it propagate through databases, APIs, and message queues. Observe compatibility outcomes.
This is so important that it deserves a concrete, step-by-step walkthrough:
The fix depends on the encoding format. Protobuf and Avro decoders can be configured to preserve unknown fields: when v1 code reads a record with unknown tags, it stores the raw bytes in a side buffer and re-emits them on write. This "round-trip safety" is not automatic in all implementations — you must verify that your serialization library supports it.
Theory is clean. Practice is messy. Let us look at how real systems combine encoding formats, handle versioning, and debug the inevitable breakages.
| Use case | Format | Schema management | Why |
|---|---|---|---|
| Service-to-service (internal) | gRPC (Protobuf) | .proto files in monorepo + buf.build | Strict types, code generation, streaming, small payloads |
| Event streaming | Kafka + Avro | Confluent Schema Registry (FULL mode) | Decoupled producers/consumers, automatic schema evolution |
| Mobile clients | Protobuf | .proto files bundled with app | Small payloads over cellular, strict types, backward compat for old app versions |
| Public REST APIs | JSON | OpenAPI spec (documentation, not enforcement) | Human readable, universal client support, curl-able |
| Data lake / batch | Avro or Parquet | Schema embedded in file header | Self-describing files, columnar for analytics (Parquet) |
| Configuration | JSON, YAML, TOML | None (human-edited) | Human readability trumps all other concerns |
Imagine you are the staff engineer responsible for data contracts across a platform of 50 microservices. Here is the system you would design:
These are the failure modes you will encounter in production (and in interviews):
Scenario 1: "Deserialization failures after deploying a new service version."
Root cause: a required field was added. Old messages in the Kafka topic (or old records in the database) do not have this field. The new code's deserializer throws because the required field is missing. Fix: make the field optional with a default. Redeploy. The error rate drops to zero because the deserializer now fills in the default for old records.
Scenario 2: "Silent data corruption — users report missing phone numbers."
Root cause: a v1 service reads records written by v2, modifies an unrelated field, and writes back without preserving unknown fields. phone_number (tag 3) is silently dropped on every round-trip through v1. Fix: upgrade all services to a Protobuf library version that preserves unknown fields. Add a monitoring check that detects field disappearance (compare field counts before and after writes).
Scenario 3: "Kafka consumer lag spike after producer schema change."
Root cause: the producer registered a new schema that is technically compatible, but the consumer's deserialization code has a bug: it tries to cast the new field to an incompatible type in application code (e.g., the schema says the field is a string, but the consumer's application code tries to parse it as an integer). The schema registry did not catch this because it only checks wire-format compatibility, not application-level semantics. Fix: add integration tests that deserialize sample messages with every consumer. Run these tests in the producer's CI pipeline before the schema change is merged.
Let us compare the same API call in both formats to make the trade-offs tangible.
protobuf // gRPC: Define the service in a .proto file service UserService { rpc GetUser(GetUserRequest) returns (User); rpc ListUsers(ListUsersRequest) returns (stream User); // server streaming! } message GetUserRequest { int64 user_id = 1; } message User { int64 user_id = 1; string name = 2; string email = 3; optional string phone = 4; }
python # gRPC client (auto-generated from .proto) import grpc import user_pb2, user_pb2_grpc channel = grpc.insecure_channel('user-service:50051') stub = user_pb2_grpc.UserServiceStub(channel) # Type-safe, IDE-autocompleted, binary-encoded response = stub.GetUser(user_pb2.GetUserRequest(user_id=42)) print(response.name) # "Alice" print(response.phone) # "" (default, field 4 missing in old data) # Server streaming: get 10K users without buffering all in memory for user in stub.ListUsers(user_pb2.ListUsersRequest(limit=10000)): process(user) # each user arrives as it's ready
python # REST equivalent — compare the developer experience import requests # No type safety, no auto-completion, text-encoded response = requests.get('http://user-service:8080/api/v1/users/42') data = response.json() # dict, not a typed object print(data['name']) # KeyError if field missing print(data.get('phone', '')) # manual default handling # No streaming: must buffer all 10K users in one JSON array response = requests.get('http://user-service:8080/api/v1/users?limit=10000') users = response.json() # entire 10K-user array in memory at once
| Dimension | gRPC + Protobuf | REST + JSON |
|---|---|---|
| Request size (GetUser) | ~5 bytes (varint 42) | ~45 bytes (URL + headers) |
| Response size (1 user) | ~30 bytes | ~120 bytes |
| Type safety | Compile-time (generated stubs) | Runtime (manual validation) |
| Streaming | Native (server, client, bidirectional) | Workarounds (SSE, chunked, WebSocket) |
| Schema evolution | Protobuf tag-based rules | Ad-hoc URL/header versioning |
| Human readability | No (binary, need grpcurl) | Yes (curl, browser, any HTTP tool) |
| Browser support | Limited (grpc-web proxy needed) | Native (fetch, XMLHttpRequest) |
The simulation below shows the same 1000-message workload processed by different format combinations. Compare throughput, payload size, and schema safety.
Compare JSON, MessagePack, Protobuf, and Avro across three dimensions.
This chapter distills everything into the cheat sheets, drills, and patterns you need for system design and coding interviews.
| Format | Wire format | Schema? | Field identity | Evolution mechanism | Typical size (our example) |
|---|---|---|---|---|---|
| JSON | Text | No (optional JSON Schema) | String names | Ad-hoc (versioned URLs) | ~45 bytes |
| XML | Text | Optional (XSD) | Tag names | XSD versioning | ~95 bytes |
| MessagePack | Binary | No | String names | Same as JSON | ~33 bytes |
| Thrift Binary | Binary | Required (.thrift) | Numeric tags | Tag-based skip | ~35 bytes |
| Thrift Compact | Binary | Required (.thrift) | Numeric tags | Tag-based skip | ~27 bytes |
| Protobuf | Binary | Required (.proto) | Numeric tags | Tag-based skip + defaults | ~11 bytes |
| Avro | Binary | Required (JSON/IDL) | Name resolution | Writer/reader schema match | ~10 bytes |
Q: "Design a schema evolution strategy for a microservices platform."
Answer framework: (1) Schema repo with CI-enforced compatibility checks (buf breaking for Protobuf, schema registry with FULL mode for Avro/Kafka). (2) Code generation pipeline publishing versioned stubs. (3) Runtime validation middleware with dead-letter queues. (4) Monitoring: deserialization error rate per service per topic. (5) Policy: all new fields optional with defaults, never reuse tag numbers, never change types.
Q: "Compare REST vs gRPC for internal services."
Answer: gRPC wins on all internal metrics: (1) Payload size: Protobuf binary is 4-10x smaller than JSON. (2) Parse speed: binary parsing is ~10x faster. (3) Schema enforcement: .proto files catch type errors at compile time, not runtime. (4) Code generation: stubs in any language for free. (5) Streaming: gRPC supports server/client/bidirectional streaming natively. (6) HTTP/2: multiplexed connections, header compression. REST wins for external APIs: human readable, universally supported, curl-testable, no build step for consumers.
Q: "A Kafka consumer is deserializing messages 10x slower than expected. Diagnose."
Answer framework: (1) Check if the consumer is hitting the schema registry on every message instead of caching. (2) Check if the consumer is using reflection-based deserialization instead of generated code. (3) Profile the deserialization: is it CPU-bound (parsing) or memory-bound (allocations)? (4) Check if the messages contain deeply nested Avro unions that require extensive resolution. (5) Compare against a baseline: deserialize 1M messages of the same schema and measure throughput.
Drill 1: Write a Protobuf schema with evolution.
protobuf // v1: initial schema syntax = "proto3"; message UserEvent { int64 user_id = 1; string action = 2; // "login", "purchase", etc. int64 timestamp = 3; // Unix millis } // v2: add optional metadata field message UserEvent { int64 user_id = 1; string action = 2; int64 timestamp = 3; string ip_address = 4; // NEW: optional (proto3 default) map<string, string> metadata = 5; // NEW: flexible key-value } // v3: deprecate a field, add another message UserEvent { int64 user_id = 1; string action = 2; int64 timestamp = 3; reserved 4; // ip_address removed, tag 4 reserved forever reserved "ip_address"; // name also reserved map<string, string> metadata = 5; string session_id = 6; // NEW field gets NEW tag number }
Drill 2: Implement varint encoder/decoder.
python def encode_varint(n): """Encode unsigned integer as Protobuf varint bytes.""" buf = [] while n > 0x7F: buf.append((n & 0x7F) | 0x80) n >>= 7 buf.append(n) return bytes(buf) def decode_varint(data): """Decode varint from bytes. Returns (value, bytes_consumed).""" result = 0 shift = 0 for i, byte in enumerate(data): result |= (byte & 0x7F) << shift if not (byte & 0x80): return result, i + 1 shift += 7 raise ValueError("Varint not terminated") def zigzag_encode(n): return (n << 1) ^ (n >> 63) def zigzag_decode(n): return (n >> 1) ^ -(n & 1) # Test round-trips for v in [0, 1, 127, 128, 300, 1337, 1000000]: encoded = encode_varint(v) decoded, length = decode_varint(encoded) assert decoded == v, f"Failed for {v}" print(f"{v:>10} -> {encoded.hex():12} ({length} bytes)") # Output: # 0 -> 00 (1 bytes) # 1 -> 01 (1 bytes) # 127 -> 7f (1 bytes) # 128 -> 8001 (2 bytes) # 300 -> ac02 (2 bytes) # 1337 -> b90a (2 bytes) # 1000000 -> c0843d (3 bytes)
This is the kind of question a senior interviewer at Google or Meta might ask. You are given raw Protobuf bytes and a schema. Decode the message.
| Dimension | Example question | What a staff answer includes |
|---|---|---|
| Concept | "Explain forward vs backward compatibility" | Concrete rolling deployment scenario, not just definitions. "During a 30-minute deploy, server 3 is v2 but server 5 is still v1..." |
| Design | "Design a data contract system for 50 services" | Schema repo + CI + registry + dead-letter queues + monitoring. Time horizons: this quarter, next generation, hardware transition. |
| Code | "Implement a varint encoder" | Clean code + edge cases (value=0, max int64) + test cases + complexity analysis (O(log n) bytes) |
| Debug | "Deserialization errors after deploy" | Systematic: check schema change → check compatibility mode → check if required field added → check consumer cache → check schema registry replication |
| Frontier | "What's next after Protobuf?" | Cap'n Proto (zero-copy), FlatBuffers (random access without parsing), Proto Editions (Google's next-gen), Apache Arrow (columnar in-memory format) |
Encoding and evolution do not exist in isolation. They interact deeply with every other layer of a data-intensive system. Here is where this chapter connects to the rest of the DDIA landscape.
| Related topic | Connection | Lesson |
|---|---|---|
| Data Models (Ch 3) | The data model determines what you serialize. Relational tables map naturally to Avro schemas. Document models (JSON) map to... JSON. Graph models need custom serialization for edges and vertices. | Data Models & Query Languages |
| Storage & Retrieval (Ch 4) | Storage engines use encoding internally. LSM-trees write SSTables in a binary format. B-trees store fixed-size pages. Column stores use specialized compression (run-length, bitmap, dictionary) that is itself a form of encoding. | Storage & Retrieval |
| Replication (Ch 6) | Replication logs must be encoded in a format that supports schema evolution. A follower running an older version must be able to read the leader's replication stream (forward compatibility). Statement-based replication encodes SQL text. Row-based replication encodes binary tuples. | Replication |
| Partitioning (Ch 7) | Partitioning keys must survive encoding round-trips. If your partition key is a 64-bit integer and your encoding truncates it to 32 bits, records will hash to the wrong partition. | Partitioning |
| Stream Processing (Ch 12) | Kafka + Avro + Schema Registry is the dominant pattern for stream processing. Every concept from this chapter — schema evolution, compatibility modes, schema resolution — is the foundation of a reliable streaming platform. | Stream Processing |
When choosing a format for a new system, walk this tree:
This chapter focused on the fundamental concepts. More advanced topics include:
"The key idea is that old and new code, old and new data, must coexist. Encoding formats are the contracts that make coexistence possible." — Martin Kleppmann, DDIA