Chapter 4 - Designing Data-Intensive Applications
Syan S.P · March 12, 2025
Chapter 4 Encoding and Evolution
Evolvability
- Systems should be designed to easily adapt to change.
- Rolling upgrades allow new and old code and data formats to coexist during transitions.
- Compatibility is essential:
- Backward compatibility: New code can read old data (generally easier).
- Forward compatibility: Old code can read new data (more difficult).
Data Encoding Basics
- Two main representations:
- In-memory data structures optimized for fast CPU access.
- Byte sequences used for transmission or storage (e.g., JSON).
- Translation between these involves:
- Encoding/serialization: converting memory structures to bytes.
- Decoding/deserialization: converting bytes back into memory structures.
- Examples include Java Serializable and Ruby Marshal.
- Problems with these include tight coupling to specific languages, poor cross-language interoperability, versioning difficulties, and inefficiency.
- Widely used for data interchange but often criticized for verbosity or ambiguity.
- Issues include lack of true binary support, ambiguous number handling, and optional schemas.
- More compact and efficient than text-based formats like JSON or XML.
- Examples include BSON (MongoDB), Thrift (Facebook), Protocol Buffers (Google), and Avro (Hadoop).
- Support schema evolution to handle backward and forward compatibility.
Modes of Dataflow
- Data must be encoded as bytes to flow between processes and systems.
Dataflow Through Databases
- Requires both backward and forward compatibility because of rolling upgrades.
- Schema evolution is often handled by formats like Avro.
Dataflow Through Services (REST & RPC)
- REST: HTTP-based web services with clear client-server roles.
- RPC: Remote Procedure Calls make remote calls look like local function calls but face challenges such as network unreliability, idempotency, and cross-language issues.
- Modern RPC frameworks (e.g., Finagle, gRPC) explicitly address the networked nature of calls.
Message-Passing Dataflow
- Asynchronous communication via message brokers or queues.
- Benefits include buffering, decoupling sender and receiver, retries, and multicast.
- Communication patterns are usually asynchronous, meaning send-and-forget.
Message Brokers
- Named queues or topics deliver messages to consumers.
- Support multiple producers and multiple consumers per topic.
Distributed Actor Frameworks
- Actors are independent units that communicate via asynchronous messages.
- Useful for concurrency and scaling across distributed nodes.
- Examples include Akka, Orleans, and Erlang OTP.