Chapter 4 - Designing Data-Intensive Applications

Syan S.P · March 12, 2025

Chapter 4 Encoding and Evolution

Evolvability

  • Systems should be designed to easily adapt to change.
  • Rolling upgrades allow new and old code and data formats to coexist during transitions.
  • Compatibility is essential:
    • Backward compatibility: New code can read old data (generally easier).
    • Forward compatibility: Old code can read new data (more difficult).

Data Encoding Basics

  • Two main representations:
    • In-memory data structures optimized for fast CPU access.
    • Byte sequences used for transmission or storage (e.g., JSON).
  • Translation between these involves:
    • Encoding/serialization: converting memory structures to bytes.
    • Decoding/deserialization: converting bytes back into memory structures.

Language-Specific Formats

  • Examples include Java Serializable and Ruby Marshal.
  • Problems with these include tight coupling to specific languages, poor cross-language interoperability, versioning difficulties, and inefficiency.

Standard Formats: JSON, XML, CSV

  • Widely used for data interchange but often criticized for verbosity or ambiguity.
  • Issues include lack of true binary support, ambiguous number handling, and optional schemas.

Binary Encoding Formats

  • More compact and efficient than text-based formats like JSON or XML.
  • Examples include BSON (MongoDB), Thrift (Facebook), Protocol Buffers (Google), and Avro (Hadoop).
  • Support schema evolution to handle backward and forward compatibility.

Modes of Dataflow

  • Data must be encoded as bytes to flow between processes and systems.

Dataflow Through Databases

  • Requires both backward and forward compatibility because of rolling upgrades.
  • Schema evolution is often handled by formats like Avro.

Dataflow Through Services (REST & RPC)

  • REST: HTTP-based web services with clear client-server roles.
  • RPC: Remote Procedure Calls make remote calls look like local function calls but face challenges such as network unreliability, idempotency, and cross-language issues.
  • Modern RPC frameworks (e.g., Finagle, gRPC) explicitly address the networked nature of calls.

Message-Passing Dataflow

  • Asynchronous communication via message brokers or queues.
  • Benefits include buffering, decoupling sender and receiver, retries, and multicast.
  • Communication patterns are usually asynchronous, meaning send-and-forget.

Message Brokers

  • Named queues or topics deliver messages to consumers.
  • Support multiple producers and multiple consumers per topic.

Distributed Actor Frameworks

  • Actors are independent units that communicate via asynchronous messages.
  • Useful for concurrency and scaling across distributed nodes.
  • Examples include Akka, Orleans, and Erlang OTP.

Twitter, Facebook