Data Models
The long wait is over: it's finally here! No, not The Winds of Winter, but the second edition of Designing Data-Intensive Applications.
Here are my notes on data models: a very brief summary of each, with trade-offs, use cases, and databases designed for them. There's a lot more in the book that I don't cover here, such as historic data models like triple stores, in-depth examples of everything, and data models specifically for OLAP systems, which are outside the scope of my work as a backend engineer.
Relational Model
Data is organized into relations (tables), where each relation is an unordered collection of tuples (rows). Queried using SQL (declarative). The dominant model since the mid-1980s.
Pros:
- Strong support for joins, many-to-one, and many-to-many relationships
- Schema-on-write enforces data consistency
- Mature query optimizer handles complex queries efficiently
- Normalization avoids data duplication and inconsistency
- Secondary indexes allow efficient querying in both directions of a relationship
Cons:
- Impedance mismatch with object-oriented application code (requires ORM or manual translation)
- Shredding tree-structured data across multiple tables creates cumbersome schemas
- Rigid schema makes evolving data format operationally challenging on large tables
- Recursive/variable-length path queries (e.g., graph traversals) are verbose and awkward (WITH RECURSIVE)
Typical use cases:
- OLTP business applications (users, orders, products, payments)
- Data warehousing and business analytics (star/snowflake schemas)
- Any domain with significant many-to-many relationships
Databases: PostgreSQL, MySQL, SQLite, Oracle, SQL Server, CockroachDB, Google Spanner, Amazon Aurora, SingleStore
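To make the recursive-query awkwardness concrete, here's a minimal sketch using SQLite through Python's sqlite3 module. The schema, names, and data are invented for illustration; it walks a management chain upward with `WITH RECURSIVE`, which works but is noticeably more verbose than an equivalent graph traversal.

```python
import sqlite3

# In-memory SQLite database; schema and rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER);
    INSERT INTO employees VALUES
        (1, 'Alice', NULL),  -- top of the hierarchy
        (2, 'Bob',   1),
        (3, 'Carol', 2);
""")

# Recursive CTE: follow manager_id links transitively, starting from Carol.
rows = conn.execute("""
    WITH RECURSIVE chain(id, name, manager_id) AS (
        SELECT id, name, manager_id FROM employees WHERE name = 'Carol'
        UNION ALL
        SELECT e.id, e.name, e.manager_id
        FROM employees e JOIN chain c ON e.id = c.manager_id
    )
    SELECT name FROM chain
""").fetchall()
print([name for (name,) in rows])  # ['Carol', 'Bob', 'Alice']
```

The same traversal in Cypher would be a one-liner along the lines of `(:Employee {name: 'Carol'})-[:REPORTS_TO*]->(m)`.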
Document Model
Data is represented as self-contained JSON (or XML/BSON) documents, typically with a tree structure of nested objects and arrays. Often associated with schema-on-read and denormalization.
Pros:
- Better data locality: all related data in one document, fetched in one read
- Closer to object structures in application code (reduced impedance mismatch)
- Schema flexibility (schema-on-read): easy to evolve data format without migrations
- Natural fit for one-to-many / one-to-few relationships (tree-structured data)
- Good for user-defined ordering (arrays preserve order)
Cons:
- Weak or no support for joins in some implementations (must join in application code)
- Cannot refer directly to nested items within a document (though some implementations, like MongoDB and PostgreSQL `jsonb`, offer partial support)
- Entire document must be rewritten on update (wasteful for frequent small updates)
- Many-to-many and many-to-one relationships are awkward: requires cross-document references
- Denormalization risks inconsistency when duplicated data needs updating
Typical use cases:
- User profiles, résumés, product catalogs (tree-structured, loaded as a whole)
- Content management systems
- Event/activity logs
- Any domain where data is heterogeneous or externally determined
Databases: MongoDB, Couchbase, Amazon DocumentDB, RethinkDB, CouchDB. Also: JSON column support in PostgreSQL, MySQL, SQLite, Oracle
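A minimal sketch of locality and schema-on-read, using the résumé example from the book but with invented field names. The whole tree is fetched in one read, and the application interprets the structure at read time, tolerating documents written before a field existed:

```python
import json

# A self-contained document: the whole résumé lives in one tree.
doc = json.loads("""
{
  "user_id": 251,
  "name": "Barack Obama",
  "positions": [
    {"title": "President", "org": "United States of America"},
    {"title": "Senator",   "org": "US Senate"}
  ]
}
""")

# Schema-on-read: older documents may simply lack newer fields.
region = doc.get("region", "unknown")
titles = [p["title"] for p in doc["positions"]]
print(titles)   # ['President', 'Senator']
```

Note the flip side: if "United States of America" is ever renamed, every document that duplicates the string must be found and updated, which is the denormalization risk listed above.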
Property Graph Model
Data is modeled as vertices (nodes) and edges (relationships), each with a label and key-value properties. Any vertex can connect to any other vertex. Queried with languages like Cypher and GQL.
Pros:
- Extremely flexible: heterogeneous data (people, locations, events) in one graph
- Efficient traversal of relationships (follow edges forward and backward)
- Variable-length path queries are natural and concise (e.g., `:WITHIN*0..`)
- Great evolvability: easy to extend with new vertex/edge types as requirements change
- Multiple relationship types coexist in the same structure
Cons:
- Edges connect only two vertices (no native higher-degree relationships without workarounds)
- Less mature tooling and ecosystem compared to relational databases
- Can be harder to reason about performance and indexing for complex queries
- Not well suited for tabular/analytical workloads
Typical use cases:
- Social networks (people know people)
- Knowledge graphs and search engines (entities and their relationships)
- Fraud detection (tracing chains of transactions)
- Network/infrastructure topology
- Genealogical data, recommendation engines
- Any domain where anything is potentially related to everything
Databases: Neo4j, Memgraph, KùzuDB, Amazon Neptune, TigerGraph, Apache AGE (on PostgreSQL)
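To show what the model (not a particular database) looks like, here's a toy property graph in plain Python data structures; the vertices, edges, and helper function are my own invention. The traversal is the moral equivalent of a Cypher variable-length path like `-[:WITHIN*0..]->`:

```python
from collections import deque

# Vertices and edges each carry a label and key-value properties.
vertices = {
    1: {"label": "Person",  "props": {"name": "Lucy"}},
    2: {"label": "City",    "props": {"name": "London"}},
    3: {"label": "Country", "props": {"name": "United Kingdom"}},
}
edges = [
    {"label": "BORN_IN", "tail": 1, "head": 2},
    {"label": "WITHIN",  "tail": 2, "head": 3},
]

def reachable(start, edge_label):
    """Follow edges with the given label transitively (breadth-first),
    including the start vertex itself -- like a *0.. path pattern."""
    seen, queue = {start}, deque([start])
    while queue:
        v = queue.popleft()
        for e in edges:
            if e["label"] == edge_label and e["tail"] == v and e["head"] not in seen:
                seen.add(e["head"])
                queue.append(e["head"])
    return seen

# Everything London is WITHIN, at any depth.
print({vertices[v]["props"]["name"] for v in reachable(2, "WITHIN")})
```

A real graph database adds indexes on labels/properties and traverses edges in both directions without scanning the full edge list, which is exactly what this naive sketch cannot do efficiently.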
Event Sourcing (with CQRS)
Data is written as an append-only log of immutable events (the source of truth). Read-optimized materialized views are derived from the event log. Write model and read models are separated (CQRS).
Pros:
- Events communicate intent clearly ("booking was canceled" vs. cryptic row updates)
- Materialized views are reproducible: delete and recompute from the same events
- Multiple views optimized for different query patterns, using any data model
- Easy to add new features by chaining behaviors off existing events
- Errors can be corrected by appending deletion/compensation events
- Built-in audit log
- High write throughput (sequential append)
Cons:
- External/non-deterministic data (e.g., exchange rates) must be captured at write time for reproducibility
- GDPR deletion of personal data in immutable logs is complex (requires crypto-shredding or per-user logs)
- Reprocessing events with external side effects (e.g., sending emails) requires care
- Additional complexity of maintaining derived views and ensuring they process events in order
- Eventual consistency between the event log and materialized views
Typical use cases:
- Complex business domains (conference management, e-commerce orders, financial transactions)
- Systems requiring full audit trails (regulated industries)
- Domains where requirements evolve frequently and new read patterns emerge
- High-write-throughput systems with bursty load
Databases/Frameworks: EventStoreDB, MartenDB (on PostgreSQL), Axon Framework, Apache Kafka (as event log), Datomic
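The core idea fits in a few lines. Below is a minimal sketch (event names and fields are invented, and a real system would persist the log durably): an append-only list of immutable events is the source of truth, and a read-optimized view is derived from it.

```python
# Append-only log of immutable events: the source of truth.
event_log = []

def append(event):
    event_log.append(event)  # writes are sequential appends

append({"type": "SeatBooked", "booking_id": "b1", "seat": "12A"})
append({"type": "SeatBooked", "booking_id": "b2", "seat": "12B"})
append({"type": "BookingCancelled", "booking_id": "b1"})  # intent is explicit

def seats_view(events):
    """Materialized view of current seat assignments. Reproducible:
    delete it and recompute from the same events at any time."""
    view = {}
    for e in events:
        if e["type"] == "SeatBooked":
            view[e["booking_id"]] = e["seat"]
        elif e["type"] == "BookingCancelled":
            view.pop(e["booking_id"], None)
    return view

print(seats_view(event_log))  # {'b2': '12B'}
```

Adding a new read pattern (say, a per-seat history) means writing another view function over the same log, without touching the write path, which is the CQRS separation in miniature.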