Pattern · Reference

Database Selection

Every design hinges on a storage decision — start with access patterns, not database popularity.

Appears in · Every system design always discuss storage trade-offs

i. The Decision Framework

Ask these questions in order:

What is the access pattern? Key-value lookup? Range scan? Full-text search? Graph traversal?
What are the consistency requirements? Strong consistency or eventual OK?
What is the scale? Single node? Horizontal? Multi-region?
What is the read/write ratio? Read-heavy → indexes + replicas. Write-heavy → LSM, partitioning.
Do you need transactions? Multi-row ACID? Cross-shard?

ii. Postgres (SQL / Relational)

Best for: Structured data with relationships, ACID transactions, complex queries, moderate scale.

Strength	Detail
ACID transactions	Safe for financial data, inventory, anything requiring atomicity
Flexible querying	JOINs, aggregations, subqueries — no schema pre-planning needed
Mature ecosystem	pgvector, PostGIS, JSON support, rich extension library
Read replicas	Streaming replication for read scale-out

Weaknesses: Horizontal write scaling requires sharding (adds complexity). Schema migrations at scale are painful.

iii. Cassandra (Wide-Column / NoSQL)

Best for: Write-heavy, globally distributed, time-series, pure key-value at massive scale.

Strength	Detail
Horizontal scaling	Add nodes linearly; no primary bottleneck
Write throughput	LSM-tree writes to memory (MemTable) first → very fast
Multi-region	Built-in multi-master replication across DCs
Tunable consistency	Per-query: ONE, QUORUM, ALL
Time-series native	Wide rows model fits (user_id, timestamp) patterns

Weaknesses: No JOINs — must denormalize; one table per query pattern. No multi-row transactions. Data model design is query-driven — harder to evolve. Eventual consistency by default.

When to choose Cassandra: Write QPS > 10K/s sustained. Data fits partition key → clustering key → value. Multi-region is a hard requirement. You can live with eventual consistency.

URL Shortener fit — short_code as partition key, pure key-value → a perfect Cassandra use case.

iv. Redis (In-Memory / Cache + Data Structure Store)

Best for: Caching, sessions, pub/sub, leaderboards, counters, queues.

Strength	Detail
Sub-millisecond latency	All operations in-memory
Rich data structures	Sorted sets, streams, hyperloglogs, bloom filters
Atomic operations	INCR, LPUSH, etc. — race-condition-free counters
TTL support	Native expiry on any key

Weaknesses: Memory-limited. Persistence optional — not a true DB replacement. Single-threaded command execution.

Use as: Cache layer in front of any DB, session store, rate-limit counter, distributed lock (Redlock), real-time leaderboard.

v. DynamoDB (Managed Key-Value / Document)

Best for: Serverless, AWS-native, unpredictable traffic (auto-scaling), simple access patterns.

Strength	Detail
Fully managed	No ops burden
Auto-scaling	Handles traffic spikes automatically
DynamoDB Streams	CDC for downstream consumers
Global Tables	Multi-region replication

Weaknesses: Expensive at high sustained throughput. Query flexibility limited to partition + sort key. Vendor lock-in.

vi. Search & OLAP Stores

Elasticsearch / OpenSearch

Best for: Full-text search, log analytics, faceted search.

Inverted index → fast text search across millions of documents
Not a primary DB — typically used alongside Postgres/Cassandra as a search index
Eventual consistency; not ACID
Use for: product search, log/event analytics, autocomplete

ClickHouse / OLAP Stores

Best for: Analytics queries over billions of rows (aggregations, GROUP BY, time-series analytics).

Columnar storage → reads only needed columns
Excellent for: "count clicks per URL per day", "top 10 URLs by country"
Not for transactional workloads
Write in bulk (batch insert); real-time inserts degrade performance

vii. Sharding Strategies

Range Sharding

Shard 1: user_id 0–1M
Shard 2: user_id 1M–2M
...

Pros: Simple; range queries stay on one shard
Cons: Hot spots (newest users get all traffic)

Hash Sharding

shard = hash(user_id) % num_shards

Pros: Even distribution
Cons: Range queries span all shards; resharding requires data migration

Consistent Hashing

Virtual nodes on a ring → each node owns a range
Add/remove nodes: only adjacent ranges affected
Use when: Cache clusters, distributed KV stores, Cassandra (uses this internally)

Directory-Based Sharding

Lookup table maps entity → shard
Pros: Flexible; can move data without re-hashing
Cons: Lookup table is a bottleneck/SPOF if not cached

viii. Replication Patterns

Pattern	How	Use case
Leader-Follower	Writes to leader, replicas sync async	Most common; reads scale-out; leader is write bottleneck
Multi-Leader	Multiple write nodes	Multi-region writes; conflict resolution complexity
Leaderless (Quorum)	Write to W nodes, read from R nodes; W+R > N	Cassandra, DynamoDB; no SPOF; tunable consistency

ix. Quick Decision Table

Scenario	Pick
Financial transactions, complex queries	Postgres
High-write, time-series, multi-region	Cassandra
Caching, sessions, real-time counters	Redis
Full-text search	Elasticsearch
Analytics over billions of rows	ClickHouse
Serverless / AWS / variable traffic	DynamoDB
Graph relationships	Neo4j

x. Key Points

Never just say "I'll use MySQL" or "I'll use NoSQL" without justification
Walk through the framework: access pattern → consistency → scale → transactions
Mention that you'd start with Postgres and only move to Cassandra when scale demands it
Discuss sharding only when the data outgrows a single node — don't prematurely optimize
Interviewers love hearing: "I'd add a Redis cache in front of the DB for the read path"

❦