Vol. I  ·  Issue 01  ·  MMXXVI
Issue 01 · Design

URL Shortener

A read-heavy redirect at planet scale, designed around a single elegant constraint: never make the user wait.

Read / write ratio: 100:1
Peak QPS (reads): ~360K (avg ~120K)
Latency target: < 10 ms p99
Storage / 10 yr: ~180 TB
Encoding: base62 · 7-char

i. Requirements

Functional

Non-Functional

Out of Scope

ii. Capacity Estimates

Parameter | Value
New URLs per day | 100 million
New URLs per second (write QPS, avg) | ~1,200
New URLs per second (write QPS, peak ≈ 3×) | ~3,500
Redirect QPS (100× reads, avg) | ~120,000
Redirect QPS (peak) | ~360,000
URL record size | ~500 bytes (long URL + metadata)
Storage for 10 years | 100M × 365 × 10 × 500 B ≈ 180 TB
Cache (top 20% → 80% traffic) | hot set fits in Redis (~1 TB/day active)
Read bandwidth at peak (360K × ~600 B request + response) | ~200 MB/s ≈ 1.6 Gbps egress per redirect tier
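
The estimates above can be reproduced with a few lines of arithmetic. This is a sketch that uses only the inputs stated in the table; the variable names are illustrative, and the unrounded results land slightly below the rounded table figures (for example ~347K peak reads rather than ~360K).

```python
SECONDS_PER_DAY = 86_400

# Inputs taken from the capacity table.
new_urls_per_day = 100_000_000
read_write_ratio = 100
peak_factor = 3
record_bytes = 500
retention_years = 10
bytes_per_redirect = 600  # request + response

write_qps_avg = new_urls_per_day / SECONDS_PER_DAY
write_qps_peak = write_qps_avg * peak_factor
read_qps_avg = write_qps_avg * read_write_ratio
read_qps_peak = read_qps_avg * peak_factor

# Decimal units: 1 TB = 1e12 bytes, 1 MB = 1e6 bytes.
storage_tb = new_urls_per_day * 365 * retention_years * record_bytes / 1e12
peak_egress_mb_s = read_qps_peak * bytes_per_redirect / 1e6
peak_egress_gbps = peak_egress_mb_s * 8 / 1000

print(f"write QPS avg/peak: {write_qps_avg:,.0f} / {write_qps_peak:,.0f}")
print(f"read QPS avg/peak:  {read_qps_avg:,.0f} / {read_qps_peak:,.0f}")
print(f"storage, 10 years:  {storage_tb:.1f} TB")
print(f"peak egress:        {peak_egress_mb_s:.0f} MB/s ({peak_egress_gbps:.2f} Gbps)")
```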

Per-tier sizing — back of the envelope

At staff level, the interesting math is not the headline QPS — it is how that QPS distributes across tiers once cache hits absorb the bulk.

Tier | Load at peak | Sizing
Edge / CDN | ~70% of redirects (popular links) | Anycast PoPs; no origin capacity needed for hits
Redirect service (app) | ~110K QPS post-CDN | ~40–60 pods at ~2K QPS each (Go/Rust); CPU-bound on TLS + JSON
Redis cluster | ~99K QPS served from cache (90% hit rate on ~110K) | 6–10 shards × replica, ~50K ops/sec/node headroom
DB read (post-cache) | ~10K QPS (~10% miss × 110K) | 3–6 Cassandra nodes per region @ RF=3, or Postgres + 5–10 read replicas
Write service | ~3.5K QPS peak | 10–20 pods; bottleneck is DB write, not app
Token range broker (ZK/etcd) | ~1 claim per 1M IDs / server | 3 or 5-node ensemble; load is trivial, sized only for HA quorum

iii. High-Level Design

Client
  │
  ▼
[Geo-DNS / Anycast]  ──▶  [CDN Edge PoP]  ──hit──▶  302
  │ miss
  ▼
[Regional Load Balancer]
  │                     │
  ▼                     ▼
[Write Service]      [Redirect Service]  ──▶  [In-process LRU (L2)]
  │                     │                        │ miss
  │                     ▼                        ▼
[ZK/etcd range]      [Redis Cluster (L3)]  ──miss──▶  [DB Read Replica]
  │                     ▲                                      │
  ▼                     │ async fill / invalidate              │
[DB Primary]  ──────────┴──────────────────────────────────────┘
  │
  ▼ async
[Kafka]  ──▶  [Analytics OLAP]   |   [Replication → other regions]

Four caching layers sit between a redirect request and the database: browser cache (when using 301), CDN edge, in-process LRU on the redirect pod, then Redis. By the time a request reaches the database, the three server-side caches (CDN, LRU, Redis) have each missed, which is itself a useful signal: the URL is either cold or non-existent.

Write path

  1. Client POSTs long URL
  2. Write service calls token generator → gets a unique short code
  3. Writes {short_code → long_url, created_at, expiry} to DB primary
  4. Optionally pre-warms cache
  5. Returns short URL to client

Read path (redirect)

  1. Client GETs tinyurl.com/abc123
  2. Redirect service checks its in-process LRU first, then the Redis cache
  3. Cache hit → 301/302 redirect immediately
  4. Cache miss → query DB read replica → cache the result → redirect
Cache is not a database with worse durability. It is a different shape of correctness.
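
The read path can be sketched as a cache-aside lookup. This is a minimal illustration, not the service's actual code: `LocalLRU` stands in for a Caffeine/Ristretto-style in-process cache, plain dicts stand in for Redis and the DB read replica, and `resolve` is a hypothetical name.

```python
from collections import OrderedDict
from typing import Dict, Optional, Tuple


class LocalLRU:
    """Tiny in-process LRU; a stand-in for a production cache library."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._items: "OrderedDict[str, str]" = OrderedDict()

    def get(self, key: str) -> Optional[str]:
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key: str, value: str) -> None:
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict least recently used


def resolve(code: str, local: LocalLRU, shared: Dict[str, str],
            db: Dict[str, str]) -> Tuple[int, Optional[str]]:
    """Return (status, location): local cache, then shared cache, then DB."""
    url = local.get(code)
    if url is None:
        url = shared.get(code)
        if url is None:
            url = db.get(code)
            if url is None:
                return 404, None
            shared[code] = url  # fill the shared cache on a DB hit
        local.put(code, url)    # fill the local cache on any lower-level hit
    return 302, url
```

Each level is filled on the way back up, so the second request for the same code is served without touching the DB.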

iv. Key Design Decisions

Short Code Length

Token Generation Strategy

See patterns/token-generation for all approaches. For a URL shortener:

Chosen: Zookeeper-coordinated counter ranges
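
A sketch of how counter ranges and base62 fit together, under stated assumptions: `RangeBroker` is an in-memory stand-in for the ZK/etcd counter, and `TokenGenerator`, the alphabet ordering, and the zero-padding to 7 characters are illustrative choices rather than details from the design. For scale, 62^7 is about 3.5 trillion codes, against roughly 365 billion URLs over ten years at 100M per day.

```python
import string
import threading

ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase  # 62 symbols
CODE_LEN = 7


def encode_base62(n: int, width: int = CODE_LEN) -> str:
    """Encode a non-negative integer as a fixed-width base62 string."""
    if n < 0 or n >= 62 ** width:
        raise ValueError("id out of range for the code width")
    chars = []
    for _ in range(width):
        n, rem = divmod(n, 62)
        chars.append(ALPHABET[rem])
    return "".join(reversed(chars))


class RangeBroker:
    """In-memory stand-in for the coordinator; hands out disjoint ID ranges."""

    def __init__(self, range_size: int = 1_000_000) -> None:
        self._next = 0
        self._size = range_size
        self._lock = threading.Lock()

    def claim(self) -> range:
        with self._lock:
            start = self._next
            self._next += self._size
        return range(start, start + self._size)


class TokenGenerator:
    """Per-server generator: one broker call per range, none per write."""

    def __init__(self, broker: RangeBroker) -> None:
        self._broker = broker
        self._ids = iter(broker.claim())
        self._lock = threading.Lock()

    def next_code(self) -> str:
        with self._lock:
            try:
                n = next(self._ids)
            except StopIteration:
                # Range exhausted: claim a fresh one and continue.
                self._ids = iter(self._broker.claim())
                n = next(self._ids)
        return encode_base62(n)
```

Because ranges are disjoint, two servers can never emit the same code, and the coordinator is contacted only once per range.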

Why not MD5/SHA hash?

Hash of URL → take first 7 chars → collision risk + not human-stable. Same URL hashed twice = same code (good for dedup but complicates custom aliases).
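
The collision risk can be quantified with the birthday approximation. The sketch below assumes the truncated hash behaves like a uniform draw from the 62^7 code space; the function name is illustrative.

```python
def expected_colliding_pairs(n_urls: int, code_space: int) -> float:
    """Birthday approximation: about n(n-1) / (2N) colliding pairs."""
    return n_urls * (n_urls - 1) / (2 * code_space)


CODE_SPACE = 62 ** 7  # 7-character base62 codes

# One day of writes at 100M new URLs per day.
one_day = expected_colliding_pairs(100_000_000, CODE_SPACE)
print(f"expected colliding pairs after one day: {one_day:,.0f}")
```

Under these assumptions the first day alone yields on the order of a thousand colliding pairs, so a truncated hash would need a check-and-retry step on every write.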

Why not UUID?

128 bits → need to shorten anyway; introduces randomness we don't need.

301 vs 302 Redirect

Property | 301 Permanent | 302 Temporary
Browser caches? | Yes | No
Analytics possible? | No (browser skips server) | Yes (every redirect hits server)
Server load | Lower | Higher
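
In terms of response headers, the trade-off looks roughly like this. It is a sketch: the function name and the default TTL are assumptions, and the `Cache-Control` values are one reasonable choice rather than the only one.

```python
from typing import Dict, Tuple


def redirect_response(long_url: str, permanent: bool,
                      browser_ttl_s: int = 86_400) -> Tuple[int, Dict[str, str]]:
    """Build the status code and headers for a redirect."""
    if permanent:
        # 301: browsers may cache it, so repeat clicks never reach the server.
        cache_control = f"public, max-age={browser_ttl_s}"
        return 301, {"Location": long_url, "Cache-Control": cache_control}
    # 302: forbid caching so every click reaches the server and is counted.
    return 302, {"Location": long_url, "Cache-Control": "no-store"}
```

If the CDN should still cache a 302, a shared-cache directive such as `s-maxage` could replace `no-store`, at the cost of counting those clicks at the edge instead of at the origin.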

Database Choice

See patterns/database-selection for the full decision framework. The access pattern here is almost the textbook case for a wide-column store: a single equality lookup on a high-cardinality key, no joins, no range scans on the hot path, and a write that never updates an existing row.

Schema — Cassandra

CREATE TABLE urls (
  short_code   text PRIMARY KEY,
  long_url     text,
  user_id      bigint,
  created_at   timestamp,
  expires_at   timestamp,
  is_custom    boolean,
  is_disabled  boolean   -- soft-delete tombstone for abuse takedowns
) WITH compaction = {'class':'LeveledCompactionStrategy'}
  AND default_time_to_live = 0
  AND gc_grace_seconds = 864000;

short_code is the partition key — every redirect is a single-partition read, the cheapest operation Cassandra offers. Leveled compaction is chosen over Size-Tiered because reads are random and we want bounded read amplification (≤ ~10 SSTables touched per query in steady state).

Replication & consistency

Secondary access — queries by user_id

Cassandra's golden rule: one table per query pattern. Do not use a secondary index on user_id at scale — it scatter-gathers across nodes and degrades non-linearly. Instead, denormalize:

CREATE TABLE urls_by_user (
  user_id     bigint,
  created_at  timestamp,
  short_code  text,
  long_url    text,
  PRIMARY KEY ((user_id), created_at, short_code)
) WITH CLUSTERING ORDER BY (created_at DESC);

Both tables are written in the write path. The cost is ~2× write amplification, which is fine — writes are 1% of traffic. The benefit is that "list my links" becomes a single-partition range read.

Postgres alternative — when it's enough

Caching

See patterns/caching for full strategies. The right framing for this system is not "add a cache" but "build a four-layer cache hierarchy where each layer absorbs an order of magnitude more traffic than the next."

Layer | Where | Target hit rate | Latency
L0: Browser | Client (via 301 + Cache-Control) | Variable; depends on user behavior | 0 ms (no network)
L1: CDN edge | CloudFront / Fastly / Cloudflare PoP | 60–80% of redirects | 5–20 ms
L2: In-process LRU | Redirect pod heap (Caffeine / Ristretto) | ~50% of post-CDN traffic | < 0.01 ms
L3: Redis cluster | Regional Redis | ~90% of post-LRU traffic | ~0.5 ms
Origin: DB | Cassandra / Postgres | ≤ 1% of original request volume | 1–5 ms
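
To see how the hit rates compound, the sketch below pushes the peak redirect rate through the server-side layers. The specific rates (70% CDN, 50% LRU, 90% Redis) are picked from within the table's ranges as assumptions, and `cascade` is an illustrative helper.

```python
from typing import Dict


def cascade(peak_qps: float, hit_rates: Dict[str, float]) -> Dict[str, float]:
    """Return the request rate arriving at each layer, then at the DB."""
    remaining = peak_qps
    arriving = {}
    for layer, hit_rate in hit_rates.items():
        arriving[layer] = remaining
        remaining *= 1.0 - hit_rate  # only misses continue downstream
    arriving["db"] = remaining
    return arriving


load = cascade(360_000, {"cdn": 0.70, "lru": 0.50, "redis": 0.90})
for layer, qps in load.items():
    print(f"{layer:>5}: {qps:>10,.0f} QPS")
```

With these rates about 5.4K QPS reaches the DB, roughly 1.5% of the peak; at an 80% CDN hit rate the same cascade gives 3.6K QPS, which is the 1% figure in the table.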

v. Deep Dives

Custom Aliases

URL Expiration

Analytics (if in scope)

vi. Bottlenecks by Tier

Every system has a binding constraint. The interesting question at staff level is not "is this fast?" but "which tier saturates first, and what does it take to lift that ceiling?" Below is the redirect path ceiling at each layer with realistic single-instance numbers, and what you do when you hit them.

Tier | Practical ceiling | Binding resource | Lift
DNS / GSLB | essentially unbounded | none (delegated) | Use anycast + edge resolvers; never homegrown
L7 load balancer (single) | ~100K RPS, ~10 Gbps | CPU on TLS termination, conntrack | Horizontal scale; offload TLS to dedicated tier; use HTTP/2 multiplexing
Redirect pod | ~2–5K RPS (Go/Rust), ~500 RPS (Node/Python) | CPU; allocations on hot path | Add pods; zero-allocation request path; HPA on CPU + p99 latency
In-process LRU (L2) | millions of ops/sec/pod | none (heap-bound) | Size to ~100K hottest keys; Caffeine/Ristretto, never map[string]string
Redis node | ~80–100K ops/sec | Single-threaded command loop | Shard by short_code hash; add replicas for reads; pipeline batched lookups
Cassandra node | ~10–30K reads/sec; ~30–50K writes/sec | Disk IOPS on reads; compaction on writes | Add nodes (linear); LCS for read-heavy; SSDs/NVMe always
Postgres primary (single) | ~5–10K writes/sec, ~20–40K reads/sec | WAL fsync; lock contention | Read replicas; logical sharding by short_code prefix; eventually migrate
ZK / etcd range broker | ~1 RPS (claims are rare) | none | 3 or 5-node ensemble; pre-claim ranges on pod startup
Cross-region link | RTT, not bandwidth | ~150 ms RTT | Region-local reads; async replication; never call across regions on hot path

vii. Hot Keys & Viral URLs

A Super Bowl ad goes live with tinyurl.com/sb-ad. Within 30 seconds that one short code accounts for 10M hits/min — 90% of total system traffic on a single key. This is the classic hot key problem, and it is the single failure mode most likely to bring a URL shortener down in production.

Why it hurts

Mitigations — layered

  1. CDN absorbs the brunt. A correctly configured CDN sees one origin pull per PoP per TTL window. At ~300 PoPs and a 5-minute TTL, the origin sees ~3,600 pulls/hour for a hot URL regardless of how viral it gets
  2. In-process LRU with long TTL. A 100K-entry LRU per pod means a viral key sits permanently in memory after the first request. Use Caffeine (Java), Ristretto (Go), or moka (Rust) — admission-based caches resist scan pollution
  3. Key splitting (hot key sharding). Maintain N replicas of the hot key in Redis (url:abc:0 through url:abc:9) and route requests by request hash. Spreads single-key load across N shards. Implement only after detection, since most keys don't need this
  4. Read from any replica. In Cassandra, LOCAL_ONE reads can land on any of the RF=3 replicas — three nodes serve the single hot partition concurrently. Add replicas (RF=5) for known-viral campaigns
  5. Hot key detection. Sample 1% of requests, count by short_code in a sliding window (count-min sketch in Redis or in-memory). When a code exceeds a threshold (e.g., 1K QPS to origin), auto-promote it to in-process LRU with infinite TTL and increase CDN TTL via API
  6. Pre-warming. For known campaigns (Super Bowl, product launch), the customer can request pre-warming — the system pushes the entry into every pod's L2 and Redis before the campaign starts
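
Detection and key splitting from the list above can be sketched together. Two simplifications are assumed here: an exact `Counter` replaces the count-min sketch (fine for a single window on one pod, but it does not bound memory the way a sketch does), and `HotKeyDetector` and `split_key` are hypothetical names.

```python
import random
import zlib
from collections import Counter
from typing import Optional, Set


class HotKeyDetector:
    """Counts a sample of requests per short code within one window."""

    def __init__(self, sample_rate: float, hot_qps: float, window_s: float,
                 rng: Optional[random.Random] = None) -> None:
        self.sample_rate = sample_rate
        self.hot_qps = hot_qps
        self.window_s = window_s
        self._counts: Counter = Counter()
        self._rng = rng or random.Random()

    def observe(self, code: str) -> None:
        # Record only a fraction of requests to keep the hot path cheap.
        if self._rng.random() < self.sample_rate:
            self._counts[code] += 1

    def hot_keys(self) -> Set[str]:
        # Scale sampled counts back up to an estimated request rate.
        scale = 1.0 / (self.sample_rate * self.window_s)
        return {c for c, n in self._counts.items() if n * scale >= self.hot_qps}

    def reset(self) -> None:
        """Start a new window."""
        self._counts.clear()


def split_key(code: str, request_id: str, n_splits: int = 10) -> str:
    """Pick one of N Redis copies of a hot key using a stable request hash."""
    shard = zlib.crc32(request_id.encode()) % n_splits
    return f"url:{code}:{shard}"
```

A pod would call `observe` on each request, check `hot_keys()` at the end of every window, and switch to `split_key` lookups only for the codes it returns.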

viii. Multi-Region Architecture

A global URL shortener cannot serve every redirect from one region: a request from Tokyo to a US-East origin pays ~150 ms RTT before the first byte. The interesting design question is what stays region-local and what must be globally coordinated.

Routing

Write replication

Token range coordination

RPO & RTO

Failure scenario | RPO | RTO
AZ failure within region | 0 (LOCAL_QUORUM survives) | ~30 s (LB health check + retry)
Full region failure | ~1 s (async replication lag) | ~5 min (DNS failover) or instant (anycast)
Cassandra cluster corruption (logical) | up to backup interval | hours (restore + replay)

ix. Failure Modes & Mitigations

The four-row failure table of a junior design is replaced here with a systematic walk through what breaks in production, why, and what the mitigation looks like. Failures are grouped by where they originate.

Infrastructure failures

Failure | Blast radius | Mitigation
Single redirect pod OOM / crash | One pod's in-flight requests fail | Multiple replicas behind LB; readiness probes; circuit breakers in clients; pod disruption budget > 1
Single Redis shard down | ~1/N of cached keys evaporate; reads fall through to DB | Redis Cluster with replicas (1 primary + 1 replica per shard); auto-failover via Sentinel/Cluster gossip; DB must absorb the surge (see "cache miss storm" below)
Entire Redis cluster down | All L3 traffic falls through to DB (~10× expected DB load) | L2 in-process LRU continues to absorb the hottest keys; DB has 2× headroom; concurrency limiter on app→DB connections drops excess requests with 503 rather than queuing
DB primary down (Postgres) | Writes fail; reads on replicas still work | Patroni / Stolon orchestrated failover; promote replica in ~30 s; client-side connection retries with exponential backoff
Cassandra node down | One of three replicas unavailable for that partition | QUORUM still satisfied with 2/3; node replacement within 24h before hint expiry; repair on rejoin
Full AZ outage | 1/3 of capacity offline | Multi-AZ deployment with 50% spare capacity per zone; cross-AZ LB routing
Full region outage | All requests in that region | Anycast / DNS failover to another region; region must be sized for ~1.5× steady state to absorb the spillover
ZK / etcd quorum loss | New token range claims fail | Each pod pre-claims a 1M-ID range at startup → at least ~14 minutes of write headroom (1M IDs ÷ 1.2K QPS, even if a single pod took every write) before any pod needs to re-claim; ZK quorum should be restored long before that
CDN partial outage | Cache miss surge to origin (3–10× normal) | Multi-CDN strategy (Fastly + CloudFront active-active by DNS); origin shield layer between CDN and origin to dedupe surges

Data-path pathologies

Failure | Symptom | Mitigation
Cache stampede on TTL expiry | 10K simultaneous misses on the same key → 10K DB reads | Singleflight per pod; Redis-side mutex via SET NX; stale-while-revalidate at CDN; jittered TTL
Hot key saturating one Redis shard | One CPU core at 100%, p99 spikes on that shard | L2 in-process LRU; key splitting (replicate hot key N ways); auto-detect via sampled counters
Cache miss storm post-deploy | Fresh pods have empty L2 → temporary surge to L3 and DB | Rolling deploy with low surge (10% at a time); request shadowing to warm new pods before they take traffic; readiness probe gates on cache warm
Replication lag spike (Postgres) | Read-replica serves stale data; "I just created it but get 404" | Route the immediate post-write read to the primary for N seconds (read-your-writes); monitor replication lag and remove replica from LB pool when lag > threshold
Cassandra read amplification on SSTable buildup | p99 read latency climbs with compaction backlog | Monitor pending compactions; throttle writes during incidents; increase compaction throughput; size disk for 2× working set
Token range exhaustion at one pod | Pod blocks new writes mid-range | Claim next range proactively at 80% consumed, not 100%; overlap the claim with continued writes
Clock skew between pods (if Snowflake used) | Out-of-order or colliding IDs | NTP enforcement; refuse to issue IDs when clock drift > 50 ms; counter-range strategy avoids this entirely (it's deterministic)
Hash collision (hash-based generation only) | Wrong long URL returned (silent correctness bug) | Don't use truncated hashes for primary generation; if dedup is wanted, store and check, never trust the hash alone
Poison row (single corrupted record) | Repeated retries amplify load | Per-key circuit breaker; quarantine the row; emit alert with short_code so operations can investigate
Disk full on Cassandra node | Writes fail; node may enter read-only or crash | Capacity planning at 50% disk utilization steady state; alerts at 70%; archival of expired URLs to cold storage
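
The first row's per-pod singleflight and jittered TTL can be sketched as follows. This is an illustration using threads; the class and function names are assumptions, and a production pod would more likely use its language's existing singleflight utility.

```python
import random
import threading
from typing import Callable, Dict, Optional


class _Call:
    """State shared between the leader and the waiters for one key."""

    def __init__(self) -> None:
        self.done = threading.Event()
        self.result: Optional[str] = None
        self.error: Optional[BaseException] = None


class SingleFlight:
    """Collapses concurrent loads of the same key into one call per process."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._inflight: Dict[str, _Call] = {}

    def do(self, key: str, loader: Callable[[], str]) -> str:
        with self._lock:
            call = self._inflight.get(key)
            leader = call is None
            if leader:
                call = _Call()
                self._inflight[key] = call
        if leader:
            try:
                call.result = loader()  # the only DB read for this key
            except Exception as exc:
                call.error = exc
            finally:
                with self._lock:
                    del self._inflight[key]
                call.done.set()  # wake every waiter
        else:
            call.done.wait()
        if call.error is not None:
            raise call.error
        return call.result


def jittered_ttl(base_s: float, spread: float = 0.1) -> float:
    """Spread expirations so keys cached together do not expire together."""
    return base_s * (1.0 + random.uniform(-spread, spread))
```

Singleflight bounds DB reads per key to one per pod at a time; the Redis-side mutex from the same row is what bounds them across pods.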

Operational & deployment failures

Failure | Symptom | Mitigation
Bad deploy crashes redirect pods | Error rate spike, latency spike | Canary deploy (1% → 10% → 100%); automatic rollback on SLO violation; deploy frozen during high-traffic windows
Schema migration on Cassandra | Inconsistent schema across nodes | Migrations gated through a coordinator; schema agreement check before considering migration complete
Dependency upgrade (Redis 7→8) misbehavior | Subtle correctness issues | Shadow cluster running new version; mirror 1% of traffic; compare responses before cutover
Backup restore wipes recent writes | Data loss | Point-in-time recovery (continuous WAL archiving); commit-log replay for Cassandra; pre-restore snapshot before any destructive operation
Runaway analytics consumer falls behind | Kafka backlog grows, eventual storage pressure | Click events use dedicated topic with separate retention; analytics is async and never blocks the redirect

x. Security & Abuse

A URL shortener is a redirect-as-a-service that anyone on the internet can write to. That makes it a magnet for abuse, and the abuse pathway is usually more damaging to the business than the technical failure modes. Staff-level system design accounts for this from day one.

Threat model

Threat | Risk | Mitigation
Phishing / malware redirect | Brand damage; potential CSAM / regulatory exposure | Scan target URL at write time against Google Safe Browsing, PhishTank, and an internal blocklist; async re-scan periodically; is_disabled tombstone for takedowns (instant, no DB delete)
Open-redirect abuse for phishing | Attacker uses your domain to mask a malicious URL in an email | Interstitial warning page for URLs with low age + low click count; reputation scoring
Enumeration of sequential short codes | Scraping all URLs exposes private links | Counter-based IDs leak business metrics (total URL count) and are enumerable. Mitigations: (a) randomize within the assigned range, (b) sparse base62 with a permutation step, or (c) use UUIDv7 prefix + truncate for the short code. Trade-off: harder to debug
DDoS on a single short code | Hot-key amplification used as an attack vector | Per-IP and per-short-code rate limiting at the edge; CAPTCHA challenge above threshold; the CDN must absorb the attack and never let it reach origin
DDoS on write endpoint | Range exhaustion, DB write saturation | Hard per-IP write rate limit (e.g., 60/min anonymous, higher with API key); CAPTCHA for anonymous writes; bot detection (header/timing heuristics)
Custom alias squatting | Reserving common words / brand names | Reserved-word blocklist (legal + product); rate-limit custom aliases (1 per minute per user); paid tier for premium aliases
Long URL points back to the shortener | Redirect loop, DoS amplification | Resolve target URL at write time; reject self-referential or loop-prone targets; cap redirect chain depth at the client level
GDPR / right-to-erasure | User requests deletion of their links and click history | Per-user deletion API; is_disabled = true on URLs; tombstone propagates to caches via invalidation; analytics anonymized at ingestion (no raw IPs after N days)
Audit / forensics | Law enforcement requests, takedown traceability | Append-only audit log of writes and takedowns; retention policy aligned with jurisdiction; signed log entries
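
For the enumeration row, one way to read the "permutation step" is a bijective scramble of the counter before base62 encoding. The sketch below uses an affine map modulo 62^7; the constants are arbitrary placeholders, and this is obfuscation only, since consecutive codes still differ by a fixed amount that an observer can recover. A keyed construction such as a Feistel network would be the stronger choice.

```python
import math

CODE_SPACE = 62 ** 7            # equals 2**7 * 31**7
MULTIPLIER = 2_654_435_761      # placeholder; must share no factor with CODE_SPACE
OFFSET = 1_234_567_890_123      # placeholder; any value below CODE_SPACE

# Coprimality is what makes the map a bijection on [0, CODE_SPACE).
assert math.gcd(MULTIPLIER, CODE_SPACE) == 1
INVERSE = pow(MULTIPLIER, -1, CODE_SPACE)  # modular inverse, Python 3.8+


def scramble(counter_id: int) -> int:
    """Map a sequential ID to a scattered public ID in the same space."""
    return (counter_id * MULTIPLIER + OFFSET) % CODE_SPACE


def unscramble(public_id: int) -> int:
    """Recover the sequential ID, for example when debugging."""
    return ((public_id - OFFSET) * INVERSE) % CODE_SPACE
```

The map is collision-free by construction, so the uniqueness guarantee of the counter ranges carries over to the scrambled codes.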

Takedown propagation

When abuse is detected, the URL must stop redirecting fast. The flow:

  1. Set is_disabled = true in DB (single write, immediately consistent within region)
  2. Publish invalidate:url:{code} to a Redis pub/sub channel — all redirect pods drop their L2 entry
  3. DEL url:{code} in Redis cluster
  4. Issue a CDN purge for the redirect URL via the CDN API
  5. The whole flow completes in < 5 s globally; downstream redirects return a takedown page (HTTP 410 Gone, not 404)
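
The ordering of the flow above matters, and a small sketch makes it explicit. Plain dicts stand in for the DB, Redis, and each pod's cache, a loop stands in for the pub/sub broadcast, and `take_down`, `resolve`, and `purge_cdn` are hypothetical names.

```python
from typing import Callable, Dict, List, Optional, Tuple


def take_down(code: str,
              db: Dict[str, dict],
              redis_cache: Dict[str, str],
              pod_caches: List[Dict[str, str]],
              purge_cdn: Callable[[str], None]) -> None:
    """Disable a short code, then invalidate caches from the inside out."""
    # 1. Source of truth first, so any later cache refill sees the tombstone.
    db[code]["is_disabled"] = True
    # 2. Every pod drops its in-process entry (stand-in for pub/sub).
    for cache in pod_caches:
        cache.pop(code, None)
    # 3. Remove the shared cache entry.
    redis_cache.pop(f"url:{code}", None)
    # 4. Purge the edge last, once nothing behind it can serve the old value.
    purge_cdn(code)


def resolve(code: str, db: Dict[str, dict]) -> Tuple[int, Optional[str]]:
    """Origin lookup after a takedown: 410 Gone for disabled codes."""
    row = db.get(code)
    if row is None:
        return 404, None
    if row["is_disabled"]:
        return 410, None
    return 302, row["long_url"]
```

Writing the tombstone before any purge avoids the case where a cache is emptied and then refilled from a row that is still enabled.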

xi. Observability & SLOs

Service-level objectives

SLI | Target (28-day window) | Error budget
Redirect availability (HTTP 2xx/3xx ratio) | 99.99% | ~4.0 minutes per 28 days (~4.3 per 30-day month)
Redirect p50 latency (server-side) | < 5 ms |
Redirect p99 latency | < 10 ms |
Redirect p99.9 latency | < 50 ms |
Write availability | 99.9% | ~40 minutes per 28 days (~43 per 30-day month); lower tier than reads
Write p99 latency | < 100 ms |
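
The error budgets follow directly from the availability targets. A small sketch of the arithmetic, with a burn-rate helper added as an illustration of how alerting could be expressed against the same budget; both function names are assumptions.

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Minutes of full unavailability the SLO allows within the window."""
    return (1.0 - slo) * window_days * 24 * 60


def burn_rate(observed_error_ratio: float, slo: float) -> float:
    """How many times faster than sustainable the budget is being spent."""
    return observed_error_ratio / (1.0 - slo)


for name, slo in (("redirect", 0.9999), ("write", 0.999)):
    print(f"{name}: {error_budget_minutes(slo, 28):.1f} min / 28 days, "
          f"{error_budget_minutes(slo, 30):.1f} min / 30 days")
```

A burn rate of 1 spends exactly the whole budget over the window; a 0.1% redirect error ratio against the 99.99% target is a burn rate of 10.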

Golden signals — per service

Key alerts

Tracing & debugging

xii. Key Takeaways

  1. Read-heavy → build a four-tier cache hierarchy: browser → CDN → in-process LRU → Redis, with the DB as origin. Each tier absorbs an order of magnitude. The CDN does the heaviest lifting.
  2. Base62 + counter beats hashing for uniqueness guarantees at scale. Counter ranges via ZK/etcd eliminate per-write coordination.
  3. Cassandra with short_code as partition key is the textbook fit. RF=3, LOCAL_QUORUM writes, LOCAL_ONE reads, LWT only for custom alias races.
  4. Cache is an optimization, never a dependency. Size the DB tier to absorb the cache-down case, or you've built a system that secretly requires every component to be healthy.
  5. Hot keys are the asymmetric risk. 99.9% of keys behave; the 0.1% that go viral can saturate one Redis shard, one Cassandra partition, one anything. Layer mitigations: CDN, in-process LRU, key splitting, replica fan-out.
  6. The redirect path is region-local. Anycast routing, region-local replicas, async cross-region replication. Never call across regions on the hot path.
  7. Abuse is the real failure mode. Phishing scans at write time, takedown propagation in seconds, enumeration-resistant ID schemes, rate limiting at the edge.
  8. Separate write and redirect services — they have completely different scaling profiles, SLOs, and failure tolerances. Don't deploy them together.
  9. 301 vs 302 is a real trade-off. 301 minimizes server load and enables browser caching; 302 keeps analytics flowing. The right answer depends on your product.
  10. Observability buys you the budget to ship boldly. SLOs with error budgets, golden signals at every tier, traces that span CDN to DB, and alerts that fire on burn rate — not on threshold crossings.

xiii. Go Deeper