Vol. I  ·  Issue 02  ·  MMXXVI
Issue 02 · Design

Paste Bin

URL shortener plus content. The moment the payload outgrows the row is the moment your architecture changes entirely.

Read / write ratio: 10:1
Peak QPS (reads): 3.5K
Latency target: < 50 ms p99
Avg paste size: 10 KB
Storage / 10 yr: ~365 TB (raw, worst case)

i. Requirements

Functional

Non-Functional

Out of Scope

ii. Capacity Estimates

| Parameter | Value | Notes |
| --- | --- | --- |
| New pastes per day | 10 million | 1/10th of TinyURL; content is costlier to produce |
| Write QPS (avg) | ~115 | 10M / 86,400 |
| Write QPS (peak ≈ 3×) | ~350 | |
| Read QPS (10:1, avg) | ~1,150 | |
| Read QPS (peak) | ~3,500 | |
| Average paste size | 10 KB | Mix of tiny snippets and larger files; 95th pct < 100 KB |
| Max paste size | 10 MB | Hard cap; reject at ingress |
| Metadata record size | ~1 KB | IDs, timestamps, TTL, content_key, owner, size |
| Content storage / year | 10M × 365 × 10 KB = ~36.5 TB | |
| Content storage / 10 years | ~365 TB worst case | 36.5 TB × 10 with infinite retention, before replication; aggressive expiration lowers it |
| Metadata storage / 10 years | ~36 TB | Fits in Cassandra comfortably |
| Read bandwidth at peak | 3,500 × 10 KB = 35 MB/s ≈ 0.28 Gbps | Without CDN; with CDN the origin sees 10–20% of this |
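
The estimates above are plain arithmetic, so they are easy to re-derive when an input changes. A minimal sketch, assuming decimal units (1 KB = 1,000 bytes) and the input values from the table; the variable names are illustrative:

```python
# Back-of-the-envelope capacity arithmetic (decimal units: 1 KB = 1,000 bytes).
SECONDS_PER_DAY = 86_400
PASTES_PER_DAY = 10_000_000
AVG_PASTE_BYTES = 10_000      # 10 KB average paste
METADATA_BYTES = 1_000        # ~1 KB metadata record
READ_WRITE_RATIO = 10
PEAK_FACTOR = 3               # peak is taken as roughly 3x the average

write_qps_avg = PASTES_PER_DAY / SECONDS_PER_DAY
write_qps_peak = write_qps_avg * PEAK_FACTOR
read_qps_avg = write_qps_avg * READ_WRITE_RATIO
read_qps_peak = read_qps_avg * PEAK_FACTOR

# Raw bytes written, before replication and before any expiry reclaims space.
content_tb_per_year = PASTES_PER_DAY * 365 * AVG_PASTE_BYTES / 1e12
content_tb_10_years = content_tb_per_year * 10
metadata_tb_10_years = PASTES_PER_DAY * 365 * 10 * METADATA_BYTES / 1e12

# Peak read bandwidth, using the rounded 3,500 QPS figure from the table.
peak_read_gbps = 3_500 * AVG_PASTE_BYTES * 8 / 1e9

print(f"write QPS avg / peak: {write_qps_avg:.0f} / {write_qps_peak:.0f}")
print(f"read QPS avg / peak:  {read_qps_avg:.0f} / {read_qps_peak:.0f}")
print(f"content TB per year / 10 years: {content_tb_per_year:.1f} / {content_tb_10_years:.0f}")
print(f"metadata TB over 10 years: {metadata_tb_10_years:.1f}")
print(f"peak read bandwidth: {peak_read_gbps:.2f} Gbps")
```

With infinite retention the raw ten-year content figure comes out at about 365 TB; replication and cross-region copies multiply it, expiration reduces it.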

Per-tier sizing — back of the envelope

The read QPS looks modest (3,500) but the byte rate is not. CDN absorption is the single biggest lever, especially since popular pastes are fetched repeatedly.

| Tier | Load at peak | Sizing | Binding resource |
| --- | --- | --- | --- |
| CDN / Edge | ~80% of reads (public popular pastes) | Anycast PoPs; no origin capacity needed for hits | Egress bandwidth |
| Read service (app) | ~700 QPS post-CDN | 5–10 pods; trivially small, because the bottleneck is I/O not CPU | Open file handles / network to S3 |
| Redis cluster | ~630 QPS (90% L2 hit on post-CDN traffic) | 2–4 shards; small pastes (<4 KB) inline, larger just cache metadata | Memory (~10 GB/shard) |
| Object storage (S3) | ~70 QPS (10% cache miss) | No sizing needed; S3 auto-scales and cost is the real metric | GET latency (~5–30 ms) |
| Metadata DB (Cassandra) | ~70 QPS reads, ~350 QPS writes | 3 nodes per region, RF=3 | Write IOPS |
| Write service (app) | ~350 QPS peak | 3–5 pods; bottleneck is S3 PUT + DB write, not app CPU | S3 PUT latency |
| Expiry worker | Background; bursty on TTL boundaries | 1–2 pods per region; reads expiry queue, deletes S3 + DB | S3 DELETE throughput |
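
The per-tier read loads in the table follow from applying each hit rate to the traffic that survives the layer above it. A small sketch of that cascade, assuming the 80% CDN and 90% Redis hit rates quoted in the table; `tier_load` is a hypothetical helper name:

```python
def tier_load(peak_read_qps: float, cdn_hit_rate: float, redis_hit_rate: float) -> dict:
    """Cascade read traffic through the CDN and Redis layers."""
    origin = peak_read_qps * (1 - cdn_hit_rate)      # reads that miss the CDN
    redis_hits = origin * redis_hit_rate             # served from Redis
    backend = origin - redis_hits                    # fall through to DB / S3
    return {"origin_qps": origin, "redis_hit_qps": redis_hits, "backend_qps": backend}


load = tier_load(3_500, cdn_hit_rate=0.80, redis_hit_rate=0.90)
for tier, qps in load.items():
    print(f"{tier}: {qps:.0f}")
```

Rerunning it with a lower CDN hit rate shows how quickly the origin numbers grow, which is why the CDN is treated as the main lever.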

iii. High-Level Design

Client
  │
  ▼
[Geo-DNS / Anycast]
  │
  ▼
[CDN Edge PoP]  ──hit (public paste)──▶  response (content from edge cache)
  │ miss
  ▼
[Regional Load Balancer]
  │                          │
  ▼                          ▼
[Read Service]          [Write Service]
  │                          │
  ├─▶ [Redis L2]             ├─▶ [ID Generator (base62)]
  │    hit → return          ├─▶ [Object Storage: S3/GCS]  ──▶  content bytes
  │    miss ↓                └─▶ [Metadata DB: Cassandra]  ──▶  row (id, ttl, key, …)
  ├─▶ [Metadata DB]
  │    get content_key
  │
  └─▶ [Object Storage: S3/GCS]  ──▶  stream content bytes to client

[Expiry Worker] ──reads TTL queue──▶ deletes S3 object + DB row

The core split: metadata (paste ID, expiry, owner, content_key, size, language) lives in Cassandra; content bytes live in object storage (S3). The read service stitches them together. This split is the central design decision — everything downstream follows from it.

For small pastes (<4 KB), content is stored inline in the Cassandra row and the S3 hop is skipped entirely. The 4 KB threshold keeps the DB row under the Cassandra recommended limit for inline blobs while eliminating the object storage round-trip for the majority of pastes (code snippets, config fragments, short logs).

iv. Key Design Decisions

ID scheme

Same base62 approach as the URL shortener: a 7-character slug gives 62⁷ ≈ 3.5 trillion unique IDs, enough for any realistic horizon. ID generation uses a distributed counter with ZooKeeper-coordinated ranges: each write pod claims a range of 1 million IDs, burns through them locally (no coordination per write), then claims another. Ranges never overlap, so two pods cannot issue the same ID and no per-write collision check is needed.
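
The range-claiming scheme can be sketched as follows. This is an illustration under stated assumptions, not the production generator: `RangeCoordinator` is an in-memory stand-in for the ZooKeeper-backed counter, the alphabet order (digits, uppercase, lowercase) is arbitrary, and the class is not thread-safe.

```python
import itertools
import string

ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase  # 62 symbols
SLUG_LENGTH = 7
RANGE_SIZE = 1_000_000


def encode_base62(n: int, width: int = SLUG_LENGTH) -> str:
    """Encode a counter value as a fixed-width base62 slug."""
    if not 0 <= n < 62 ** width:
        raise ValueError("counter value does not fit in the slug width")
    chars = []
    for _ in range(width):
        n, rem = divmod(n, 62)
        chars.append(ALPHABET[rem])
    return "".join(reversed(chars))


class RangeCoordinator:
    """In-memory stand-in for the coordinated range counter (assumption)."""

    def __init__(self) -> None:
        self._next_range = itertools.count()

    def claim(self) -> range:
        start = next(self._next_range) * RANGE_SIZE
        return range(start, start + RANGE_SIZE)


class IdGenerator:
    """Per-pod generator: burns through a claimed range, then claims another."""

    def __init__(self, coordinator: RangeCoordinator) -> None:
        self._coordinator = coordinator
        self._ids = iter(())  # empty until the first claim

    def next_id(self) -> str:
        try:
            value = next(self._ids)
        except StopIteration:
            # Only this step talks to the coordinator: once per million IDs.
            self._ids = iter(self._coordinator.claim())
            value = next(self._ids)
        return encode_base62(value)
```

Two generators sharing one coordinator receive disjoint ranges, so their slugs never collide. Slugs produced this way are sequential within a range, which makes them guessable unless the counter is permuted before encoding.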

For custom aliases, the write path does a Cassandra LWT (lightweight transaction) INSERT IF NOT EXISTS on the alias. Races are rare; LWT handles them correctly without distributed locks.

The inline vs. object storage split

| Paste size | Content storage | Read path | Rationale |
| --- | --- | --- | --- |
| < 4 KB | Cassandra row (content_inline blob column) | Single DB read, no S3 hop | Eliminates extra round-trip; ~50–60% of all pastes by count |
| 4 KB – 10 MB | S3 object; Cassandra row stores content_key | DB read for metadata + S3 GET for bytes | Keeps rows small; S3 is cheaper and more durable per byte than DB storage |
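
The split in the table reduces to one size check at write time. A minimal sketch, assuming a binary 4 KB threshold (4,096 bytes) and the two-character key prefix described later under bottlenecks; `Placement`, `object_key`, and `place` are hypothetical names:

```python
from dataclasses import dataclass
from typing import Optional

INLINE_THRESHOLD_BYTES = 4 * 1024      # pastes below this stay in the DB row
MAX_PASTE_BYTES = 10 * 1024 * 1024     # hard cap; larger payloads are rejected


@dataclass(frozen=True)
class Placement:
    content_type: str            # "inline" or "s3"
    content_key: Optional[str]   # object key when content_type == "s3"


def object_key(paste_id: str) -> str:
    """Prefix the key with the first two characters of the ID to spread load."""
    return f"{paste_id[:2]}/{paste_id}"


def place(paste_id: str, content: bytes) -> Placement:
    """Decide where a paste's bytes live, based only on their size."""
    size = len(content)
    if size > MAX_PASTE_BYTES:
        raise ValueError(f"paste is {size} bytes; limit is {MAX_PASTE_BYTES}")
    if size < INLINE_THRESHOLD_BYTES:
        return Placement("inline", None)
    return Placement("s3", object_key(paste_id))
```

Keeping the threshold as a named constant (or a config value) makes it cheap to retune once the real size distribution is known.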

Database schema (Cassandra)

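Cassandra is chosen over Postgres for the same reasons as the URL shortener: the access pattern is a pure point lookup by paste_id, writes are steady and replicated (10M/day × RF=3 = 30M physical writes per day), and the 10-year storage horizon demands horizontal scaling. A Postgres primary + replicas would work early on, but not for long: at 10M new pastes per day, 500M rows accumulate in about 50 days unless most pastes expire quickly. Treat ~500M live rows as the migration trigger rather than a two-year milestone.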

-- Primary table: all reads go here
CREATE TABLE pastes (
  paste_id      text PRIMARY KEY,       -- base62 slug, e.g. "aB3kZ9m"
  created_at    timestamp,
  expires_at    timestamp,              -- null = never expires
  owner_id      text,                   -- null for anonymous
  language      text,                   -- "python", "sql", null
  title         text,
  size_bytes    int,
  content_type  text,                   -- "inline" | "s3"
  content_key   text,                   -- S3 key if content_type = "s3"
  content_inline blob,                  -- raw bytes if content_type = "inline"
  password_hash text,                   -- bcrypt hash; null = public
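  burn_on_read  boolean
) WITH default_time_to_live = 0        -- TTL managed at app layer, not Cassandra native
  AND compaction = { 'class': 'LeveledCompactionStrategy' };

-- Counter columns cannot share a table with regular columns, so views get their own table
CREATE TABLE paste_views (
  paste_id   text PRIMARY KEY,
  view_count counter                   -- approximate; see note
);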

-- Expiry index: drives the cleanup worker
CREATE TABLE pastes_by_expiry (
  expiry_bucket text,    -- "2026-05-28T14" (hour granularity)
  expires_at    timestamp,
  paste_id      text,
  PRIMARY KEY (expiry_bucket, expires_at, paste_id)
) WITH CLUSTERING ORDER BY (expires_at ASC, paste_id ASC);

-- Owner index: "my pastes" lookup (secondary access pattern)
CREATE TABLE pastes_by_owner (
  owner_id   text,
  created_at timestamp,
  paste_id   text,
  title      text,
  PRIMARY KEY (owner_id, created_at, paste_id)
) WITH CLUSTERING ORDER BY (created_at DESC, paste_id ASC);

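Notes on schema decisions: Cassandra native TTL is not used for the main table because it is set per write and cannot be extended in place; changing it means rewriting every column of the row with a new TTL. If an owner extends a paste's expiry and that rewrite is missed, the native TTL silently deletes the paste anyway. App-layer expiry (checked at read time + async worker deletion) is more flexible. The view_count is a Cassandra counter. Counter columns cannot share a table with regular columns, so it belongs in a dedicated counter table keyed by paste_id. Counter increments are not idempotent, so a retried increment after a timeout can over-count; that is still accurate enough for analytics and "popular" ranking.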

Caching strategy

| Layer | What is cached | TTL | Target hit rate | Latency |
| --- | --- | --- | --- | --- |
| Browser cache | Public paste content (Cache-Control: max-age) | Matches paste expiry or 1h for permanent | High for repeated local views | 0 ms |
| CDN edge | Public paste responses (full HTTP response) | Matches paste expiry; private/password pastes excluded via Cache-Control: private | ~80% of public reads | ~5–15 ms |
| In-process L1 (app pod) | Metadata for recently accessed pastes; small (<4 KB) pastes in full | 30s with jitter ±5s | ~10% of post-CDN traffic | <1 ms |
| Redis L2 | Metadata + inline content; large pastes: metadata only + pre-signed S3 URL | TTL = min(paste expiry - now, 24h) with ±10% jitter | ~85% of post-CDN traffic | ~1–3 ms |
| Object storage (S3) | Source of truth for large paste content | Permanent (managed by lifecycle rules) | ~5% of post-CDN traffic | ~10–50 ms |
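
The Redis TTL rule in the table, min(paste expiry - now, 24h) with ±10% jitter, can be sketched as below. The clamp that stops upward jitter from outliving the paste is an added assumption, as is the convention that a return value of 0 means "do not cache"; `redis_ttl_seconds` is a hypothetical name.

```python
import random
from datetime import datetime, timedelta, timezone
from typing import Optional

MAX_CACHE_TTL = timedelta(hours=24)
JITTER_FRACTION = 0.10


def redis_ttl_seconds(expires_at: Optional[datetime], now: datetime,
                      rng: Optional[random.Random] = None) -> int:
    """Return a cache TTL in seconds; 0 means the paste must not be cached."""
    rng = rng or random.Random()
    base = MAX_CACHE_TTL
    if expires_at is not None:
        base = min(expires_at - now, MAX_CACHE_TTL)
    if base <= timedelta(0):
        return 0  # already expired
    # Jitter spreads expirations so entries written together do not expire together.
    jitter = rng.uniform(-JITTER_FRACTION, JITTER_FRACTION)
    ttl = base.total_seconds() * (1 + jitter)
    if expires_at is not None:
        # Assumption: never let upward jitter push the entry past the paste's expiry.
        ttl = min(ttl, (expires_at - now).total_seconds())
    return max(1, int(ttl))


now = datetime(2026, 5, 28, 14, 0, tzinfo=timezone.utc)
print(redis_ttl_seconds(None, now))                        # permanent paste: about 24h
print(redis_ttl_seconds(now + timedelta(minutes=5), now))  # short-lived paste: at most 300s
```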

Negative caching: A missing paste_id (404) is cached in Redis for 60s with a sentinel value. Without this, a scan-and-fetch loop against random IDs would hammer Cassandra on every miss.
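
A minimal sketch of the negative-caching path, assuming a dict-backed stand-in for Redis and an injected `db_get` callable in place of the Cassandra lookup; the names and the 300-second positive TTL are illustrative:

```python
import time
from typing import Callable, Dict, Optional, Tuple

NOT_FOUND = object()           # sentinel stored in place of a missing paste
NEGATIVE_TTL_SECONDS = 60


class MetadataCache:
    """Dict-backed stand-in for Redis, with per-entry expiry."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._entries: Dict[str, Tuple[float, object]] = {}

    def get(self, key: str) -> object:
        entry = self._entries.get(key)
        if entry is None or entry[0] <= self._clock():
            self._entries.pop(key, None)   # drop expired entries lazily
            return None
        return entry[1]

    def set(self, key: str, value: object, ttl_seconds: float) -> None:
        self._entries[key] = (self._clock() + ttl_seconds, value)


def lookup(paste_id: str, cache: MetadataCache,
           db_get: Callable[[str], Optional[dict]],
           positive_ttl: float = 300.0) -> Optional[dict]:
    """Return paste metadata, or None for a 404, caching both outcomes."""
    cached = cache.get(paste_id)
    if cached is NOT_FOUND:
        return None                        # known-missing: no DB round-trip
    if cached is not None:
        return cached
    row = db_get(paste_id)
    if row is None:
        cache.set(paste_id, NOT_FOUND, NEGATIVE_TTL_SECONDS)
        return None
    cache.set(paste_id, row, positive_ttl)
    return row
```

One consequence worth planning for: a paste created with a custom alias that was probed in the last 60 seconds stays invisible until the sentinel expires, unless the write path deletes the sentinel.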

Stampede protection: The read service uses Go's singleflight in-process for concurrent identical requests (a popular paste shared via social media sees a burst of simultaneous first-loads). At Redis level, only one request holds a distributed mutex to rebuild the cache entry; others wait with a 200ms timeout, then fall through to S3 directly if the mutex holder times out.
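
The in-process half of this is request coalescing: the first caller for a key does the work and every concurrent caller for the same key waits for that result. The text names Go's singleflight; the sketch below is a Python analogue of the same idea using threads, not the Go package's API, and `SingleFlight` is an illustrative class.

```python
import threading
from typing import Any, Callable, Dict


class _Call:
    """State shared between the leader and the callers waiting on it."""

    def __init__(self) -> None:
        self.done = threading.Event()
        self.result: Any = None
        self.error: Any = None


class SingleFlight:
    """Coalesce concurrent calls for the same key into one execution."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._calls: Dict[str, _Call] = {}

    def do(self, key: str, fn: Callable[[], Any]) -> Any:
        with self._lock:
            call = self._calls.get(key)
            leader = call is None
            if leader:
                call = _Call()
                self._calls[key] = call
        if leader:
            try:
                call.result = fn()
            except BaseException as exc:   # propagate failures to every waiter
                call.error = exc
            finally:
                with self._lock:
                    del self._calls[key]   # later callers start a fresh flight
                call.done.set()
        else:
            call.done.wait()
        if call.error is not None:
            raise call.error
        return call.result
```

The key is removed as soon as the leader finishes, so this deduplicates only overlapping requests; it is not a cache and does not replace the Redis layer.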

Pre-signed S3 URLs: For large pastes, Redis caches a pre-signed S3 GET URL (valid 15 minutes) rather than the bytes themselves. The client is redirected to S3 directly, offloading bandwidth from the app tier entirely. This is the right trade-off when paste size is large enough that Redis memory cost exceeds S3 GET cost.

v. Deep Dives

Expiration pipeline

Expiry is the feature that quietly makes the system hard. A paste with expires_at = T must stop being readable at T, and its bytes must eventually be deleted from S3 to reclaim storage. There are three components:

  1. Read-time enforcement: Every read checks expires_at against now() in the metadata returned from Cassandra or Redis. Expired pastes return 404 immediately. This is the correctness layer — it fires even if the async cleanup worker is lagging.
  2. Expiry worker (async cleanup): A background service reads pastes_by_expiry in hour-bucket order. For each expired paste it: (a) deletes the S3 object if content_type = "s3", (b) deletes the Cassandra rows across all three tables, (c) invalidates CDN cache via purge API, (d) deletes from Redis. Order matters: S3 first, then DB. If the worker crashes between S3 delete and DB delete, a subsequent read gets a 404 from the DB (expired) or a 404 from S3 (object gone) — either way correct. The reverse order (DB then S3) would leave orphaned bytes on S3 indefinitely.
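  3. S3 lifecycle rules as backstop: S3 lifecycle rules cannot read an arbitrary per-object date; they expire objects a fixed number of days after creation, filtered by prefix or tag. The write service therefore tags each object at creation with a coarse TTL class (for example ttl=1d, ttl=7d, ttl=30d), and one lifecycle rule per class deletes objects once that many days have passed. This is a belt-and-suspenders measure: it catches anything the worker missed due to a multi-day outage. If an owner extends a paste's expiry, the object must be re-tagged so the backstop does not delete it early.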
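
The first two components share two small pieces of logic: the read-time expiry check and the hour-bucket key that the worker scans. A sketch under stated assumptions: timestamps are timezone-aware, buckets are keyed in UTC, and `due_buckets` is a hypothetical helper for a worker that remembers the last hour it finished.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterator, Optional


def expiry_bucket(expires_at: datetime) -> str:
    """Hour-granularity partition key in UTC, e.g. '2026-05-28T14'."""
    return expires_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H")


def is_expired(expires_at: Optional[datetime], now: datetime) -> bool:
    """Read-time enforcement: a null expires_at means the paste never expires."""
    return expires_at is not None and now >= expires_at


def _floor_to_hour(moment: datetime) -> datetime:
    return moment.astimezone(timezone.utc).replace(minute=0, second=0, microsecond=0)


def due_buckets(last_done: datetime, now: datetime) -> Iterator[str]:
    """Yield each hour bucket after last_done, up to and including the current hour.

    The current hour's bucket also holds pastes that have not expired yet, so the
    worker must still apply is_expired() to every row before deleting anything.
    """
    cursor = _floor_to_hour(last_done)
    end = _floor_to_hour(now)
    while cursor < end:
        cursor += timedelta(hours=1)
        yield expiry_bucket(cursor)
```

Because buckets are derived purely from the timestamp, a worker that was down for several hours simply iterates the buckets it missed.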

Burn-after-read

A paste with burn_on_read = true is deleted immediately after the first successful read. The implementation requires care: reading and deleting must be atomic-ish. The read service: (1) fetches metadata from DB, (2) checks burn flag, (3) if set, issues a conditional Cassandra DELETE (an LWT, DELETE ... IF EXISTS) before returning the content to the client, and serves the content only if the delete was applied, (4) purges CDN and Redis. The window between step 3 and the CDN purge propagating is typically < 5 seconds. If two readers race, the conditional delete ensures that one wins and the other gets a 404, which is acceptable behavior, documented in the product. A plain unconditional DELETE would not arbitrate the race: both readers could fetch the row before either delete lands.
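
One way to make "one wins, the other gets a 404" hold is to let the delete itself decide the winner. The sketch below assumes a conditional delete (the role a Cassandra lightweight transaction would play) and models it with an in-memory store and a lock; `PasteStore` and `read_paste` are hypothetical names.

```python
import threading
from typing import Dict, Optional


class PasteStore:
    """In-memory stand-in for the metadata table."""

    def __init__(self) -> None:
        self._rows: Dict[str, dict] = {}
        self._lock = threading.Lock()

    def put(self, paste_id: str, row: dict) -> None:
        with self._lock:
            self._rows[paste_id] = row

    def get(self, paste_id: str) -> Optional[dict]:
        with self._lock:
            return self._rows.get(paste_id)

    def delete_if_exists(self, paste_id: str) -> bool:
        """Atomically delete; True only for the caller that removed the row."""
        with self._lock:
            return self._rows.pop(paste_id, None) is not None


def read_paste(store: PasteStore, paste_id: str) -> Optional[bytes]:
    """Return the content, or None when the caller should answer 404."""
    row = store.get(paste_id)
    if row is None:
        return None
    if row["burn_on_read"] and not store.delete_if_exists(paste_id):
        return None   # another reader burned it between our read and our delete
    return row["content"]
```

The content is returned only after the delete succeeds, so a crash between the two steps loses the paste rather than serving it twice; that bias matches the feature's intent.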

Password-protected pastes

The password hash (bcrypt, cost factor 12) is stored in the Cassandra row. On read, the client submits the passphrase in a POST body (never in the URL — URLs end up in logs). The read service bcrypt-compares and returns 401 on mismatch. Password-protected pastes are excluded from CDN and Redis caching (served from origin only). Pre-signed S3 URLs are not used for password-protected pastes — the bytes must not be directly reachable without auth.

Syntax highlighting

Done entirely client-side using a library (Prism.js or highlight.js loaded from CDN). The server stores language as a metadata field; the client uses it to trigger the right grammar. This keeps the read path simple and stateless — no server-side rendering, no compute for highlighting. The trade-off is that curl-ing a paste returns raw text, which is almost always what developers want anyway.

Forking a paste

Fork = create a new paste with the same content but a new ID and new owner. The write service copies content from the original: for inline pastes, reads the blob from the parent row and writes it inline in the new row; for S3 pastes, issues an S3 server-side copy (no data moves across the network — S3 handles it internally, billing at copy cost not data transfer). The fork then proceeds as a normal write. This is cheap and correct.

vi. Bottlenecks by Tier

| Tier | Practical ceiling | Binding resource | Lift |
| --- | --- | --- | --- |
| CDN | Vendor-defined; effectively unlimited for reasonable traffic | Egress cost / contract limit | Increase edge PoP coverage; negotiate egress pricing; use Anycast correctly so traffic is routed to closest PoP |
| Read service (app pods) | ~2,000 concurrent S3 requests / pod (goroutines are cheap; the S3 connection pool starves first) | S3 connection pool exhaustion; open file descriptors | Increase pod count; tune S3 connection pool size; use HTTP/2 multiplexing to S3 |
| Redis | ~100K ops/sec/node; ~26 GB memory/node before eviction pressure | Memory (metadata + inline content fills fast) | Shard more aggressively; cache only metadata for large pastes (not the bytes); set maxmemory-policy to volatile-lru (evicts least-recently-used keys among those with a TTL) or volatile-ttl (evicts keys closest to expiry first) |
| Cassandra (reads) | ~20K reads/sec per node with LOCAL_ONE consistency; drops to ~8K at LOCAL_QUORUM | CPU for deserialization; SSTable read IOPS | Add nodes (Cassandra has no separate read replicas); serve reads at LOCAL_ONE (we accept stale by seconds, not minutes); tune bloom filter FP rate to reduce disk seeks |
| Cassandra (writes) | ~20K writes/sec per node (Cassandra is write-optimized via LSM) | Compaction I/O stealing from reads at high write QPS | Use LeveledCompactionStrategy (LCS), which trades more compaction CPU for fewer SSTables and more predictable read latency; add nodes horizontally |
| S3 (GETs) | ~5,500 GET requests/sec/prefix by default; higher with prefix sharding | Request rate per prefix (S3 internal partitioning) | Distribute content_keys across multiple prefixes (e.g., first 2 chars of paste_id as prefix: aB/aB3kZ9m); S3 also scales a hot prefix automatically under sustained traffic, though not instantly |
| S3 (PUTs) | ~3,500 PUT requests/sec/prefix | Same prefix partitioning | Same prefix sharding strategy; use multipart upload for pastes > 5 MB |
| Expiry worker | ~1,000 deletes/sec sustained | S3 DELETE throughput + Cassandra write IOPS | Parallelize worker threads; use the S3 multi-object delete API (DeleteObjects, up to 1,000 keys per request) for bulk deletes |

vii. Hot Keys / Skew / Pathological Data

The skew profile here is more extreme than in the URL shortener. A paste shared in a viral tweet or a Hacker News "Show HN" comment can go from 0 to 50,000 reads in 60 seconds. The paste ID becomes a hot key across every tier simultaneously: CDN, Redis, S3, and Cassandra all see it at once.

Mitigations

viii. Multi-Region Architecture

Routing layer

Anycast geo-DNS routes readers to the nearest region. Pastebin reads are entirely region-local: the CDN serves from edge; cache misses hit the regional read service and regional Cassandra replica. No cross-region hop on the read path.

Writes are slightly different. Anonymous pastes are created in the nearest region. Authenticated user pastes are created in the user's "home region" (determined at account creation) to simplify the pastes_by_owner lookup. If a user in Singapore creates an account, their pastes live in the AP region's Cassandra. Reads from the US get a cross-region hop for cache misses — acceptable for a write-once-read-many workload where the vast majority of reads are already CDN hits.

What is region-local vs. globally coordinated

| Concern | Region-local | Globally coordinated |
| --- | --- | --- |
| Paste content reads | Yes: CDN edge or regional read service | No |
| Paste content bytes (S3) | Cross-region replication via S3 CRR; reads serve from nearest region bucket | Write lands in origin region; replication async |
| Metadata (Cassandra) | Regional RF=3 cluster with async cross-region replication | Only custom alias creation, which uses an LWT in the alias arbiter region |
| Custom alias uniqueness | No; must be globally unique | Yes: LWT in a designated "alias arbiter" region; rare operation, latency acceptable |
| View count | Incremented regionally | Aggregated asynchronously into a global counter; eventual consistency is fine for view counts |
| Burn-after-read | Attempted in originating region | Cross-region invalidation (CDN purge is global by default) |

RPO / RTO matrix

| Failure scope | Impact | RPO | RTO |
| --- | --- | --- | --- |
| Single AZ down (within region) | Reduced capacity; Cassandra RF=3 survives 1 AZ loss | 0 (no data loss) | < 30s (LB removes unhealthy pods) |
| Full region down | Traffic re-routed to nearest healthy region via geo-DNS TTL | < 5 min for async-replicated content | ~2–5 min (geo-DNS TTL) |
| S3 region outage | Cache miss reads fail; S3 CRR allows failover to replica bucket | ~15 min replication lag (typical CRR lag) | < 10 min (update endpoint to replica bucket) |
| Metadata DB corruption (logical) | Paste metadata unreadable | Point-in-time Cassandra snapshots every 6h; max 6h data loss | Hours; Cassandra restore is slow; prioritize replaying the commit log (Cassandra's write-ahead log) |
| Expiry worker outage | Expired pastes stay in storage but are not served (the read-time check still returns 404); no data integrity loss | 0 (read-time enforcement still works) | Worker restarts automatically; backlog processed within hours |

ix. Failure Modes & Mitigations

Infrastructure failures

| Failure | Blast radius / symptom | Mitigation |
| --- | --- | --- |
| App pod crash | In-flight requests fail; LB health-check removes pod within 5s | Minimum 3 pods per region; LB health-check at 2s interval; circuit breaker in LB |
| Redis shard down | Cache misses for that shard's key space; latency spikes as requests fall through to S3/DB | Read service degrades gracefully to DB+S3; Redis cluster auto-promotes replica in < 30s; size DB tier to absorb full load |
| Redis cluster full (OOM) | Evictions begin; hit rate drops; DB and S3 absorb the load increase | volatile-lru eviction policy evicts least-recently-used keys among those with a TTL (use volatile-ttl to evict keys nearest expiry first); alert at 75% memory used; add shard before hitting 90% |
| Cassandra node down | RF=3 with LOCAL_QUORUM writes: 1 node loss is transparent; 2 node loss: writes fail (quorum unavailable) | 3-node minimum per region; RF=3; alert on node down immediately; replace within 4h (SLA) |
| S3 throttling (503 Slow Down) | Large paste reads and writes fail; app retries with exponential backoff + jitter | Prefix sharding to spread request rate; retry with jitter (1s base, 2× up to 30s, 3 attempts); circuit breaker after 5 consecutive failures |
| S3 region outage | Cache misses can't resolve for large pastes | S3 Cross-Region Replication (CRR); update read service endpoint to replica bucket via config flag; test failover quarterly |
| CDN PoP outage | Traffic falls through to origin; origin must handle full CDN-bypass load | Multi-CDN via geo-DNS health checks (Fastly + CloudFront failover); size origin for CDN-down case |
| Region-wide outage | All traffic for that region must reroute | Geo-DNS with low TTL (60s); cross-region Cassandra replication; S3 CRR; runbook for failover activation |
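
The S3 throttling mitigation (1s base, 2× growth, 30s cap, 3 attempts) can be sketched as a small retry wrapper. The "full jitter" variant used here, where each delay is drawn uniformly between zero and the current cap, is one common choice and an assumption; `SlowDown` is a stand-in exception, not a real SDK class.

```python
import random
import time
from typing import Callable, Iterator, Optional, TypeVar

T = TypeVar("T")


class SlowDown(Exception):
    """Stand-in for an S3 503 Slow Down response."""


def backoff_delays(base: float = 1.0, factor: float = 2.0, cap: float = 30.0,
                   attempts: int = 3,
                   rng: Optional[random.Random] = None) -> Iterator[float]:
    """Yield the sleep before each retry: uniform(0, min(cap, base * factor**n))."""
    rng = rng or random.Random()
    for n in range(attempts - 1):          # no sleep after the final attempt
        yield rng.uniform(0.0, min(cap, base * factor ** n))


def with_retries(op: Callable[[], T], attempts: int = 3,
                 sleep: Callable[[float], None] = time.sleep,
                 rng: Optional[random.Random] = None) -> T:
    """Call op, retrying on SlowDown until the attempt budget is spent."""
    delays = backoff_delays(attempts=attempts, rng=rng)
    while True:
        try:
            return op()
        except SlowDown:
            delay = next(delays, None)
            if delay is None:
                raise                      # budget exhausted: surface the error
            sleep(delay)
```

Injecting `sleep` keeps the wrapper testable without real waiting. With a 500 ms write latency target, retries of this length only make sense off the request path or behind the circuit breaker.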

Data-path pathologies

| Failure | Blast radius / symptom | Mitigation |
| --- | --- | --- |
| Cache stampede (viral paste) | CDN TTL expires on popular paste; thousands of simultaneous cache misses hit origin | stale-while-revalidate at CDN; singleflight in-process; distributed mutex at Redis; adaptive CDN TTL extension for hot keys |
| Hot key on Redis shard | Single shard CPU-bound; latency for all keys on that shard degrades | In-process L1 LRU absorbs top-N keys; pre-signed S3 URL redirect bypasses Redis for large pastes; key splitting not needed (paste IDs are globally unique, so there are no synonyms to fan out) |
| Hot partition on Cassandra | Node serving that paste_id's partition sees disproportionate read load | 99%+ of reads hit CDN or Redis; Cassandra should almost never see hot-paste traffic directly |
| Replication lag (Cassandra cross-region) | Reader in EU reads older content than writer in US; typical lag < 500ms | Acceptable for pastebin; document in SLO as "eventual consistency across regions, <1s typical"; if correctness is critical, write to the reader's region via home-region routing |
| Poison paste (write succeeds, S3 object corrupted) | Readers would get garbled bytes; with integrity checks enabled the read service detects the checksum mismatch, so the corruption is not silent | Enable S3 checksums (SHA-256); read service validates checksum before serving; serve 500 and alert on mismatch |
| Partial write (DB row written, S3 PUT failed) | Paste exists in metadata but content is absent; read returns 500 or empty body | Write service: S3 PUT first, then DB write. If S3 PUT fails, return error; no DB row is created. If DB write fails after S3 PUT, a background reconciler finds DB-absent S3 objects and either retries the DB write or deletes the orphaned S3 object. |
| Expiry worker falling behind | Expired pastes remain in S3 / DB longer than expected; read-time check still enforces correctness | Monitor expiry queue depth; auto-scale worker pods on queue depth; S3 lifecycle rules as backstop |
| Clock skew between nodes | expires_at enforcement inconsistent across nodes; a paste may be readable on one pod and expired on another | Use NTP with <100ms tolerance; treat expires_at as a soft boundary with 5s grace on the read side; burn-after-read uses Cassandra LWT (Paxos) not wall clock |
| Cassandra compaction storm | Sustained high write QPS triggers compaction; read latency spikes as compaction steals I/O | LeveledCompactionStrategy minimizes read amplification; throttle compaction throughput via compaction_throughput_mb_per_sec; add nodes to spread compaction load |
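
The partial-write row depends on the order of the two writes. A minimal sketch of that ordering and of the reconciler's orphan search, assuming injected `put_object` and `insert_row` callables in place of real S3 and Cassandra clients; all names are hypothetical:

```python
from typing import Callable, Dict, Iterable, Set


def create_paste(paste_id: str, content: bytes,
                 put_object: Callable[[str, bytes], None],
                 insert_row: Callable[[str, dict], None]) -> None:
    """Write bytes first, metadata second, so a row never points at nothing."""
    key = f"{paste_id[:2]}/{paste_id}"
    # Step 1: if this raises, nothing has been written and the paste does not exist.
    put_object(key, content)
    # Step 2: if this raises, the object above is an orphan for the reconciler.
    insert_row(paste_id, {
        "content_type": "s3",
        "content_key": key,
        "size_bytes": len(content),
    })


def find_orphans(object_keys: Iterable[str], rows: Dict[str, dict]) -> Set[str]:
    """Return object keys that no metadata row references."""
    referenced = {row["content_key"] for row in rows.values() if row.get("content_key")}
    return set(object_keys) - referenced
```

A real reconciler would also skip objects younger than some grace period, since an object whose metadata write is still in flight looks exactly like an orphan.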

Operational / deployment failures

| Failure | Blast radius / symptom | Mitigation |
| --- | --- | --- |
| Bad deploy (read service) | New pods return 5xx on paste fetch | Canary deploy at 5% traffic; auto-rollback on error rate > 1% sustained 2 min; blue-green for schema-breaking changes |
| Cassandra schema migration | Adding a column is safe (Cassandra handles it); renaming or changing type is not | Always additive: new column alongside old; dual-write during migration; remove old column only after all readers on new schema; never rename in place |
| S3 bucket misconfiguration (public access) | All paste content publicly readable without auth | Block public access at bucket level; all access via pre-signed URLs or app-layer auth; quarterly S3 policy audit; AWS Macie for sensitive data detection |
| Expiry worker runaway (deletes live pastes) | Active pastes deleted prematurely | Worker reads expiry only from pastes_by_expiry; cross-checks expires_at in main table before deleting; 5-second delay between DB read and S3 delete as a sanity window |
| ID generator pod restart mid-range | Unused IDs in the claimed range are wasted; not a correctness problem, but a small key space leak | Ranges are 1M IDs; losing a partial range wastes at most ~1M IDs out of 3.5T, which is negligible; log range claim/return at pod startup/shutdown for audit |

x. Security & Abuse

Threat model

| Attack vector | Risk | Mitigation |
| --- | --- | --- |
| Malware / phishing content hosting | High: short URLs and anonymous pastes are ideal phishing vehicles | Async content scan at write time (ClamAV for binaries; heuristic URL scanner for embedded links); Google Safe Browsing API check on URLs in paste content; flag for human review above a suspicion threshold; don't block writes synchronously; scan in background and tombstone within seconds of a positive hit |
| CSAM / illegal content | Critical: legal liability; platform must not host | PhotoDNA hash check for image content (text pastes: keyword signal + human review pipeline); hard delete within 60s of detection; preserve evidence hash for law enforcement in append-only audit log; legal hold prevents S3 lifecycle deletion |
| ID enumeration | Medium: attacker iterates IDs to harvest private pastes | IDs are base62 7-char (3.5T space); rate-limit 404s aggressively (10 per minute per IP / token); use non-sequential ID generation (shuffle counter range before encoding) to make sequential enumeration worthless; private pastes still require knowing the ID. Obscurity is not security, but it raises the cost of enumeration significantly |
| Credential / secret dumping | High: pastes are a common accidental leak surface for API keys, passwords, private keys | Regex scan on write for known secret patterns (AWS keys, GitHub tokens, private key headers); flag and notify owner if authenticated; for anonymous pastes, tombstone and alert security team; integrate with GitHub secret scanning partner program |
| DDoS on write endpoint | Medium: 350 QPS peak write is modest; DDoS can overwhelm it easily | Rate-limit writes per IP (10/min anonymous, 100/min authenticated); CAPTCHA at write for anonymous users on abuse signals; Cloudflare / Akamai DDoS protection at edge; WAF rule for request size > 10 MB |
| DDoS on single paste (read amplification) | High: attacker shares URL broadly; origin gets slammed | CDN absorbs read DDoS for public pastes; rate-limit reads per IP at edge (1,000/min); auto-block IP on anomalous read rate; for private pastes behind auth, the auth wall is the rate limit |
| Redirect loop / open redirect via paste content | Low: pastebin doesn't redirect on content; content is served raw | N/A: no redirect functionality for content; only the short URL → paste page is a redirect, which is the intended behavior |
| GDPR / right to erasure | Medium: EU users can request deletion of their content | Deletion pipeline: tombstone DB row within 30s, S3 delete within 5 min (sync), CDN purge within 5 min; audit log records deletion event but not content; document data residency (content stored in which regions) in privacy policy |
| Audit / forensics | Regulatory: law enforcement requests for content | Append-only audit log (CloudWatch Logs + S3 + WORM bucket policy) records: create event, read events (sampled), delete events, abuse flags, legal holds; content itself is in S3 which has object-level versioning; legal hold flag in DB prevents expiry worker from deleting; legal team owns key for WORM bucket |
| Password brute-force | Medium: password-protected pastes are individually targetable | Rate-limit wrong password attempts: 5 per paste_id per 15 minutes per IP; after 3 failures, serve a CAPTCHA; bcrypt cost factor 12 (300ms on server) makes brute force expensive; lock paste_id after 10 failures from different IPs |
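
The write-time secret scan in the table can start as a handful of regular expressions. The three patterns below are illustrative and far from exhaustive: they cover the common shapes of AWS access key IDs, classic GitHub personal access tokens, and PEM private key headers. `scan_for_secrets` is a hypothetical helper.

```python
import re
from typing import List

# Illustrative patterns only; a real scanner needs a maintained, much larger rule set.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "github_token": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
    "private_key_header": re.compile(r"-----BEGIN (?:[A-Z]+ )?PRIVATE KEY-----"),
}


def scan_for_secrets(text: str) -> List[str]:
    """Return the sorted names of every pattern that matches the paste text."""
    return sorted(name for name, pattern in SECRET_PATTERNS.items()
                  if pattern.search(text))


sample = "aws_access_key_id = AKIAIOSFODNN7EXAMPLE"
print(scan_for_secrets(sample))
```

A match is a signal, not proof: documentation placeholders like the one in `sample` match too, which is one reason the table routes hits to flagging and notification rather than silent deletion.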

Takedown propagation flow

Detection → flag in DB (status = 'tombstoned') → read service serves 451 (legal / abuse takedown) → purge CDN edge cache via API (propagates to all PoPs in < 10 seconds) → delete Redis cache → S3 delete or legal hold (depending on reason) → audit log entry. Total time from detection to edge cache cleared: < 60 seconds.

xi. Observability & SLOs

SLI targets (28-day rolling window)

| SLI | Target | Error budget / month |
| --- | --- | --- |
| Paste read availability (HTTP 2xx/3xx ratio, excluding 404/410) | 99.99% | ~4.3 minutes |
| Read p50 latency (time-to-first-byte, server-side, post-CDN) | < 20 ms | |
| Read p99 latency (TTFB, server-side) | < 50 ms | |
| Read p99.9 latency (TTFB) | < 200 ms | |
| Paste write availability | 99.9% | ~43 minutes; writes are less critical than reads |
| Write p99 latency (create paste, end-to-end) | < 500 ms | (includes S3 PUT; 95th pct < 200 ms) |
| Expiry correctness (expired paste serves 404 within 5s of TTL) | 99.9% | Read-time check is the enforcement; 0.1% failure budget for clock skew edge cases |
| Content durability | 99.999999999% (S3 eleven-nines) | Inherited from S3; Cassandra RF=3 provides independent metadata durability |
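
The availability budgets in the table are the allowed failure fraction multiplied by the length of the window. A short sketch; `error_budget_minutes` is an illustrative helper:

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Minutes of full unavailability a given SLO allows over the window."""
    return (1.0 - slo) * window_days * 24 * 60


for name, slo in [("read availability", 0.9999), ("write availability", 0.999)]:
    monthly = error_budget_minutes(slo, 30)
    rolling = error_budget_minutes(slo, 28)
    print(f"{name}: {monthly:.1f} min per 30 days, {rolling:.1f} min per 28 days")
```

The table's ~4.3 and ~43 minute figures correspond to a 30-day month; over the 28-day rolling window named in the heading the budgets are slightly smaller, about 4.0 and 40.3 minutes.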

Golden signals — per service

Key alerts (burn-rate rules)

Tracing & debugging

xii. Key Takeaways

The moment the payload outgrows the row, object storage becomes the architecture.
  1. Split metadata from content. Metadata (IDs, expiry, owner, size) belongs in a fast, indexed database. Content bytes belong in object storage. Mixing them — putting 10 MB blobs in Cassandra rows — collapses both tiers simultaneously under load.
  2. The inline threshold is a first-class decision. Storing small pastes (<4 KB) inline in the DB row eliminates an entire network hop for the majority of pastes. Tune this threshold based on observed size distribution, DB row limits, and S3 GET latency. It is a config value, not hard code.
  3. Write order matters for consistency. S3 PUT before DB write. A failed S3 PUT leaves no trace — the paste simply doesn't exist. A DB write before a failed S3 PUT leaves a dangling metadata row with no content, which is harder to clean up and confusing to read-path code.
  4. Expiry is a pipeline, not a flag. Read-time enforcement is the correctness guarantee. The async worker is the storage reclamation mechanism. S3 lifecycle rules are the backstop. All three must exist; each compensates for the others' failure modes.
  5. Pre-signed S3 URLs are the bandwidth escape hatch. For large pastes, redirect the client directly to S3. The app tier serves a 307 — not bytes. This keeps app pods stateless and thin, offloads bandwidth entirely to S3's own infrastructure, and is free to implement.
  6. Object storage has throughput ceilings per prefix. S3 allows ~3,500 PUT/s and ~5,500 GET/s per prefix. At low QPS this is invisible. At scale, prefix sharding (first 2 chars of the key) multiplies this ceiling by the number of prefixes. Design it in early — retrofitting prefix sharding onto an existing key scheme is painful.
  7. Burn-after-read is eventually consistent by design. A race between two simultaneous readers means one gets the content and one gets a 404. Document this, set user expectations, and don't try to make it atomic with distributed locking — the cure is worse than the disease.
  8. Abuse is load-bearing, not optional. Pastebin is one of the most abused infrastructure primitives on the internet. The content scanning pipeline, rate limiting, and takedown flow must be built before launch, not after the first incident.
  9. The read path is shaped by bytes, not requests. 3,500 QPS sounds modest until you multiply by 10 KB average and realize you need ~280 Mbps sustained at peak, without CDN. Architect everything around the byte rate, not the request rate. CDN is not optional; it is load-bearing infrastructure.
  10. Observability at the content_type boundary. Split all metrics by whether the paste is inline or S3. A latency regression that affects only S3 pastes (large content) looks like a p99 spike but leaves p50 fine — you will miss it without the split. This is the trace attribute that pays for itself on the first incident.

xiii. Go Deeper