SYSTEM DESIGN #03 · INTERVIEW GUIDE
Cross-Platform Identity Resolution System
A user browses your site on their work laptop, clicks an ad on their phone, and converts on a shared family tablet: three devices, one person, and no connection between them without identity resolution. This system underpins accurate attribution, personalisation, and fraud detection at scale. It links identities deterministically using hashed email and first-party cookies, and falls back to probabilistic fingerprint matching (cosine similarity ≥ 0.85) for anonymous sessions. The core data structure is a path-compressed Union-Find held in Redis for sub-millisecond canonical-ID lookups, with Amazon Neptune as the durable graph store. After this guide, you will know how to design a privacy-safe, GDPR-compliant identity graph that stitches 100M+ user profiles (1B+ identity edges) without ever storing raw PII.
Union-Find · Amazon Neptune · Redis · HMAC-SHA256 · GDPR
💡
The Gist — What Problem Are We Solving?
Recognising the same person across all their devices
A user sees an ad on their phone, searches on a laptop, and converts on a tablet. Without identity resolution those look like 3 different people. This system links cookies, hashed emails, and mobile IDs into one canonical profile, enabling accurate attribution across every touchpoint while staying GDPR-compliant.
💬Think of it as a detective that realises the phone, laptop, and tablet user are all the same person — and updates every record accordingly.
Functional Requirements
These are the capabilities the system must deliver: what users and operators can actually do with it.
🔗 Signal Ingestion
Accept browser cookie, IDFA/GAID, hashed email, loyalty ID, CRM ID
🧩 Deterministic Matching
Exact match on hashed email or authenticated user ID
🎲 Probabilistic Matching
Cosine similarity ≥0.85 on device fingerprint vector
🔐 Privacy and Consent
Consent gate before any PII processing; GDPR erasure within 30 days
📡 Serving API
<50ms lookup: signal → canonical_id; profile API for full identity cluster
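The probabilistic matching requirement can be sketched as follows. This is a minimal illustration, assuming device fingerprints have already been encoded as equal-length numeric vectors; the function names and the encoding are hypothetical and not part of the original design.

```python
import math

MATCH_THRESHOLD = 0.85  # threshold stated in the requirements


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("fingerprint vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector carries no signal
    return dot / (norm_a * norm_b)


def is_probabilistic_match(a, b, threshold=MATCH_THRESHOLD):
    """True when two fingerprint vectors are similar enough to link."""
    return cosine_similarity(a, b) >= threshold
```

Two sessions whose vectors differ only slightly clear the threshold, while orthogonal vectors score 0 and are never linked.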
⚡
Non-Functional Requirements
These define how well the system must perform — the quality attributes that separate a toy from a production system.
⚡ Lookup Latency
<50ms p99 for signal → canonical_id resolution
📈 Scale
100M+ unique users; 1B+ identity edges
⏱️ Graph Update
<5s from new signal to graph propagation
🛡️ Privacy
HMAC-SHA256 all PII; zero plain-text storage; consent-gated
📅 Erasure SLA
GDPR erasure propagated to all systems within 30 days
📊
Key Metrics — The Numbers That Define This System
The headline numbers to know cold — and be ready to explain how each one is achieved.
🏗️
System Architecture Diagram
Full data flow from source to serving. Each layer scales independently.
Ingestion → Resolution Service (Union-Find) → Identity Graph (Neptune/Redis)
↓
Processing: Canonical ID Store (Postgres) → Identity API + Merge Webhooks
🗺️
End-to-End User Journey
Trace a single request end-to-end — the story interviewers want you to tell fluently.
1. New browser signal: on first visit a cookie_id is generated and stored in Redis with a 365-day TTL; a graph node is created.
2. Email captured: the user fills in the checkout form; the HMAC-SHA256 hashed email is ingested and a deterministic match is attempted.
3. Match found: the hash matches an existing node → Union-Find UNION → canonical_id assigned.
4. Graph updated: Neptune edge created; Postgres canonical store updated; webhook fired to downstream consumers.
🔭
High-Level Design — Component Breakdown
Core components — each with a single, well-defined responsibility. The key architectural insight: each layer scales independently, and failure in one component is isolated from the rest.
1 — Signal API
REST entry point for identity signals (hashed_email, first_party_cookie_id, device_fingerprint, mobile_advertising_id). It performs only a synchronous Redis canonical_id lookup and publishes each signal to Kafka for asynchronous graph updates, so it stays stateless and scales horizontally by adding instances.
2 — Kafka
Distributed event bus with RF=3 for durability. Partitioned by user_id_hash for per-user ordering. LZ4 compression reduces storage cost by about 60%. Exactly-once semantics via idempotent producers and transactions, with consumers reading at the read_committed isolation level.
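Per-user ordering follows from every event for a given user_id_hash landing on the same partition. A minimal sketch of that mapping is below. The partition count and function name are hypothetical, and SHA-256 is only a stand-in: Kafka's default Java partitioner hashes the key bytes with murmur2.

```python
import hashlib

NUM_PARTITIONS = 64  # hypothetical partition count


def partition_for(user_id_hash: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition deterministically so one user's events stay ordered."""
    digest = hashlib.sha256(user_id_hash.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then wrap into the partition range.
    return int.from_bytes(digest[:8], "big") % num_partitions
```

Because the mapping depends only on the key, replays and retries for the same user always reach the same partition and therefore the same consumer.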
3 — Resolver
Kafka consumer that reads identity_signals, runs find and union on the Union-Find structure in Redis, and invalidates the affected canonical_id cache entries. Concurrent merges are serialised with Redlock. Scales horizontally by adding consumers, up to the partition count of the topic.
4 — Neptune
Managed graph database for identity graph and relationship queries. Property graph model. Millisecond queries for 1-hop traversals; batch Gremlin queries for multi-hop analysis. Synced from the Redis Union-Find every 5 minutes.
5 — Postgres
Relational canonical ID store. Each merge writes the updated canonical_id record here with ACID guarantees, alongside the Neptune edge. Read replicas serve analytics queries; connection pooling via PgBouncer.
6 — Redis
In-memory data structure server handling hot-path lookups in <1ms. Holds the Union-Find parent pointers, the canonical ID cache, and first-party cookie_id records (365-day TTL). Cluster mode with 6 shards for horizontal scale.
🔬
Low-Level Design — Deep Dives
Deep dives worth explaining in detail in any senior engineering interview. For each: know the data structure, the algorithm, the why, and the trade-off you made.
1 — Signal Ingestion API
Deterministic + Probabilistic
REST API accepts identity signals: hashed_email, first_party_cookie_id, device_fingerprint, mobile_advertising_id. Each signal type has a confidence weight: hashed_email=1.0, 1p_cookie=0.9, MAID=0.85, fingerprint≥0.85_cosine=0.7. Signals published to Kafka identity_signals topic for async graph updates. API responds in <20ms — synchronous Redis canonical_id lookup only; graph update is async.
POST /v1/identity/signal
{
  "signal_type": "hashed_email",
  "value": "sha256:a3f…",
  "context": {"session_id": "…", "consent": true}
}
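A client-side sketch of how such a payload might be built is shown below. It assumes the email is trimmed and lower-cased before hashing and that the HMAC key is supplied by a secrets manager; the function names and the normalisation rule are illustrative assumptions, not part of the documented API.

```python
import hashlib
import hmac
import json


def hash_email(email: str, secret_key: bytes) -> str:
    """Normalise, then HMAC-SHA256 the address so the raw value is never stored."""
    normalised = email.strip().lower()  # assumed normalisation rule
    return hmac.new(secret_key, normalised.encode("utf-8"), hashlib.sha256).hexdigest()


def build_signal(email: str, secret_key: bytes, session_id: str, consent: bool) -> dict:
    """Build the request body for POST /v1/identity/signal."""
    if not consent:
        # Consent gate: no PII processing without consent.
        raise PermissionError("consent is required before processing PII")
    return {
        "signal_type": "hashed_email",
        "value": "sha256:" + hash_email(email, secret_key),
        "context": {"session_id": session_id, "consent": True},
    }


if __name__ == "__main__":
    body = build_signal("User@Example.com ", b"demo-key", "s-1", consent=True)
    print(json.dumps(body, indent=2))
```

Normalising before hashing matters because deterministic matching is an exact comparison: two spellings of the same address must produce the same digest.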
2 — Graph Merge Worker
Union-Find · Redlock
Kafka consumer reads identity_signals and performs merge operations on the Union-Find graph stored in Redis. Concurrent merges for the same canonical_id are serialised using Redlock (3-node quorum, 100ms TTL). Merge logic: find(signal_a) → root_a, find(signal_b) → root_b. If root_a != root_b and neither exceeds cluster size limit (10K nodes): union(root_a, root_b). Post-merge: invalidate canonical_id cache entries for affected roots.
# The lock key is order-independent, so (a, b) and (b, a) contend on the same lock.
with redlock.lock(f'merge:{min(a, b)}:{max(a, b)}'):
    root_a = union_find.find(a)
    root_b = union_find.find(b)
    if root_a != root_b:
        union_find.union(root_a, root_b)
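The union_find object used above is not defined in the text. One possible in-memory sketch, with path compression, union by size, and the cluster size limit from the merge logic, is shown here. The class and its Python dicts are illustrative assumptions; the production structure described in this guide lives in Redis.

```python
class UnionFind:
    """In-memory disjoint-set with path compression and union by size."""

    def __init__(self, max_cluster_size=10_000):
        self.parent = {}
        self.size = {}  # cluster size, tracked for root nodes only
        self.max_cluster_size = max_cluster_size

    def find(self, x):
        # Register an unseen signal as its own singleton cluster.
        if x not in self.parent:
            self.parent[x] = x
            self.size[x] = 1
            return x
        # First pass: walk up to the root.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Second pass: point every node on the path directly at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        """Merge two clusters; return the surviving root, or None if a cap blocks it."""
        root_a, root_b = self.find(a), self.find(b)
        if root_a == root_b:
            return root_a
        # Skip the merge when either cluster already exceeds the limit.
        if max(self.size[root_a], self.size[root_b]) > self.max_cluster_size:
            return None
        # Attach the smaller tree under the larger one.
        if self.size[root_a] < self.size[root_b]:
            root_a, root_b = root_b, root_a
        self.parent[root_b] = root_a
        self.size[root_a] += self.size.pop(root_b)
        return root_a
```

After `uf.union("cookie:1", "email:abc")`, both `uf.find("cookie:1")` and `uf.find("email:abc")` return the same root, which is what the canonical_id is keyed on.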
3 — Canonical ID Cache
Redis · Sub-ms Lookup
Hot canonical_id mappings are cached in Redis, one string key per signal: SET canonical:{signal_hash} canonical_id EX 86400. Per-signal keys are used rather than one large hash so that each mapping carries its own TTL. TTL=24h, refreshed on access. Cache miss: query Neptune graph DB (50-100ms) to resolve and backfill cache. Cache hit rate: ~95% for attribution hot path (most signals seen recently). Canonical_id is a ULID: lexicographically sortable and collision-resistant at scale.
def get_canonical(signal_hash):
    key = f'canonical:{signal_hash}'
    cid = redis.get(key)
    if cid is None:
        # Cache miss: resolve from Neptune and backfill with a 24h TTL.
        cid = neptune.resolve(signal_hash)
        if cid is not None:
            redis.set(key, cid, ex=86400)
    else:
        # Cache hit: refresh the TTL on access.
        redis.expire(key, 86400)
    return cid
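For readers unfamiliar with ULIDs, the sketch below generates one by hand following the ULID layout (48-bit millisecond timestamp followed by 80 random bits, encoded as 26 Crockford base32 characters). It is illustrative only; a production system would more likely use an existing ULID library, and the function name is hypothetical.

```python
import os
import time

# Crockford base32 alphabet (no I, L, O, U); characters are in ascending ASCII order.
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def new_ulid(timestamp_ms=None):
    """Return a 26-character ULID: 48-bit ms timestamp plus 80 random bits."""
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    value = (timestamp_ms << 80) | int.from_bytes(os.urandom(10), "big")
    chars = []
    for _ in range(26):
        chars.append(CROCKFORD[value & 0x1F])  # take the lowest 5 bits
        value >>= 5
    return "".join(reversed(chars))
```

Because the timestamp occupies the most significant bits and the encoding is fixed-width, IDs created in later milliseconds sort after earlier ones as plain strings.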
4 — Neptune Graph Store
Durable Source of Truth
Amazon Neptune stores the full identity graph as a property graph. Vertices: identity nodes with properties {signal_type, confidence, created_at, consent}. Edges: LINKED_TO with properties {merge_reason, merged_at}. Queries: find all signals for a canonical_id (1-hop traversal), find cluster size (degree query). Neptune synced from Redis Union-Find every 5 minutes via a Flink job that reads the Redis keyspace change stream.
g.V().has('canonical_id', cid)
  .out('LINKED_TO')
  .valueMap('signal_hash', 'signal_type', 'confidence')
  .toList()
⚖️
Trade-offs & Decision Log
Every senior interview comes down to these decisions. Know the exact trade-off, the reasoning, and the specific numbers that justify each choice.
⚖️ Union-Find vs Graph DB as Primary Identity Store
✓
Union-Find + Redis ✅ Chosen
- O(α(n)) ≈ O(1) amortised find(), where α is the inverse Ackermann function, giving sub-millisecond lookups
- Path compression keeps trees almost flat, so find() follows very few parent pointers
- Horizontally shardable by user_id_hash prefix
- Complex to persist durably — requires Redis AOF + periodic Neptune sync
→
Neptune / Neo4j as sole store
- Natural graph traversal for complex identity queries
- Built-in ACID transactions across merge operations
- 50-100ms per lookup — too slow for real-time attribution
- Expensive at billion-node scale
💡Decision: Union-Find in Redis for hot-path lookups; Neptune as durable source of truth; sync every 5min
⚖️ Eager vs Lazy Identity Merging
✓
Eager (merge on every event) ✅ Chosen
- Identity graph always consistent — attribution sees unified view
- New touchpoints immediately linked to canonical ID
- Higher write amplification on merge events
- Requires distributed lock (Redlock) for concurrent merges
→
Lazy (merge at query time)
- Lower write cost — no merge work during ingestion
- Attribution query must resolve identity on-the-fly
- Query latency spikes on large identity clusters
- Stale reads possible if query cache not invalidated
💡Decision: Eager merging for identity resolution; a nightly batch reconciliation job catches any merges the eager path missed
🎯Interview Questions — Answered
The exact questions interviewers ask — with production-grade answers
Q1
How do you prevent identity merges from creating huge ‘super-clusters’?
Super-cluster formation (where 10M users merge into one canonical ID via transitive links) is prevented by two controls: (1) Merge confidence threshold: probabilistic matches require cosine similarity ≥0.85, not just ≥0.5, to avoid low-confidence chains. (2) Cluster size cap: clusters exceeding 10,000 nodes are flagged for manual review and automatic merges into them are paused. This catches shared-device scenarios (library computers, call-centre PCs) where one device fingerprint would otherwise link millions of unrelated users. Deterministic merges (hashed email) bypass the size cap because they are treated as full-confidence matches (weight 1.0).
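The two controls and the deterministic bypass described in this answer can be written as a small decision function. This is a sketch under the thresholds quoted above; the MergeDecision type and the function name are hypothetical.

```python
from dataclasses import dataclass

SIMILARITY_THRESHOLD = 0.85  # from the answer above
MAX_CLUSTER_SIZE = 10_000    # from the answer above


@dataclass
class MergeDecision:
    allowed: bool
    reason: str


def decide_merge(deterministic, similarity, size_a, size_b):
    """Apply the confidence threshold and the cluster size cap to a candidate merge."""
    if deterministic:
        # Hashed-email matches bypass the size cap.
        return MergeDecision(True, "deterministic match")
    if similarity < SIMILARITY_THRESHOLD:
        return MergeDecision(False, "similarity below threshold")
    if max(size_a, size_b) > MAX_CLUSTER_SIZE:
        return MergeDecision(False, "cluster over cap: flag for manual review")
    return MergeDecision(True, "probabilistic match accepted")
```

A fingerprint match at 0.9 similarity into a 20,000-node cluster is rejected and flagged, while the same match into a 5-node cluster is accepted.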
Q2
How is the identity graph made GDPR-compliant?
Four mechanisms: (1) No raw PII stored: only HMAC-SHA256(email, secret_key) hashes are kept, and recovering the email without the secret key is computationally infeasible. (2) Right to erasure: delete the canonical_id from Union-Find and Neptune; orphaned nodes auto-expire via TTL. (3) Consent flag on every node: nodes without consent=true are excluded from probabilistic merging. (4) Data minimisation: the identity graph stores only the minimum (canonical_id, signal_type, confidence_score, created_at). No behavioural data is stored in the identity system itself.
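Mechanisms (2) and (3) can be illustrated with plain dictionaries standing in for the Redis mapping and the Neptune vertices. Both function names and the in-memory stores are assumptions made for the sketch; real erasure would also have to propagate to downstream systems.

```python
def erase_canonical_id(canonical_id, signal_to_cid, nodes):
    """Remove every signal mapped to canonical_id; return the erased signal hashes."""
    erased = [s for s, cid in signal_to_cid.items() if cid == canonical_id]
    for signal_hash in erased:
        del signal_to_cid[signal_hash]
        nodes.pop(signal_hash, None)  # drop the graph node if present
    return erased


def eligible_for_probabilistic_merge(nodes):
    """Only nodes with consent == True may take part in probabilistic merging."""
    return [h for h, props in nodes.items() if props.get("consent") is True]
```

Returning the erased hashes gives the caller a record to pass to an audit log or to the webhook that notifies downstream consumers.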
Q3
What is the update latency from new signal to canonical ID resolution?
For deterministic signals (hashed email): <500ms end-to-end. The event hits Kafka, the merge worker consumes it and updates the Union-Find in Redis, and the canonical_id is available for the next event. For probabilistic signals: up to 5 minutes, because the fingerprint vector must be scored against existing clusters in a cosine-similarity batch that runs every 5 minutes. Hot-path attribution always uses the most recent canonical_id from the Redis cache (24h TTL). Stale canonical_ids affect <0.1% of attribution windows given the 5-minute correction lag.