System Design #03: Cross-Platform Identity Resolution System

A user browses your site on their work laptop, clicks an ad on their phone, and converts on a shared family tablet: three devices, one person, and no connection between them without a robust identity resolution system. This system underpins accurate attribution, personalisation, and fraud detection at scale: it deterministically links identities using hashed email and first-party cookies, and falls back to probabilistic fingerprint matching (cosine similarity ≥ 0.85) for anonymous sessions. The core data structure is a path-compressed Union-Find held in Redis for sub-millisecond canonical-ID lookups, with Amazon Neptune as the durable graph store. After this guide, you will know how to design a privacy-safe, GDPR-compliant identity graph that stitches 100M+ user profiles across 1B+ identity edges without ever storing raw PII.

Union-Find · Amazon Neptune · Redis · HMAC-SHA256 · GDPR

💡

The Gist — What Problem Are We Solving?

Recognising the same person across all their devices

A user sees an ad on their phone, searches on a laptop, and converts on a tablet. Without identity resolution those look like 3 different people. This system links cookies, hashed emails, and mobile IDs into one canonical profile, enabling accurate attribution across every touchpoint while staying GDPR-compliant.

💬

Think of it as a detective that realises the phone, laptop, and tablet user are all the same person — and updates every record accordingly.

Functional Requirements

These are the capabilities the system must deliver — what users and operators can actually do with it.

🔗 Signal Ingestion: Accept browser cookie, IDFA/GAID, hashed email, loyalty ID, CRM ID

🧩 Deterministic Matching: Exact match on hashed email or authenticated user ID

🎲 Probabilistic Matching: Cosine similarity ≥0.85 on device fingerprint vector

🔐 Privacy and Consent: Consent gate before any PII processing; GDPR erasure within 30 days

📡 Serving API: <50ms lookup: signal → canonical_id; profile API for full identity cluster

Non-Functional Requirements

These define how well the system must perform — the quality attributes that separate a toy from a production system.

⚡ Lookup Latency: <50ms p99 for signal → canonical_id resolution

📈 Scale: 100M+ unique users; 1B+ identity edges

⏱️ Graph Update: <5s from new signal to graph propagation

🛡️ Privacy: HMAC-SHA256 all PII; zero plain-text storage; consent-gated

📅 Erasure SLA: GDPR erasure propagated to all systems within 30 days

📊

Key Metrics — The Numbers That Define This System

The headline numbers to know cold, and be ready to explain how each one is achieved.

<50ms lookup p99 · 100M+ users · 1B+ identity edges · <5s graph update · 30 days GDPR erasure SLA

🏗️

System Architecture Diagram

Full data flow from source to serving. Each layer scales independently.

Ingestion: Signal APIs → Kafka → Resolution Service (Union-Find) → Identity Graph (Neptune/Redis)

Processing: Canonical ID Store (Postgres) → Identity API + Merge Webhooks

🗺️

End-to-End User Journey

Trace a single request end-to-end — the story interviewers want you to tell fluently.

1. New browser signal: first visit; cookie_id generated and stored in Redis with a 365-day TTL; graph node created

2. Email captured: user fills checkout form; HMAC-SHA256 hashed email ingested; deterministic match attempted

3. Match found: hash matches existing node → Union-Find UNION → canonical_id assigned

4. Graph updated: Neptune edge created; Postgres canonical store updated; webhook fired to downstream consumers

🔭

High-Level Design — Component Breakdown

Core components — each with a single, well-defined responsibility. The key architectural insight: each layer scales independently, and failure in one component is isolated from the rest.

1 — Signal API

REST entry point for identity signals (first-party cookie, hashed email, mobile advertising ID, device fingerprint). It applies the consent gate, answers with a synchronous Redis canonical_id lookup, and publishes each signal to Kafka so that graph updates happen asynchronously. Designed for independent horizontal scaling: instances are added without architectural changes.

2 — Kafka

Distributed event bus with RF=3 for durability. Partitioned by user_id_hash for per-user ordering. LZ4 compression cuts storage cost by around 60%, though the ratio depends on payload shape. Exactly-once semantics via idempotent, transactional producers and read_committed consumers.
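
To illustrate why keying by user_id_hash gives per-user ordering, here is a minimal sketch of key-based partition selection. The `partition_for` helper and the partition count of 12 are assumptions for illustration, and SHA-256 is only a stand-in: Kafka's default partitioner hashes keys with murmur2.

```python
import hashlib

NUM_PARTITIONS = 12  # assumed partition count, not from the original design


def partition_for(user_id_hash: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition deterministically (stand-in for Kafka's murmur2)."""
    digest = hashlib.sha256(user_id_hash.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


if __name__ == "__main__":
    events = [("user-a", "cookie"), ("user-b", "hashed_email"), ("user-a", "hashed_email")]
    for key, signal_type in events:
        # Both "user-a" events land on the same partition, so they stay ordered.
        print(key, signal_type, "-> partition", partition_for(key))
```

Because the mapping depends only on the key, every signal for one user is consumed in order by a single partition's consumer.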

3 — Resolver

Kafka consumer that turns signals into merges. Under a Redlock lock it runs find() on both signals and calls union() on the Redis Union-Find when the roots differ and the cluster size cap allows it, then invalidates the affected canonical_id cache entries. Designed for independent horizontal scaling: instances are added without architectural changes.

4 — Neptune

Managed graph database for identity graph and relationship queries. Property graph model. Millisecond queries for 1-hop traversals; batch Gremlin queries for multi-hop analysis. Synced from the Redis Union-Find every 5 minutes.

5 — Postgres

Relational store for canonical ID records (the Canonical ID Store in the diagram), updated after each merge. ACID guarantees keep canonical_id assignments consistent; read replicas serve analytics queries; connection pooling via PgBouncer.

6 — Redis

In-memory data structure server handling hot-path lookups in <1ms. In this design it holds the Union-Find parent pointers, the canonical ID cache, and the Redlock merge locks. Cluster mode with 6 shards for horizontal scale.

🔬

Low-Level Design — Deep Dives

Deep dives worth explaining in detail in any senior engineering interview. For each: know the data structure, the algorithm, the why, and the trade-off you made.

1 — Signal Ingestion API
Deterministic + Probabilistic

REST API accepts identity signals: hashed_email, first_party_cookie_id, device_fingerprint, mobile_advertising_id. Each signal type has a confidence weight: hashed_email=1.0, first-party cookie=0.9, MAID=0.85, fingerprint=0.7 (applied only when cosine similarity ≥0.85). Signals are published to the Kafka identity_signals topic for async graph updates. The API responds in <20ms because it performs only a synchronous Redis canonical_id lookup; the graph update is async.

POST /v1/identity/signal
{
  "signal_type": "hashed_email",
  "value": "sha256:a3f…",
  "context": {"session_id": "…", "consent": true}
}
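
A request handler for this endpoint could be sketched as below, assuming the payload shape shown above. The `Signal` class and the `accept_signal` and `confidence_of` helpers are hypothetical names introduced for illustration; the weights come from the description above.

```python
from dataclasses import dataclass

# Confidence weights per signal type, taken from the description above.
CONFIDENCE = {
    "hashed_email": 1.0,
    "first_party_cookie_id": 0.9,
    "mobile_advertising_id": 0.85,
    "device_fingerprint": 0.7,  # applies only when cosine similarity >= 0.85
}


@dataclass(frozen=True)
class Signal:
    signal_type: str
    value: str
    consent: bool


def accept_signal(payload: dict) -> Signal:
    """Validate a request body and apply the consent gate (hypothetical helper)."""
    signal_type = payload.get("signal_type")
    if signal_type not in CONFIDENCE:
        raise ValueError(f"unknown signal_type: {signal_type!r}")
    value = payload.get("value")
    if not value:
        raise ValueError("missing signal value")
    # Consent gate: refuse to process anything without explicit consent.
    if not payload.get("context", {}).get("consent", False):
        raise PermissionError("consent is required before processing")
    return Signal(signal_type, value, True)


def confidence_of(signal: Signal) -> float:
    """Look up the confidence weight for an accepted signal."""
    return CONFIDENCE[signal.signal_type]
```

Rejecting unconsented payloads before anything is published to Kafka keeps the consent gate ahead of all downstream processing.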

2 — Graph Merge Worker
Union-Find · Redlock

Kafka consumer reads identity_signals and performs merge operations on the Union-Find graph stored in Redis. Concurrent merges for the same canonical_id are serialised using Redlock (3-node quorum, 100ms TTL). Merge logic: find(signal_a) → root_a, find(signal_b) → root_b. If root_a != root_b and neither exceeds cluster size limit (10K nodes): union(root_a, root_b). Post-merge: invalidate canonical_id cache entries for affected roots.

# Lock on the ordered pair so (a, b) and (b, a) contend for the same key
with redlock.lock(f'merge:{min(a, b)}:{max(a, b)}'):
    root_a = union_find.find(a)
    root_b = union_find.find(b)
    if root_a != root_b:
        union_find.union(root_a, root_b)
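
The `union_find` object used above is not defined in this guide. A minimal in-memory sketch, assuming a dict-backed structure rather than the Redis-backed one described, could look like this. The class name, the `bypass_cap` flag (modelling deterministic merges that skip the size cap), and the union-by-size heuristic are illustrative assumptions.

```python
class UnionFind:
    """In-memory Union-Find with path compression, union by size, and a size cap."""

    def __init__(self, max_cluster_size=10_000):
        self.parent = {}  # node -> parent node
        self.size = {}    # root -> number of nodes in its cluster
        self.max_cluster_size = max_cluster_size

    def find(self, x):
        # Register unseen signals as singleton clusters.
        if x not in self.parent:
            self.parent[x] = x
            self.size[x] = 1
            return x
        # First pass: walk up to the root.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Second pass (path compression): point every visited node at the root.
        while self.parent[x] != root:
            next_node = self.parent[x]
            self.parent[x] = root
            x = next_node
        return root

    def union(self, a, b, bypass_cap=False):
        """Merge two clusters; return False if already merged or refused by the cap."""
        root_a, root_b = self.find(a), self.find(b)
        if root_a == root_b:
            return False
        largest = max(self.size[root_a], self.size[root_b])
        if not bypass_cap and largest > self.max_cluster_size:
            return False  # oversized cluster: leave for manual review
        # Union by size: attach the smaller tree under the larger root.
        if self.size[root_a] < self.size[root_b]:
            root_a, root_b = root_b, root_a
        self.parent[root_b] = root_a
        self.size[root_a] += self.size.pop(root_b)
        return True


if __name__ == "__main__":
    uf = UnionFind()
    uf.union("cookie:123", "email:abc")
    uf.union("email:abc", "maid:xyz")
    print(uf.find("cookie:123") == uf.find("maid:xyz"))
```

In the Redis-backed version, the parent and size maps would live in Redis and the whole `union` call would run under the lock shown above.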

3 — Canonical ID Cache
Redis · Sub-ms Lookup

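Hot canonical_id mappings are cached in Redis as one string key per signal: SET canonical:{signal_hash} canonical_id EX 86400. Per-signal keys are used rather than fields of a single hash because EXPIRE applies to a whole key, not to an individual hash field, so each mapping can expire on its own. TTL=24h, refreshed on access. Cache miss: query Neptune graph DB (50-100ms) to resolve and backfill cache. Cache hit rate: ~95% for attribution hot path (most signals seen recently). Canonical_id is a ULID: lexicographically sortable, with 80 random bits per millisecond that make collisions negligible at this scale.

def get_canonical(signal_hash):
    key = f'canonical:{signal_hash}'
    cid = redis.get(key)
    if cid is None:
        cid = neptune.resolve(signal_hash)  # cache miss: 50-100ms graph lookup
        redis.set(key, cid, ex=86400)       # backfill with a 24h TTL
    else:
        redis.expire(key, 86400)            # refresh the TTL on access
    return cid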
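
For readers unfamiliar with ULIDs, here is a minimal sketch of how one is built: a 48-bit millisecond timestamp followed by 80 random bits, encoded as 26 Crockford base32 characters. The `new_ulid` function is a hypothetical helper, and this version does not add the optional monotonic ordering within a single millisecond.

```python
import os
import time

# Crockford base32 alphabet (no I, L, O, U), in ascending ASCII order.
CROCKFORD32 = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def new_ulid(timestamp_ms=None):
    """Return a 26-character ULID: 48-bit ms timestamp + 80 random bits."""
    if timestamp_ms is None:
        timestamp_ms = time.time_ns() // 1_000_000
    randomness = int.from_bytes(os.urandom(10), "big")  # 80 bits
    value = ((timestamp_ms & ((1 << 48) - 1)) << 80) | randomness
    # 26 characters x 5 bits = 130 bits; the top 2 bits are always zero.
    chars = []
    for _ in range(26):
        chars.append(CROCKFORD32[value & 0x1F])
        value >>= 5
    return "".join(reversed(chars))


if __name__ == "__main__":
    print(new_ulid())
```

Because the timestamp occupies the leading characters and the alphabet is in ascending ASCII order, plain string comparison orders IDs by creation time at millisecond granularity.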

4 — Neptune Graph Store
Durable Source of Truth

Amazon Neptune stores the full identity graph as a property graph. Vertices: identity nodes with properties {canonical_id, signal_hash, signal_type, confidence, created_at, consent}. Edges: LINKED_TO with properties {merge_reason, merged_at}. Queries: find all signals for a canonical_id (1-hop traversal), find cluster size (degree query). Neptune is synced from the Redis Union-Find every 5 minutes via a Flink job that reads Redis keyspace notifications.

// Fetch every signal linked to a canonical_id, one property map per vertex
g.V().has('canonical_id', cid)
  .out('LINKED_TO')
  .valueMap('signal_hash', 'signal_type', 'confidence')
  .toList()

⚖️

Trade-offs & Decision Log

Every senior interview comes down to these decisions. Know the exact trade-off, the reasoning, and the specific numbers that justify each choice.

⚖️ Union-Find vs Graph DB as Primary Identity Store


Union-Find + Redis ✅ Chosen
  • O(α(n)) ≈ O(1) amortised find(): sub-millisecond lookups
  • Path compression flattens parent chains, so repeated find() calls touch only a few pointers
  • Horizontally shardable by user_id_hash prefix
  • Complex to persist durably — requires Redis AOF + periodic Neptune sync

Neptune / Neo4j as sole store
  • Natural graph traversal for complex identity queries
  • Built-in ACID transactions across merge operations
  • 50-100ms per lookup — too slow for real-time attribution
  • Expensive at billion-node scale

💡

Decision: Union-Find in Redis for hot-path lookups; Neptune as durable source of truth; sync every 5min
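
As a rough check on this decision, the blended lookup latency can be estimated from the figures quoted in this guide. The 75ms miss latency is an assumed midpoint of the 50-100ms Neptune range, and 1ms is used as an upper bound for a Redis hit.

```python
HIT_RATE = 0.95         # ~95% cache hit rate quoted for the hot path
HIT_LATENCY_MS = 1.0    # upper bound for a Redis hit (<1ms)
MISS_LATENCY_MS = 75.0  # assumed midpoint of the 50-100ms Neptune lookup


def expected_latency_ms(hit_rate, hit_ms, miss_ms):
    """Weighted average of hit and miss latency."""
    return hit_rate * hit_ms + (1.0 - hit_rate) * miss_ms


if __name__ == "__main__":
    mean = expected_latency_ms(HIT_RATE, HIT_LATENCY_MS, MISS_LATENCY_MS)
    print(f"expected lookup latency: {mean:.1f} ms")
```

Under these assumptions the mean comes to about 4.7ms. Each individual miss still costs 50-100ms, so with roughly 5% of lookups missing, the <50ms p99 target would depend on a higher hit rate or a faster miss path.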

⚖️ Eager vs Lazy Identity Merging


Eager (merge on every event) ✅ Chosen
  • Identity graph always consistent — attribution sees unified view
  • New touchpoints immediately linked to canonical ID
  • Higher write amplification on merge events
  • Requires distributed lock (Redlock) for concurrent merges

Lazy (merge at query time)
  • Lower write cost — no merge work during ingestion
  • Attribution query must resolve identity on-the-fly
  • Query latency spikes on large identity clusters
  • Stale reads possible if query cache not invalidated

💡

Decision: Eager merging for identity resolution; a nightly batch reconciliation job catches any merges the eager path missed

🎯

Interview Questions — Answered
The exact questions interviewers ask — with production-grade answers

Q1
How do you prevent identity merges from creating huge ‘super-clusters’?

Super-cluster formation (where 10M users merge into one canonical ID via transitive links) is prevented by two controls: (1) Merge confidence threshold: probabilistic matches require cosine similarity ≥0.85, not just ≥0.5, to avoid low-confidence chains. (2) Cluster size cap: clusters exceeding 10,000 nodes are flagged for manual review, and automatic merges into them are paused. This catches shared-device scenarios (library computers, call-centre PCs) where one device fingerprint would otherwise link millions of unrelated users. Deterministic merges (hashed email) bypass the size cap because an exact hash match carries confidence 1.0.

Q2
How is the identity graph made GDPR-compliant?

Four mechanisms: (1) No raw PII stored: only HMAC-SHA256 hashes of the email under a secret key; reversing a hash without the key is computationally infeasible. (2) Right to erasure: delete the canonical_id from the Union-Find, Neptune, the Postgres canonical store, and the Redis cache; orphaned nodes auto-expire via TTL. (3) Consent flag on every node: nodes without consent=true are excluded from probabilistic merging. (4) Data minimisation: the identity graph stores only canonical_id, signal_type, confidence_score, and created_at. No behavioural data is stored in the identity system itself.
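
The keyed hashing in mechanism (1) can be sketched with Python's standard library. The `hash_email` helper and the trim-and-lowercase normalisation are assumptions for illustration; the demo key is a placeholder, and a real key would come from a secrets manager.

```python
import hashlib
import hmac


def hash_email(email: str, secret_key: bytes) -> str:
    """Return the hex HMAC-SHA256 of a normalised email address."""
    # Normalisation rule (trim + lowercase) is an assumption, not from the guide.
    normalised = email.strip().lower()
    return hmac.new(secret_key, normalised.encode("utf-8"), hashlib.sha256).hexdigest()


if __name__ == "__main__":
    demo_key = b"demo-key-not-for-production"  # placeholder key
    print(hash_email(" Alice@Example.com ", demo_key))
```

Normalising before hashing means the same address typed with different casing or stray whitespace still resolves to the same graph node, which deterministic matching relies on.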

Q3
What is the update latency from new signal to canonical ID resolution?

For deterministic signals (hashed email): <500ms end-to-end. The event hits Kafka, the merge worker consumes it and updates the Union-Find in Redis, and the canonical_id is available for the next event. For probabilistic signals: up to 5 minutes, because the fingerprint vector must be scored against existing clusters in a cosine-similarity batch that runs every 5 minutes; the <5s graph-update target is therefore met only on the deterministic path. Hot-path attribution always uses the most recent canonical_id from Redis cache (24h TTL). Stale canonical_ids affect <0.1% of attribution windows given the 5-minute correction lag.
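
The batch scoring step for probabilistic signals could be sketched as follows, assuming each cluster is represented by a single centroid vector. The `best_match` helper, the centroid representation, and the sample vectors are illustrative assumptions; only the 0.85 threshold comes from this guide.

```python
import math

SIMILARITY_THRESHOLD = 0.85  # minimum cosine similarity for a probabilistic match


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(fingerprint, clusters):
    """Return the canonical_id of the most similar cluster at or above the threshold."""
    best_id, best_score = None, SIMILARITY_THRESHOLD
    for canonical_id, centroid in clusters.items():
        score = cosine_similarity(fingerprint, centroid)
        if score >= best_score:
            best_id, best_score = canonical_id, score
    return best_id


if __name__ == "__main__":
    clusters = {"cid-A": [1.0, 0.0, 0.0], "cid-B": [0.0, 1.0, 0.0]}
    print(best_match([0.9, 0.1, 0.0], clusters))
    print(best_match([0.0, 0.0, 1.0], clusters))
```

A fingerprint that clears the threshold for no cluster returns None and stays unmerged, which is what keeps low-confidence chains out of the graph.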
