Design Facebook Messenger / WhatsApp

1. What is a Chat System?

A chat system allows users to send text, images, and videos to each other in real-time. It supports 1-on-1 conversations and Group chats.

Real-World Examples

WhatsApp / Signal: Store messages on device (mostly). End-to-end encrypted. Architecture is “Store-and-Forward”.
FB Messenger / Slack: Store messages on server. Multi-device sync is critical. We will focus on this Cloud-Based architecture.

Try it yourself

Open WhatsApp Web and your phone. Send a message from your phone. It appears on the Web instantly. Turn off your phone’s internet. The Web version says “Phone not connected” (for WhatsApp) or continues working (for Messenger). Why?

2. Requirements & Goals

Functional Requirements

1-on-1 Chat: Low latency delivery.
Group Chat: Up to 500 members.
Presence: “Online”, “Last Seen”, “Typing…”.
Receipts: Sent, Delivered, Read.
Multi-Device: Sync messages across phone and desktop.

Non-Functional Requirements

Low Latency: Real-time experience is key (< 100ms).
Consistency: Messages must appear in order (FIFO).
Availability: High.
Security:
- TLS: Encryption in transit.
- E2E (End-to-End): Using the Signal Protocol (Double Ratchet Algorithm). Keys are stored on user devices, not servers. The server only sees encrypted blobs.

3. Capacity Estimation

DAU: 2 Billion.
Messages: 100 Billion per day.
Storage:
- Avg msg size = 100 Bytes.
- 100 Billion messages * 100 Bytes = 10 TB / day.
- 5 Years = 10 TB * 365 * 5 ≈ 18 PB.
- We need a massive, write-heavy database.
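
The arithmetic above can be checked with a few lines of Python. This is a rough sketch: it uses decimal units (1 TB = 10^12 bytes) and counts raw message payload only, so replication, indexes, and metadata are not included. The average write rate is derived from the same figures and is not stated in the text.

```python
# Back-of-the-envelope storage estimate using the figures stated above.
MESSAGES_PER_DAY = 100 * 10**9   # 100 billion messages
AVG_MESSAGE_BYTES = 100          # average message size
DAYS_PER_YEAR = 365
YEARS = 5

TB = 10**12  # decimal terabyte
PB = 10**15  # decimal petabyte

daily_bytes = MESSAGES_PER_DAY * AVG_MESSAGE_BYTES
five_year_bytes = daily_bytes * DAYS_PER_YEAR * YEARS
avg_writes_per_second = MESSAGES_PER_DAY / 86_400  # seconds per day

print(f"Per day:        {daily_bytes / TB:.0f} TB")
print(f"5 years:        {five_year_bytes / PB:.2f} PB")
print(f"Avg writes/sec: {avg_writes_per_second:,.0f}")
```

The average comes out above one million writes per second, which is why the storage layer has to be write-optimized.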

4. System APIs

We need a mix of REST (for profile/auth) and WebSocket (for chat).

WebSocket Events (Bidirectional)

sendMessage(chat_id, content)
receiveMessage(sender_id, content)
userStatusChanged(user_id, status)
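
One possible wire format for these events is a small JSON object per WebSocket frame. The sketch below is illustrative only: the `event` field, the class names, and the `encode`/`decode` helpers are assumptions, not part of any real Messenger or WhatsApp API.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class SendMessage:
    chat_id: str
    content: str
    event: str = "sendMessage"


@dataclass
class ReceiveMessage:
    sender_id: str
    content: str
    event: str = "receiveMessage"


@dataclass
class UserStatusChanged:
    user_id: str
    status: str
    event: str = "userStatusChanged"


# Lookup table from the "event" field to the class that models it.
EVENT_TYPES = {
    "sendMessage": SendMessage,
    "receiveMessage": ReceiveMessage,
    "userStatusChanged": UserStatusChanged,
}


def encode(event) -> str:
    """Serialize an event to the JSON text sent in one WebSocket frame."""
    return json.dumps(asdict(event))


def decode(frame: str):
    """Parse a JSON frame back into the matching event object."""
    data = json.loads(frame)
    return EVENT_TYPES[data["event"]](**data)


frame = encode(SendMessage(chat_id="chat-123", content="hello"))
print(frame)
print(decode(frame))
```

Tagging every frame with its event name lets both directions share one connection and one parser.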

5. Database Design

We need a database that handles extremely high write throughput and efficient range queries (to load chat history).

MySQL?

Possible, but index maintenance becomes heavy at 100B writes/day.

HBase / Cassandra (Wide Column Store)

This is our choice.
Key: chat_id (Partition Key).
Clustering Key: message_id (Sort Key, Snowflake).
Value: Message content.
Allows very fast “Get me the last 50 messages for Chat 123” queries.
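
To see why this key layout makes the "last 50 messages" query cheap, here is a minimal in-memory model of it. It is a sketch, not a Cassandra or HBase client: `MessageStore` is a hypothetical class, each partition is a Python list kept sorted by `message_id`, and message IDs are assumed to be unique within a chat.

```python
import bisect
from collections import defaultdict


class MessageStore:
    """Toy wide-column layout: partition by chat_id, sort rows by message_id."""

    def __init__(self):
        self._ids = defaultdict(list)    # chat_id -> sorted message_ids
        self._rows = defaultdict(dict)   # chat_id -> {message_id: content}

    def append(self, chat_id, message_id, content):
        # Keep the partition ordered by the clustering key on every write.
        bisect.insort(self._ids[chat_id], message_id)
        self._rows[chat_id][message_id] = content

    def latest(self, chat_id, limit=50):
        # Only one partition is touched, and only its tail is read.
        ids = self._ids[chat_id][-limit:]
        return [(i, self._rows[chat_id][i]) for i in ids]


store = MessageStore()
for message_id in range(1, 101):
    store.append("chat-123", message_id, f"message {message_id}")

recent = store.latest("chat-123")
print(recent[0][0], recent[-1][0], len(recent))
```

Because all messages of a chat live together and are already sorted, the query is a slice of one partition rather than a scan or a sort.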

6. High-Level Design

Real-Time Message Routing Architecture.

[Diagram: System Architecture: Real-Time Chat System. Legend: WebSocket Connection, Message Route, Discovery / Async.]

Clients: Sender and Receiver, each holding a WebSocket connection.
Chat Cloud: A Load Balancer in front of Stateful Chat Servers. Chat Server 1 holds the Stateful Connection for User A; Chat Server 2 holds the Stateful Connection for User B.
Storage & Discovery: Service Discovery (Etcd / ZooKeeper) maps User -> Server. HBase / Cassandra (Wide Column Store) holds the Message Sync Log.
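
The routing path in this design can be sketched in a few lines. Everything here is a stand-in: `directory` is a plain dict playing the role of Etcd / ZooKeeper, `sync_log` is a list playing the role of the message store, and a "connection" is just a list of delivered messages. Persisting before routing is an assumption of this sketch, chosen so that an offline receiver can still fetch the message later.

```python
class ChatServer:
    """A stateful server that remembers which users are connected to it."""

    def __init__(self, name):
        self.name = name
        self.connections = {}  # user_id -> list of delivered messages

    def connect(self, user_id, directory):
        self.connections[user_id] = []
        directory[user_id] = self  # register User -> Server mapping

    def deliver(self, user_id, message):
        self.connections[user_id].append(message)


def route(directory, sync_log, sender_id, receiver_id, content):
    """Persist the message, then push it if the receiver is connected."""
    sync_log.append((sender_id, receiver_id, content))
    server = directory.get(receiver_id)
    if server is None:
        return None  # receiver offline: they will sync on reconnect
    server.deliver(receiver_id, (sender_id, content))
    return server.name


directory, sync_log = {}, []
server_1, server_2 = ChatServer("chat-server-1"), ChatServer("chat-server-2")
server_1.connect("user_a", directory)
server_2.connect("user_b", directory)

print(route(directory, sync_log, "user_a", "user_b", "hi"))
print(route(directory, sync_log, "user_a", "user_c", "are you there?"))
```

The sender never needs to know which server the receiver is on; the discovery lookup hides that.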

7. Component Design: Connection Protocols

How do we keep the connection open? See Polling vs Push.

A. Polling (The Old Way)

Client asks “Any new messages?” every 1 second.
Pros: Simple HTTP.
Cons: Wasted resources. Most requests return nothing, so server load is high even if no one is talking. Latency can be as high as the polling interval (up to 1 second here).

B. Long Polling

Client asks “Any new messages?”. Server holds the connection open until a message arrives (or timeout).
Pros: Less load than polling.
Cons: The client must issue a new request after every response, so the connection setup overhead is still there.

C. WebSockets (The Modern Way)

Bidirectional, persistent TCP connection.
Server can push to Client instantly.
Pros: Lowest latency, lowest overhead.
Cons: Needs stateful servers (Server must know who is connected to it).
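
A small simulation makes the difference in request volume concrete. This is a simplified model under stated assumptions: time is in whole seconds, the long-poll timeout of 30 seconds is an arbitrary choice, and a long-poll client re-asks immediately after every response.

```python
def polling_requests(duration_s, interval_s=1):
    """Short polling: one request per interval, whether or not anything arrived."""
    return duration_s // interval_s


def long_polling_requests(message_times, duration_s, timeout_s=30):
    """Long polling: each request ends when a message arrives or the timeout fires."""
    pending = sorted(message_times)
    requests, now = 0, 0
    while now < duration_s:
        requests += 1
        deadline = now + timeout_s
        if pending and pending[0] <= deadline:
            now = max(now, pending.pop(0))  # response carries the message
        else:
            now = deadline                  # empty response, client re-asks
    return requests


def websocket_handshakes(duration_s):
    """WebSocket: a single upgrade handshake, then frames on the same connection."""
    return 1


# One minute of traffic with two incoming messages, at t=10s and t=50s.
print(polling_requests(60))
print(long_polling_requests([10, 50], 60))
print(websocket_handshakes(60))
```

For this one-minute window the model counts 60 requests for polling, 4 for long polling, and a single handshake for the WebSocket.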

8. Group Chat Optimization

Group chats introduce a “Fanout” problem.

Small Groups (up to 500 members)

Push Model: When User A sends a message, the server loops through all members (at most 500), finds their WebSocket connections, and pushes the message.
Why: 500 lookups are fast.

Large Groups / Channels (> 5000 members)

Pull Model (or Hybrid): We don’t push to everyone.
Online users might be listening to a Pub/Sub channel (e.g., Kafka topic).
Inactive users will just fetch the history when they open the app next time.
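
The choice between the two models can be expressed as a small planning function. This is a sketch: `plan_fanout` and its return shape are hypothetical, and because the text does not say what happens between 500 and 5000 members, the sketch assumes every group above the push limit uses the pull model.

```python
def plan_fanout(member_ids, connected, push_limit=500):
    """Decide how one group message is delivered.

    member_ids: all members of the group.
    connected:  set of user_ids that currently have an open WebSocket.
    """
    if len(member_ids) <= push_limit:
        # Small group: push directly to each member who is connected.
        targets = [m for m in member_ids if m in connected]
        return {"mode": "push", "direct_pushes": targets, "channel_publishes": 0}
    # Large group: publish once to a Pub/Sub channel; online users are
    # subscribed, inactive users fetch history when they next open the app.
    return {"mode": "pull", "direct_pushes": [], "channel_publishes": 1}


small_group = ["alice", "bob", "carol"]
large_group = [f"user_{i}" for i in range(5001)]

print(plan_fanout(small_group, connected={"alice", "carol"}))
print(plan_fanout(large_group, connected={"user_1"})["mode"])
```

The cost of the push model grows with group size, while the pull model costs one publish per message regardless of how many members there are.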

9. Message Synchronization & Reliability

How do we ensure messages are delivered in order and synced across devices?

Sequence IDs & Multi-Device Sync

We cannot rely on timestamp (clock skew). We use Sequence IDs per chat.

The Sync Protocol:

State: Each device (Phone, Laptop) maintains a local LastReadID.
Reconnect: When a device connects, it sends SYNC(LastReadID).
Fetch: Server queries HBase/Cassandra: SELECT * FROM msgs WHERE chat_id=123 AND msg_id > LastReadID.
Forward: Server pushes missing messages to that specific socket.
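
The four steps above can be sketched as follows. `ChatLog` is a hypothetical in-memory stand-in for the HBase/Cassandra partition of one chat, and `socket` is a plain list standing in for a device's connection.

```python
import bisect


class ChatLog:
    """One chat's messages, kept sorted by msg_id."""

    def __init__(self):
        self._ids = []       # sorted msg_ids
        self._content = {}   # msg_id -> content

    def append(self, msg_id, content):
        bisect.insort(self._ids, msg_id)
        self._content[msg_id] = content

    def since(self, last_read_id):
        """Equivalent of: WHERE msg_id > last_read_id, in ascending order."""
        start = bisect.bisect_right(self._ids, last_read_id)
        return [(i, self._content[i]) for i in self._ids[start:]]


def handle_sync(log, last_read_id, socket):
    """Push every missed message to one device and return its new LastReadID."""
    missing = log.since(last_read_id)
    socket.extend(missing)
    return missing[-1][0] if missing else last_read_id


log = ChatLog()
for msg_id in range(1, 9):
    log.append(msg_id, f"message {msg_id}")

laptop = []
new_last_read_id = handle_sync(log, last_read_id=5, socket=laptop)
print([msg_id for msg_id, _ in laptop], new_last_read_id)
```

A device that last saw message 5 receives messages 6, 7, and 8 and advances its LastReadID to 8. Calling `handle_sync` again with the new value returns nothing, so reconnects are safe to repeat.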

The Sequence ID Solution

Why not just use time? Because server clocks drift.

Logical Clocks: A Sequence ID is a counter (1, 2, 3…) unique to a Chat Room.
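Implementation: This is tricky. You can't use a single global counter shared by all chats (slow), so we keep one counter per chat.
Solution: Since one person writes at a time (usually), contention on each counter is low. The client can propose a temp ID, and the server assigns the final authoritative ID.
K-V Store: We store Current_Max_ID in Redis for each Chat ID. INCR is atomic, so two concurrent senders in the same chat never receive the same ID.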
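
A minimal sketch of the per-chat counter follows. It does not talk to Redis: `SequenceAllocator` is a hypothetical in-process stand-in that uses a lock to mimic the atomicity of INCR, and the `temp_id` field shows how a client-proposed ID could be mapped to the authoritative one.

```python
import threading
from collections import defaultdict


class SequenceAllocator:
    """In-memory stand-in for one atomic counter per chat."""

    def __init__(self):
        self._lock = threading.Lock()
        self._current_max = defaultdict(int)  # chat_id -> Current_Max_ID

    def next_id(self, chat_id):
        # The lock makes read-increment-return a single atomic step.
        with self._lock:
            self._current_max[chat_id] += 1
            return self._current_max[chat_id]


def accept_message(allocator, chat_id, temp_id, content):
    """Server side: replace the client's temporary ID with the final one."""
    return {
        "chat_id": chat_id,
        "temp_id": temp_id,  # echoed back so the client can match its draft
        "msg_id": allocator.next_id(chat_id),
        "content": content,
    }


allocator = SequenceAllocator()
print(accept_message(allocator, "chat-123", "tmp-a", "hello"))
print(accept_message(allocator, "chat-123", "tmp-b", "world"))
print(accept_message(allocator, "chat-456", "tmp-c", "other chat"))
```

Each chat counts independently, so a busy chat never slows down ID assignment in another one.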

Active Session Sync:

Suppose User A is online on both Phone and Laptop.
The Chat Server detects two active WebSocket connections for UserA.
When a message arrives for UserA, the server fans out the message to all active sockets.
Both devices receive the message instantly.
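
The fan-out to a user's active devices can be sketched with a registry keyed by user. `SessionRegistry` is a hypothetical name, and each "socket" is modeled as a list that collects delivered messages.

```python
from collections import defaultdict


class SessionRegistry:
    """Tracks every active connection of every user on this chat server."""

    def __init__(self):
        self._sockets = defaultdict(dict)  # user_id -> {device: inbox}

    def connect(self, user_id, device):
        inbox = []
        self._sockets[user_id][device] = inbox
        return inbox

    def disconnect(self, user_id, device):
        self._sockets[user_id].pop(device, None)

    def deliver(self, user_id, message):
        """Push the message to all of the user's active sockets."""
        inboxes = self._sockets.get(user_id, {})
        for inbox in inboxes.values():
            inbox.append(message)
        return len(inboxes)  # number of devices reached


registry = SessionRegistry()
phone = registry.connect("user_a", "phone")
laptop = registry.connect("user_a", "laptop")

print(registry.deliver("user_a", "hello"))
print(phone, laptop)
```

A return value of 0 tells the caller that no device is connected, in which case delivery falls back to the sync protocol on the next reconnect.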

Sync example: a client whose LastReadID is 5 sends a Sync Request (ID: 5). The Server DB holds messages [6, 7, 8] after that ID, so the server returns those three.

Presence Service

Heartbeat: Client sends a heartbeat every 5s over the WebSocket.
Redis: Presence Service updates “Last Seen” in Redis with TTL = 10s.
If Heartbeat stops, TTL expires -> User is Offline.
Fanout: When User A comes online, we fanout this status to all their friends (Pub/Sub).
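
The heartbeat and TTL logic can be sketched without Redis. In this hypothetical `PresenceService`, a dict of expiry times plays the role of keys with a TTL, and the clock is injectable so the behavior can be demonstrated without waiting.

```python
import time


class PresenceService:
    """Online status as an expiring key, refreshed by heartbeats."""

    def __init__(self, ttl_s=10.0, clock=time.monotonic):
        self._ttl_s = ttl_s
        self._clock = clock
        self._expires_at = {}  # user_id -> time at which the key expires

    def heartbeat(self, user_id):
        # Each heartbeat resets the TTL, like rewriting the key with a new expiry.
        self._expires_at[user_id] = self._clock() + self._ttl_s

    def is_online(self, user_id):
        return self._expires_at.get(user_id, float("-inf")) > self._clock()


# Drive the service with a fake clock (seconds).
now = [0.0]
presence = PresenceService(ttl_s=10.0, clock=lambda: now[0])

presence.heartbeat("alice")       # t = 0s
now[0] = 5.0
presence.heartbeat("alice")       # t = 5s, TTL now runs until t = 15s
now[0] = 12.0
print(presence.is_online("alice"))
now[0] = 15.0
print(presence.is_online("alice"))
```

With a 5s heartbeat and a 10s TTL, a client can miss one heartbeat without being marked offline, which smooths over brief network hiccups.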

10. Interactive Decision Visualizer

Polling vs WebSocket Simulator

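This comparison shows the network overhead of checking for new messages.

Polling: Sends a new HTTP request every second, each with full HTTP headers. High overhead.
Long Polling: Holds each request open until a message arrives or it times out, then asks again. Moderate overhead.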
WebSocket: Keeps a single TCP connection open. Near-zero overhead.


11. Requirements Traceability

Requirement	Design Decision	Justification
Real-Time Delivery	WebSockets	A persistent connection lets the server push each message immediately, which the <100ms target requires.
Message Ordering	Sequence IDs	Monotonic, server-assigned IDs per chat ensure FIFO without trusting clocks.
Scalability (Storage)	HBase / Cassandra	Wide-column stores handle billions of small writes efficiently.
Group Chat	Hybrid Fanout	Push for small groups, Pull for large channels.
Presence	Redis + Heartbeat	Ephemeral keys with TTL map perfectly to “Online Status”.

12. Observability (RED Method)

Rate: Messages Sent per Second.
Errors: Failed Message Deliveries / WebSocket Disconnects.
Duration: End-to-End Latency (Sender -> Server -> Receiver).

Key Metrics:

active_connections: Number of open WebSockets per server (Capacity Planning).
message_queue_depth: If using Kafka for async tasks.

13. Deployment Strategy

Connection Draining: When deploying a new Chat Server, we cannot just kill the old one (it has active WebSockets). We must stop accepting new connections and wait for old ones to close (or force reconnect).
Blue/Green: Essential for zero-downtime updates of stateful services.
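
Connection draining can be sketched as a small state machine. `DrainableServer` and its method names are hypothetical; a real deployment would also need the load balancer to stop routing new clients to the old server.

```python
class DrainableServer:
    """A chat server that can stop taking new connections before shutdown."""

    def __init__(self):
        self.accepting = True
        self.connections = set()

    def accept(self, conn_id):
        if not self.accepting:
            return False  # client should be routed to a new server
        self.connections.add(conn_id)
        return True

    def close(self, conn_id):
        self.connections.discard(conn_id)

    def start_drain(self):
        self.accepting = False

    def force_reconnect(self):
        """Drop whoever is left; clients reconnect elsewhere and re-sync."""
        dropped = sorted(self.connections)
        self.connections.clear()
        return dropped

    def safe_to_stop(self):
        return not self.accepting and not self.connections


old = DrainableServer()
old.accept("conn-1")
old.accept("conn-2")

old.start_drain()
print(old.accept("conn-3"))      # refused during drain
old.close("conn-1")              # one client leaves on its own
print(old.force_reconnect())     # the rest are told to reconnect
print(old.safe_to_stop())
```

Forcing a reconnect is safe in this design because a reconnecting device sends its LastReadID and recovers anything it missed.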

14. Interview Gauntlet

Q1: How do you handle “Read Receipts” in a Group of 500?

Answer: We do NOT send a separate event for every read. We aggregate them. The client sends “Read up to ID 100”. The server updates the “Read Watermark” for that user in that group.
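
The watermark idea can be sketched as one number per user per group. `ReadWatermarks` is a hypothetical class; the key point is that "Read up to ID 100" overwrites a single value instead of creating one record per message.

```python
class ReadWatermarks:
    """Stores, per (group, user), the highest message ID the user has read."""

    def __init__(self):
        self._watermark = {}  # (group_id, user_id) -> highest read msg_id

    def mark_read(self, group_id, user_id, up_to_id):
        key = (group_id, user_id)
        # Never move backwards if an older update arrives late.
        self._watermark[key] = max(self._watermark.get(key, 0), up_to_id)

    def read_by(self, group_id, msg_id, member_ids):
        """Members whose watermark has reached or passed msg_id."""
        return [
            user_id
            for user_id in member_ids
            if self._watermark.get((group_id, user_id), 0) >= msg_id
        ]


marks = ReadWatermarks()
marks.mark_read("group-1", "alice", 100)
marks.mark_read("group-1", "bob", 42)
marks.mark_read("group-1", "alice", 90)   # stale update, ignored

print(marks.read_by("group-1", 50, ["alice", "bob", "carol"]))
```

Storage stays at one entry per member no matter how many messages the group exchanges.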

Q2: What happens if a user is offline?

Answer: The message is stored in the DB (HBase). When the user connects, they send their LastReadID. The server queries HBase for WHERE chat_id = X AND msg_id > LastReadID.

Q3: How do you store images?

Answer: Never over the WebSocket. The client uploads the image to S3 over HTTP, gets back a URL, and sends that URL in the WebSocket message.
