Design Facebook Messenger / WhatsApp
1. What is a Chat System?
A chat system allows users to send text, images, and videos to each other in real-time. It supports 1-on-1 conversations and Group chats.
Real-World Examples
- WhatsApp / Signal: Store messages on device (mostly). End-to-end encrypted. Architecture is “Store-and-Forward”.
- FB Messenger / Slack: Store messages on server. Multi-device sync is critical. We will focus on this Cloud-Based architecture.
Try it yourself
Open WhatsApp Web and your phone. Send a message from your phone. It appears on the Web instantly. Turn off your phone’s internet. The Web version says “Phone not connected” (for WhatsApp) or continues working (for Messenger). Why?
2. Requirements & Goals
Functional Requirements
- 1-on-1 Chat: Low latency delivery.
- Group Chat: Up to 500 members.
- Presence: “Online”, “Last Seen”, “Typing…”.
- Receipts: Sent, Delivered, Read.
- Multi-Device: Sync messages across phone and desktop.
Non-Functional Requirements
- Low Latency: Real-time experience is key (< 100ms).
- Consistency: Messages must appear in order (FIFO).
- Availability: High.
- Security:
  - TLS: Encryption in transit.
  - E2E (End-to-End): Using the Signal Protocol (Double Ratchet Algorithm). Keys are stored on user devices, not servers. The server only sees encrypted blobs.
3. Capacity Estimation
- DAU: 2 Billion.
- Messages: 100 Billion per day.
- Storage:
  - Avg msg size = 100 Bytes.
  - 100 billion messages * 100 Bytes = 10 TB / day.
  - 5 Years = 10 TB * 365 * 5 ≈ 18 PB (before replication and indexes).
- We need a massive, write-heavy database.
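The arithmetic above can be checked with a few lines of Python. This is only a back-of-envelope sketch: it uses decimal units (1 TB = 10^12 bytes), ignores leap days, and adds an average write rate that the section does not state but that follows directly from the daily message count.

```python
# Back-of-envelope storage estimate using the figures from this section.
MESSAGES_PER_DAY = 100 * 10**9   # 100 billion messages per day
AVG_MESSAGE_BYTES = 100          # average message size in bytes
RETENTION_DAYS = 5 * 365         # 5 years, ignoring leap days

TB = 10**12  # decimal terabyte
PB = 10**15  # decimal petabyte

daily_bytes = MESSAGES_PER_DAY * AVG_MESSAGE_BYTES
total_bytes = daily_bytes * RETENTION_DAYS
avg_writes_per_second = MESSAGES_PER_DAY / 86_400  # seconds per day

print(f"Daily storage:  {daily_bytes / TB:.0f} TB")
print(f"5-year storage: {total_bytes / PB:.2f} PB")
print(f"Average writes: {avg_writes_per_second:,.0f} msgs/s")
```

The average of roughly 1.16 million writes per second is a floor; peak traffic will be higher, which is why the database choice below is driven by write throughput.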
4. System APIs
We need a mix of REST (for profile/auth) and WebSocket (for chat).
REST API (HTTP)
| Method | Endpoint | Description |
|:---|:---|:---|
| POST | /v1/chat/create | Start a new chat (returns chat_id). |
| POST | /v1/group/add_member | Add user to group. |
WebSocket Events (Bidirectional)
- sendMessage(chat_id, content)
- receiveMessage(sender_id, content)
- userStatusChanged(user_id, status)
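One way to carry these three events over a single WebSocket is a small JSON envelope. The sketch below is an assumption, not a documented wire format: the `type` and `payload` field names and the `encode`/`decode` helpers are hypothetical, and only the event names and their parameters come from the list above.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class SendMessage:
    chat_id: str
    content: str

@dataclass
class ReceiveMessage:
    sender_id: str
    content: str

@dataclass
class UserStatusChanged:
    user_id: str
    status: str

# Event name on the wire -> Python type (names taken from the event list).
EVENT_TYPES = {
    "sendMessage": SendMessage,
    "receiveMessage": ReceiveMessage,
    "userStatusChanged": UserStatusChanged,
}

def encode(event) -> str:
    """Serialize an event into a JSON text frame (hypothetical envelope)."""
    name = next(n for n, cls in EVENT_TYPES.items() if isinstance(event, cls))
    return json.dumps({"type": name, "payload": asdict(event)})

def decode(frame: str):
    """Parse a JSON text frame back into the matching event object."""
    data = json.loads(frame)
    return EVENT_TYPES[data["type"]](**data["payload"])

frame = encode(SendMessage(chat_id="chat-123", content="hello"))
print(frame)
print(decode(frame))
```

A tagged envelope like this lets both directions share one socket, since the receiver dispatches on `type` rather than on which side sent the frame.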
5. Database Design
We need a database that handles extremely high write throughput and efficient range queries (to load chat history).
MySQL?
- Possible, but index maintenance becomes heavy at 100B writes/day.
HBase / Cassandra (Wide Column Store)
- Winner for this workload.
- Key: chat_id (Partition Key).
- Clustering Key: message_id (Sort Key, Snowflake).
- Value: Message content.
- Allows very fast “Get me the last 50 messages for Chat 123” queries.
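To see why this key layout makes "last 50 messages" cheap, here is a toy in-memory stand-in for the table. It is not HBase or Cassandra code; the `MessageTable` class and its methods are hypothetical and only mimic the idea that rows in one partition (`chat_id`) are kept sorted by the clustering key (`message_id`).

```python
import bisect
from collections import defaultdict

class MessageTable:
    """Toy model of a wide-column table keyed by (chat_id, message_id)."""

    def __init__(self):
        self._ids = defaultdict(list)  # chat_id -> sorted list of message_ids
        self._rows = {}                # (chat_id, message_id) -> content

    def insert(self, chat_id, message_id, content):
        # Keep each partition sorted by message_id, like a clustering key.
        if (chat_id, message_id) not in self._rows:
            bisect.insort(self._ids[chat_id], message_id)
        self._rows[(chat_id, message_id)] = content

    def latest(self, chat_id, limit=50):
        """Return up to `limit` newest messages, newest first."""
        ids = self._ids.get(chat_id, [])[-limit:]
        return [(i, self._rows[(chat_id, i)]) for i in reversed(ids)]

table = MessageTable()
for message_id in (3, 1, 2):
    table.insert("chat-123", message_id, f"message {message_id}")
print(table.latest("chat-123", limit=2))
```

Because a chat's messages live together and are already ordered, the query is a slice off the end of one partition rather than a scan or sort across the whole dataset.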
6. High-Level Design
[Figure: Real-Time Message Routing Architecture. Labels: User A, User B, a component that maps User -> Server, and a Message Sync Log.]
7. Component Design: Connection Protocols
How do we keep the connection open? See Polling vs Push.
A. Polling (The Old Way)
- Client asks “Any new messages?” every 1 second.
- Pros: Simple HTTP.
- Cons: Wasted resources. Server load is high even if no one is talking. High latency.
B. Long Polling
- Client asks “Any new messages?”. Server holds the connection open until a message arrives (or timeout).
- Pros: Less load than polling.
- Cons: Connection setup overhead is still there.
C. WebSockets (The Modern Way)
- Bidirectional, persistent TCP connection.
- Server can push to Client instantly.
- Pros: Lowest latency, lowest overhead.
- Cons: Needs stateful servers (Server must know who is connected to it).
8. Group Chat Optimization
Group chats introduce a “Fanout” problem.
Small Groups (≤ 500 members)
- Push Model: When User A sends a message, the server loops through all 500 members, finds their WebSocket connections, and pushes the message.
- Why: 500 lookups is fast.
Large Groups / Channels (> 5000 members)
- Pull Model (or Hybrid): We don’t push to everyone.
- Online users might be listening to a Pub/Sub channel (e.g., Kafka topic).
- Inactive users will just fetch the history when they open the app next time.
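The push/pull split can be sketched as a single routing function. This is a simplified illustration: the section leaves group sizes between 500 and 5000 undefined, so the single `push_threshold` cutoff is an assumption, and plain Python lists stand in for WebSocket connections (`outboxes`) and for the Pub/Sub channel (`channel`).

```python
def deliver(message, members, outboxes, channel, push_threshold=500):
    """Route one group message using push for small groups, pull otherwise.

    outboxes: user_id -> list standing in for that user's open WebSocket.
              A missing key means the user is offline.
    channel:  list standing in for a Pub/Sub topic that online users read.
    """
    if len(members) <= push_threshold:
        pushed = 0
        for user_id in members:
            outbox = outboxes.get(user_id)
            if outbox is not None:       # offline users sync from history later
                outbox.append(message)
                pushed += 1
        return ("push", pushed)
    # Large group: publish once; subscribers pull from the channel.
    channel.append(message)
    return ("pull", 0)

outboxes = {"alice": [], "bob": []}
channel = []
print(deliver("hi team", ["alice", "bob", "carol"], outboxes, channel))
```

In the push branch the server's work grows with group size, while in the pull branch it stays constant per message, which is the trade-off the two bullets above describe.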
9. Message Synchronization & Reliability
How do we ensure messages are delivered in order and synced across devices?
Sequence IDs & Multi-Device Sync
We cannot rely on timestamp (clock skew). We use Sequence IDs per chat.
The Sync Protocol:
- State: Each device (Phone, Laptop) maintains a local LastReadID.
- Reconnect: When a device connects, it sends SYNC(LastReadID).
- Fetch: Server queries HBase/Cassandra: SELECT * FROM msgs WHERE chat_id=123 AND msg_id > LastReadID.
- Forward: Server pushes missing messages to that specific socket.
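The four steps of the sync protocol can be sketched with an in-memory store. The `sync` function and `Device` class are hypothetical names for illustration, and a nested dict stands in for the HBase/Cassandra query; the only behavior taken from the text is "return every message with msg_id greater than LastReadID".

```python
def sync(store, chat_id, last_read_id):
    """Server side: return messages newer than last_read_id, oldest first."""
    messages = store.get(chat_id, {})
    return [(mid, body) for mid, body in sorted(messages.items())
            if mid > last_read_id]

class Device:
    """Client side: tracks its own LastReadID independently of other devices."""

    def __init__(self):
        self.last_read_id = 0
        self.messages = []

    def reconnect(self, store, chat_id):
        # Send SYNC(LastReadID), then apply whatever the server forwards.
        for mid, body in sync(store, chat_id, self.last_read_id):
            self.messages.append(body)
            self.last_read_id = mid

store = {"123": {1: "hi", 2: "are you there?", 3: "call me"}}
laptop = Device()
laptop.last_read_id = 1          # laptop already saw message 1
laptop.reconnect(store, "123")
print(laptop.messages, laptop.last_read_id)
```

Because each device keeps its own cursor, reconnecting twice is harmless: the second call finds nothing newer and forwards nothing.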
The Sequence ID Solution
Why not just use time? Because server clocks drift.
- Logical Clocks: A Sequence ID is a counter (1, 2, 3…) unique to a Chat Room.
- Implementation: This is tricky. You can’t use a global counter (slow).
- Solution: Since one person writes at a time (usually), the client can propose a temp ID, and the server assigns the final authoritative ID.
- K-V Store: We store Current_Max_ID in Redis for each Chat ID. INCR is atomic.
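A minimal sketch of the per-chat counter follows. It does not talk to Redis; a lock-protected dictionary stands in for the atomic `INCR` on a per-chat key, and the `SequenceAllocator` class, `accept_message` function, and returned field names are all hypothetical.

```python
import threading
from collections import defaultdict

class SequenceAllocator:
    """In-memory stand-in for an atomic per-chat counter (e.g. Redis INCR)."""

    def __init__(self):
        self._counters = defaultdict(int)
        self._lock = threading.Lock()

    def next_id(self, chat_id):
        # The lock makes read-increment-return atomic, as INCR would be.
        with self._lock:
            self._counters[chat_id] += 1
            return self._counters[chat_id]

def accept_message(allocator, chat_id, temp_id, content):
    """Server assigns the authoritative ID; temp_id lets the client match it."""
    return {
        "chat_id": chat_id,
        "temp_id": temp_id,                    # proposed by the client
        "msg_id": allocator.next_id(chat_id),  # final, server-assigned
        "content": content,
    }

allocator = SequenceAllocator()
print(accept_message(allocator, "chat-123", "tmp-1", "hello"))
print(accept_message(allocator, "chat-123", "tmp-2", "world"))
```

Counters are independent per chat, so there is no global bottleneck: two busy chats never contend for the same key.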
Active Session Sync:
- If User A is online on both Phone and Laptop.
- The Chat Server detects two active WebSocket connections for UserA.
- When a message arrives for UserA, the server fans out the message to all active sockets.
- Both devices receive the message instantly.
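The per-user fanout can be sketched as a registry that maps each user to all of their open sockets. Everything here is illustrative: `ConnectionRegistry` and `FakeSocket` are hypothetical, and `FakeSocket.send` just records messages instead of writing to a real WebSocket.

```python
from collections import defaultdict

class FakeSocket:
    """Stand-in for a WebSocket; records what would have been sent."""

    def __init__(self, name):
        self.name = name
        self.sent = []

    def send(self, message):
        self.sent.append(message)

class ConnectionRegistry:
    """Tracks every active socket per user on this chat server."""

    def __init__(self):
        self._sockets = defaultdict(list)

    def connect(self, user_id, socket):
        self._sockets[user_id].append(socket)

    def disconnect(self, user_id, socket):
        self._sockets[user_id].remove(socket)
        if not self._sockets[user_id]:
            del self._sockets[user_id]

    def deliver(self, user_id, message):
        """Push to all of the user's devices; return how many received it."""
        sockets = list(self._sockets.get(user_id, []))
        for socket in sockets:
            socket.send(message)
        return len(sockets)

registry = ConnectionRegistry()
phone, laptop = FakeSocket("phone"), FakeSocket("laptop")
registry.connect("UserA", phone)
registry.connect("UserA", laptop)
print(registry.deliver("UserA", "hello"))  # number of devices reached
```

A return value of zero means the user has no live sockets on this server, at which point delivery falls back to the reconnect-time sync described earlier.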
Presence Service
- Heartbeat: Client sends a heartbeat every 5s over the WebSocket.
- Redis: Presence Service updates “Last Seen” in Redis with TTL = 10s.
- If Heartbeat stops, TTL expires -> User is Offline.
- Fanout: When User A comes online, we fanout this status to all their friends (Pub/Sub).
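The heartbeat-plus-TTL idea can be sketched without Redis by storing an expiry time per user. `PresenceStore` is a hypothetical class; the 10-second TTL and 5-second heartbeat come from the bullets above, and the injectable `clock` parameter is only there so the behavior can be demonstrated without sleeping.

```python
import time

class PresenceStore:
    """In-memory stand-in for Redis keys with a TTL."""

    def __init__(self, ttl_s=10.0, clock=time.monotonic):
        self._ttl = ttl_s
        self._clock = clock
        self._expires_at = {}  # user_id -> time at which the key expires

    def heartbeat(self, user_id):
        # Each heartbeat refreshes the key, like SET with a fresh expiry.
        self._expires_at[user_id] = self._clock() + self._ttl

    def is_online(self, user_id):
        expires_at = self._expires_at.get(user_id)
        return expires_at is not None and self._clock() < expires_at

now = [0.0]                                  # fake clock for the demo
presence = PresenceStore(clock=lambda: now[0])
presence.heartbeat("alice")
now[0] = 5.0
print(presence.is_online("alice"))           # heartbeat was 5s ago
now[0] = 10.0
print(presence.is_online("alice"))           # no heartbeat for 10s
```

Setting the TTL to twice the heartbeat interval tolerates one lost heartbeat before the user flips to offline.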
10. Interactive Decision Visualizer
Polling vs WebSocket Simulator
This simulator shows the network overhead of checking for new messages.
- Polling: Creates a new HTTP connection every second. High overhead.
- Long Polling: Holds connection open.
- WebSocket: Keeps a single TCP connection open. Near-zero overhead.
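The comparison can also be approximated numerically. The function below is a rough model under stated assumptions, not a measurement: it counts HTTP requests or handshakes only, uses the 1-second poll interval from Section 7, assumes a 30-second long-poll timeout (not given in the text), and treats the long-polling figure as an upper bound.

```python
import math

def requests_needed(protocol, duration_s, messages,
                    poll_interval_s=1.0, long_poll_timeout_s=30.0):
    """Estimate requests a client issues to receive `messages` in `duration_s`."""
    if protocol == "polling":
        # One request per interval, whether or not anything arrived.
        return math.ceil(duration_s / poll_interval_s)
    if protocol == "long_polling":
        # Upper bound: a new request after each message and after each timeout.
        return messages + math.ceil(duration_s / long_poll_timeout_s)
    if protocol == "websocket":
        # A single upgrade handshake; messages then flow on the open socket.
        return 1
    raise ValueError(f"unknown protocol: {protocol}")

for protocol in ("polling", "long_polling", "websocket"):
    print(protocol, requests_needed(protocol, duration_s=60, messages=2))
```

For a mostly idle minute with two incoming messages, the model gives 60 requests for polling, at most 4 for long polling, and 1 for WebSocket.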
11. Requirements Traceability
| Requirement | Design Decision | Justification |
|---|---|---|
| Real-Time Delivery | WebSockets | A persistent connection avoids per-message connection setup, which keeps delivery within the <100ms target. |
| Message Ordering | Sequence IDs | Per-chat sequence IDs give FIFO order within each chat without relying on synchronized clocks. |
| Scalability (Storage) | HBase / Cassandra | Wide-column stores handle billions of small writes efficiently. |
| Group Chat | Hybrid Fanout | Push for small groups, Pull for large channels. |
| Presence | Redis + Heartbeat | Ephemeral keys with TTL map perfectly to “Online Status”. |
12. Observability (RED Method)
- Rate: Messages Sent per Second.
- Errors: Failed Message Deliveries / WebSocket Disconnects.
- Duration: End-to-End Latency (Sender -> Server -> Receiver).
Key Metrics:
- active_connections: Number of open WebSockets per server (Capacity Planning).
- message_queue_depth: If using Kafka for async tasks.
13. Deployment Strategy
- Connection Draining: When deploying a new Chat Server, we cannot just kill the old one (it has active WebSockets). We must stop accepting new connections and wait for old ones to close (or force reconnect).
- Blue/Green: Essential for zero-downtime updates of stateful services.
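Connection draining can be sketched as a three-phase shutdown: stop accepting, wait, then force the stragglers to reconnect elsewhere. The `ChatServer` class, the `wait_for_close` callback, and the `max_rounds` bound are all hypothetical; a real server would wait on a timer and send a close frame rather than clearing a set.

```python
class ChatServer:
    """Toy model of a stateful server holding WebSocket connections."""

    def __init__(self):
        self.accepting = True
        self.connections = set()

    def accept(self, conn_id):
        if not self.accepting:
            return False          # load balancer should route elsewhere
        self.connections.add(conn_id)
        return True

    def close(self, conn_id):
        self.connections.discard(conn_id)

    def drain(self, wait_for_close, max_rounds):
        """Stop accepting, wait up to max_rounds, then force-close the rest.

        Returns the connection ids that had to be forced to reconnect.
        """
        self.accepting = False
        for _ in range(max_rounds):
            if not self.connections:
                break
            wait_for_close(self)  # stand-in for sleeping while clients leave
        forced = sorted(self.connections)
        self.connections.clear()
        return forced

server = ChatServer()
for conn_id in ("a", "b", "c"):
    server.accept(conn_id)
# During the wait, only connection "a" closes on its own.
print(server.drain(lambda s: s.close("a"), max_rounds=2))
print(server.accept("d"))
```

Clients that are force-closed reconnect to a new server and recover missed messages through the LastReadID sync, so draining does not lose data.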
14. Interview Gauntlet
Q1: How do you handle “Read Receipts” in a Group of 500?
- Answer: We do NOT send a separate event for every read. We aggregate them. The client sends “Read up to ID 100”. The server updates the “Read Watermark” for that user in that group.
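The watermark approach can be sketched as one integer per (group, user) pair. `ReadWatermarks` and its method names are hypothetical, and a dictionary stands in for whatever store holds the watermarks.

```python
class ReadWatermarks:
    """Stores the highest message ID each user has read in each group."""

    def __init__(self):
        self._marks = {}  # (group_id, user_id) -> highest read message ID

    def mark_read(self, group_id, user_id, up_to_id):
        # Watermarks only move forward; late or duplicate updates are ignored.
        key = (group_id, user_id)
        self._marks[key] = max(self._marks.get(key, 0), up_to_id)

    def watermark(self, group_id, user_id):
        return self._marks.get((group_id, user_id), 0)

    def read_count(self, group_id, msg_id):
        """How many members have read msg_id (watermark at or past it)."""
        return sum(1 for (g, _), mark in self._marks.items()
                   if g == group_id and mark >= msg_id)

marks = ReadWatermarks()
marks.mark_read("group-1", "alice", 100)   # "Read up to ID 100"
marks.mark_read("group-1", "bob", 40)
print(marks.read_count("group-1", 50))     # only alice has passed ID 50
```

One update per user covers any number of messages, so a member catching up on 100 messages produces a single write instead of 100 read events.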
Q2: What happens if a user is offline?
- Answer: The message is stored in the DB (HBase). When the user connects, they send their LastReadID. The server queries HBase for WHERE chat_id = X AND msg_id > LastReadID.
Q3: How do you store images?
- Answer: Never in the WebSocket. Upload to S3 (HTTP) -> Get URL -> Send URL in WebSocket message.
15. Whiteboard Summary
Real-Time Arch
- Protocol: WebSockets (Stateful).
- Discovery: Zookeeper (User -> Server Map).
- Sync: Sequence IDs (Logical Clock).
Storage & Ops
- DB: HBase (Write-Heavy, Range Scan).
- Presence: Redis (Heartbeat TTL).
- Media: Upload to S3 (HTTP), send the URL over WebSocket.