Design Facebook Messenger / WhatsApp

1. What is a Chat System?

A chat system allows users to send text, images, and videos to each other in real-time. It supports 1-on-1 conversations and Group chats.

Real-World Examples

  • WhatsApp / Signal: Store messages on device (mostly). End-to-end encrypted. Architecture is “Store-and-Forward”.
  • FB Messenger / Slack: Store messages on server. Multi-device sync is critical. We will focus on this Cloud-Based architecture.

Try it yourself

Open WhatsApp Web and your phone. Send a message from your phone. It appears on the Web instantly. Turn off your phone’s internet. The Web version says “Phone not connected” (for WhatsApp) or continues working (for Messenger). Why?


2. Requirements & Goals

Functional Requirements

  • 1-on-1 Chat: Low latency delivery.
  • Group Chat: Up to 500 members.
  • Presence: “Online”, “Last Seen”, “Typing…”.
  • Receipts: Sent, Delivered, Read.
  • Multi-Device: Sync messages across phone and desktop.

Non-Functional Requirements

  • Low Latency: Real-time experience is key (< 100ms).
  • Consistency: Messages must appear in order (FIFO).
  • Availability: High.
  • Security:
    • TLS: Encryption in transit.
    • E2E (End-to-End): Using the Signal Protocol (Double Ratchet Algorithm). Keys are stored on user devices, not servers. The server only sees encrypted blobs.

3. Capacity Estimation

  • DAU: 2 Billion.
  • Messages: 100 Billion per day.
  • Storage:
    • Avg msg size = 100 Bytes.
    • 100 billion messages * 100 Bytes = 10 TB / day.
    • 5 Years = 10 TB * 365 * 5 ≈ 18 PB (before replication and metadata).
    • We need a massive, write-heavy database.
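
The estimate above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes decimal units (1 TB = 10^12 bytes) and ignores replication, indexes, and per-message metadata; the constant names are illustrative, not from any real system.

```python
MESSAGES_PER_DAY = 100 * 10**9   # 100 billion messages per day
AVG_MESSAGE_BYTES = 100          # average message size from the estimate
SECONDS_PER_DAY = 24 * 60 * 60
TB = 10**12                      # assumption: decimal terabytes
PB = 10**15

daily_bytes = MESSAGES_PER_DAY * AVG_MESSAGE_BYTES
five_year_bytes = daily_bytes * 365 * 5
avg_writes_per_second = MESSAGES_PER_DAY / SECONDS_PER_DAY

print(f"per day: {daily_bytes / TB:.0f} TB")
print(f"5 years: {five_year_bytes / PB:.2f} PB")
print(f"average write rate: {avg_writes_per_second:,.0f} messages/s")
```

The average works out to roughly 1.16 million writes per second, and peak traffic will be higher, which is what drives the database choice in the next sections.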

4. System APIs

We need a mix of REST (for profile/auth) and WebSocket (for chat).

REST API (HTTP)

| Method | Endpoint | Description |
|:-------|:---------|:------------|
| POST | /v1/chat/create | Start a new chat (returns chat_id). |
| POST | /v1/group/add_member | Add user to group. |

WebSocket Events (Bidirectional)

  • sendMessage(chat_id, content)
  • receiveMessage(sender_id, content)
  • userStatusChanged(user_id, status)
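
One way to put these events on the wire is a small JSON envelope with a type tag. The sketch below is an assumption for illustration: the source only names the events and their fields, so the envelope shape ("type" plus "payload") and the helper names encode/decode are hypothetical.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class SendMessage:
    chat_id: str
    content: str


@dataclass
class ReceiveMessage:
    sender_id: str
    content: str


@dataclass
class UserStatusChanged:
    user_id: str
    status: str


# Event name on the wire -> payload class (names taken from the list above).
EVENT_TYPES = {
    "sendMessage": SendMessage,
    "receiveMessage": ReceiveMessage,
    "userStatusChanged": UserStatusChanged,
}


def encode(event) -> str:
    """Serialize an event into a JSON text frame (hypothetical envelope)."""
    name = next(n for n, cls in EVENT_TYPES.items() if type(event) is cls)
    return json.dumps({"type": name, "payload": asdict(event)})


def decode(frame: str):
    """Parse a JSON text frame back into the matching event object."""
    data = json.loads(frame)
    return EVENT_TYPES[data["type"]](**data["payload"])


frame = encode(SendMessage(chat_id="chat-123", content="hello"))
event = decode(frame)
```

A type tag keeps a single socket usable for chat, presence, and receipts without opening one connection per feature.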

5. Database Design

We need a database that handles extremely high write throughput and efficient range queries (to load chat history).

MySQL?

  • Possible, but index maintenance becomes heavy at 100B writes/day.

HBase / Cassandra (Wide Column Store)

  • Preferred choice for this design.
  • Key: chat_id (Partition Key).
  • Clustering Key: message_id (Sort Key, Snowflake).
  • Value: Message content.
  • Allows very fast “Get me the last 50 messages for Chat 123” queries.
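
To see why this key layout makes history reads cheap, here is a minimal in-memory model of it. It is only a sketch of the access pattern, not HBase or Cassandra client code; the class name MessageStore and its methods are hypothetical.

```python
import bisect
from collections import defaultdict


class MessageStore:
    """Toy wide-column layout: one partition per chat_id, rows sorted by message_id."""

    def __init__(self):
        self._ids = defaultdict(list)       # chat_id -> sorted message_ids
        self._contents = defaultdict(list)  # chat_id -> contents in the same order

    def insert(self, chat_id, message_id, content):
        ids = self._ids[chat_id]
        pos = bisect.bisect_left(ids, message_id)  # keep the partition sorted
        ids.insert(pos, message_id)
        self._contents[chat_id].insert(pos, content)

    def latest(self, chat_id, limit=50):
        """'Get me the last N messages for this chat': a slice at the tail."""
        return list(zip(self._ids[chat_id][-limit:], self._contents[chat_id][-limit:]))

    def after(self, chat_id, last_id):
        """Range scan: every row with message_id greater than last_id."""
        pos = bisect.bisect_right(self._ids[chat_id], last_id)
        return list(zip(self._ids[chat_id][pos:], self._contents[chat_id][pos:]))


store = MessageStore()
for message_id, content in [(3, "c"), (1, "a"), (2, "b")]:
    store.insert("chat-123", message_id, content)
```

Because all rows of one chat live together and stay sorted, both reads touch a single contiguous range instead of scanning an index across every chat.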

6. High-Level Design

Real-Time Message Routing Architecture.

[Diagram: System Architecture: Real-Time Chat System. Clients (Sender, Receiver) hold WebSocket connections through a Load Balancer to stateful Chat Servers: Chat Server 1 holds User A's connection, Chat Server 2 holds User B's. Service Discovery (Etcd / ZooKeeper) maps User -> Server. HBase / Cassandra (Wide Column Store) holds the Message Sync Log. Message route: the sender's Chat Server asks Service Discovery "Where is User B?", forwards the message to User B's server, and persists it asynchronously.]

7. Component Design: Connection Protocols

How do we keep the connection open? See Polling vs Push.

A. Polling (The Old Way)

  • Client asks “Any new messages?” every 1 second.
  • Pros: Simple HTTP.
  • Cons: Wasted resources. Server load is high even if no one is talking. High latency.

B. Long Polling

  • Client asks “Any new messages?”. Server holds the connection open until a message arrives (or timeout).
  • Pros: Less load than polling.
  • Cons: Connection setup overhead is still there.

C. WebSockets (The Modern Way)

  • Bidirectional, persistent TCP connection.
  • Server can push to Client instantly.
  • Pros: Lowest latency, lowest overhead.
  • Cons: Needs stateful servers (Server must know who is connected to it).
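
Because each server only knows its own sockets, delivering a message means first finding which server holds the recipient's connection. The sketch below illustrates that lookup-and-forward step from the high-level design. It assumes a plain dict in place of Etcd / ZooKeeper and Python lists in place of sockets and the message log; all class and function names are hypothetical.

```python
class ChatServer:
    """Stateful server: remembers which users hold a socket on it."""

    def __init__(self, name):
        self.name = name
        self.sockets = {}  # user_id -> list standing in for a WebSocket

    def connect(self, user_id, registry):
        self.sockets[user_id] = []
        registry[user_id] = self  # discovery entry: User -> Server

    def push(self, user_id, message):
        self.sockets[user_id].append(message)


def route(registry, message_log, recipient_id, message):
    """Persist the message, then forward it if the recipient is connected."""
    message_log.append((recipient_id, message))  # async in the design, inline here
    server = registry.get(recipient_id)          # "Where is User B?"
    if server is None:
        return False                             # offline: message stays in the log
    server.push(recipient_id, message)
    return True


registry, message_log = {}, []
server_1, server_2 = ChatServer("chat-1"), ChatServer("chat-2")
server_1.connect("user_a", registry)
server_2.connect("user_b", registry)
delivered = route(registry, message_log, "user_b", "hello")
```

Persisting regardless of delivery is what lets an offline user catch up later from the log.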

8. Group Chat Optimization

Group chats introduce a “Fanout” problem.

Small Groups (up to 500 members)

  • Push Model: When User A sends a message, the server loops through all 500 members, finds their WebSocket connections, and pushes the message.
  • Why: At most 500 connection lookups per message is cheap.

Large Groups / Channels (> 5000 members)

  • Pull Model (or Hybrid): We don’t push to everyone.
  • Online users might be listening to a Pub/Sub channel (e.g., Kafka topic).
  • Inactive users will just fetch the history when they open the app next time.
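
The two strategies can be combined behind one function that switches on group size. This is a sketch under assumptions: the cut-over point is a tunable parameter (the text leaves sizes between 500 and 5000 unspecified), sockets are modeled as lists, and a dict of lists stands in for a Pub/Sub topic such as Kafka.

```python
from collections import defaultdict

PUSH_LIMIT = 500  # assumption: cut-over point, tune per deployment


def fan_out(group_id, members, sockets, topics, message, push_limit=PUSH_LIMIT):
    """Push to each online member of a small group, else publish once for pull."""
    if len(members) <= push_limit:
        pushed = 0
        for user_id in members:
            socket = sockets.get(user_id)  # None means the member is offline
            if socket is not None:
                socket.append(message)
                pushed += 1
        return ("push", pushed)
    # Large group: one write to the channel; online subscribers read from it,
    # inactive members fetch history when they next open the app.
    topics[group_id].append(message)
    return ("pull", 1)


sockets = {"a": [], "b": []}  # only "a" and "b" are online
topics = defaultdict(list)
small_result = fan_out("g1", ["a", "b", "c"], sockets, topics, "hi")
large_result = fan_out("g2", [f"u{i}" for i in range(6000)], sockets, topics, "news")
```

The cost of the push branch grows with group size, while the pull branch stays at one write per message, which is the reason to switch.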

9. Message Synchronization & Reliability

How do we ensure messages are delivered in order and synced across devices?

Sequence IDs & Multi-Device Sync

We cannot rely on timestamp (clock skew). We use Sequence IDs per chat.

The Sync Protocol:

  1. State: Each device (Phone, Laptop) maintains a local LastReadID.
  2. Reconnect: When a device connects, it sends SYNC(LastReadID).
  3. Fetch: Server queries HBase/Cassandra: SELECT * FROM msgs WHERE chat_id=123 AND msg_id > LastReadID.
  4. Forward: Server pushes missing messages to that specific socket.
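
The four steps can be sketched as follows. This is an in-memory illustration only: a dict of lists stands in for the HBase/Cassandra table, and the Device and Server classes and their method names are hypothetical.

```python
class Server:
    def __init__(self):
        self.msgs = {}  # chat_id -> list of (msg_id, content), ascending by msg_id

    def append(self, chat_id, msg_id, content):
        self.msgs.setdefault(chat_id, []).append((msg_id, content))

    def fetch_after(self, chat_id, last_read_id):
        # Equivalent of: WHERE chat_id = ? AND msg_id > LastReadID
        return [(i, c) for i, c in self.msgs.get(chat_id, []) if i > last_read_id]


class Device:
    def __init__(self, name, last_read_id=0):
        self.name = name
        self.last_read_id = last_read_id  # step 1: per-device state
        self.inbox = []

    def sync(self, server, chat_id):
        """Steps 2-4: send LastReadID, receive the gap, advance the cursor."""
        missing = server.fetch_after(chat_id, self.last_read_id)
        for msg_id, content in missing:
            self.inbox.append(content)
            self.last_read_id = msg_id
        return len(missing)


server = Server()
for msg_id in range(1, 9):
    server.append("chat-123", msg_id, f"message {msg_id}")

phone = Device("phone", last_read_id=5)  # already has messages 1..5
laptop = Device("laptop")                # fresh device, has nothing
phone_fetched = phone.sync(server, "chat-123")
laptop_fetched = laptop.sync(server, "chat-123")
```

Each device keeps its own cursor, so the phone fetches only messages 6 to 8 while the laptop fetches the full history.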

The Sequence ID Solution

Why not just use time? Because server clocks drift.

  • Logical Clocks: A Sequence ID is a counter (1, 2, 3…) unique to a Chat Room.
  • Implementation: This is tricky. A single global counter would serialize every write in the system, so the counter is scoped to one chat.
  • Solution: Writes within one chat are rarely concurrent, so the client proposes a temporary ID and the server assigns the final, authoritative ID.
  • K-V Store: We store Current_Max_ID in Redis for each Chat ID. INCR is atomic.

Active Session Sync:

  • If User A is online on both Phone and Laptop, the Chat Server sees two active WebSocket connections for User A.
  • When a message arrives for User A, the server fans out the message to all active sockets.
  • Both devices receive the message instantly.

[Diagram: Client State is ID 5. The client sends a Sync Request (ID: 5). The Server DB holds [6, 7, 8], the messages the client is missing.]

Presence Service

  • Heartbeat: Client sends a heartbeat every 5s over the WebSocket.
  • Redis: Presence Service updates “Last Seen” in Redis with TTL = 10s.
  • If Heartbeat stops, TTL expires -> User is Offline.
  • Fanout: When User A comes online, we fanout this status to all their friends (Pub/Sub).
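
The heartbeat-plus-TTL rule can be sketched without Redis by storing an expiry time per user. This is an illustration under assumptions: the clock is passed in explicitly so the behavior is deterministic, and PresenceService with its methods is a hypothetical name, not a real client API.

```python
class PresenceService:
    """Tracks online status with an expiring entry per user (TTL semantics)."""

    def __init__(self, ttl_seconds=10):
        self.ttl = ttl_seconds
        self._expires_at = {}  # user_id -> time at which the entry lapses

    def heartbeat(self, user_id, now):
        # Each heartbeat rewrites the entry and restarts the TTL.
        self._expires_at[user_id] = now + self.ttl

    def is_online(self, user_id, now):
        # Missing or lapsed entry means the user is offline.
        return self._expires_at.get(user_id, float("-inf")) > now


presence = PresenceService(ttl_seconds=10)
presence.heartbeat("user_a", now=0)
presence.heartbeat("user_a", now=5)  # heartbeats arrive every 5s
```

With a 5s heartbeat and a 10s TTL, one lost heartbeat is tolerated before the user flips to offline.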

10. Interactive Decision Visualizer

Polling vs WebSocket Simulator

This simulator shows the network overhead of checking for new messages.

  • Polling: Creates a new HTTP connection every second. High overhead.
  • Long Polling: Holds connection open.
  • WebSocket: Keeps a single TCP connection open. Near-zero overhead.

[Simulator panels: HTTP Polling 🐢 (Client and Server, a running request counter, Overhead: High (HTTP Headers)) and WebSocket ⚡ (Client and Server, Requests: 1, Overhead: Low (Persistent)).]
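
The comparison can be approximated by counting requests over a fixed window. This is a rough sketch, not a network model: it assumes a 1-second polling interval, a 30-second long-poll timeout, and messages arriving at known times; the function names are hypothetical.

```python
def polling_requests(duration_s, interval_s=1):
    """Short polling: one request per interval, whether or not anything arrived."""
    return duration_s // interval_s


def long_polling_requests(duration_s, message_times, timeout_s=30):
    """Long polling: each request ends at the next message or at the timeout."""
    requests, now = 0, 0
    arrivals = sorted(message_times)
    while now < duration_s:
        requests += 1
        deadline = now + timeout_s
        next_message = next((t for t in arrivals if now < t <= deadline), None)
        now = next_message if next_message is not None else deadline
    return requests


WEBSOCKET_HANDSHAKES = 1  # one upgrade request, then frames on the same connection

window_s = 60
message_times = [10, 20, 45]
short_poll = polling_requests(window_s)
long_poll = long_polling_requests(window_s, message_times)
```

For three messages in a minute, short polling issues 60 requests, long polling issues 4, and the WebSocket needs a single handshake.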

11. Requirements Traceability

| Requirement | Design Decision | Justification |
|:------------|:----------------|:--------------|
| Real-Time Delivery | WebSockets | Persistent connections avoid per-message connection setup, which makes the < 100ms target realistic. |
| Message Ordering | Sequence IDs | Per-chat Sequence IDs (a logical counter) ensure FIFO within a chat; timestamps alone are subject to clock skew. |
| Scalability (Storage) | HBase / Cassandra | Wide-column stores handle billions of small writes efficiently. |
| Group Chat | Hybrid Fanout | Push for small groups, Pull for large channels. |
| Presence | Redis + Heartbeat | Ephemeral keys with TTL map naturally to “Online Status”. |

12. Observability (RED Method)

  • Rate: Messages Sent per Second.
  • Errors: Failed Message Deliveries / WebSocket Disconnects.
  • Duration: End-to-End Latency (Sender -> Server -> Receiver).

Key Metrics:

  • active_connections: Number of open WebSockets per server (Capacity Planning).
  • message_queue_depth: If using Kafka for async tasks.

13. Deployment Strategy

  • Connection Draining: When deploying a new Chat Server, we cannot just kill the old one (it has active WebSockets). We must stop accepting new connections and wait for old ones to close (or force reconnect).
  • Blue/Green: Essential for zero-downtime updates of stateful services.
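
Connection draining can be sketched as a small state machine on the old server. This is an illustration only; DrainingChatServer and its methods are hypothetical names, and real deployments would coordinate this with the load balancer.

```python
class DrainingChatServer:
    def __init__(self):
        self.accepting = True
        self.connections = set()

    def accept(self, conn_id):
        if not self.accepting:
            return False  # the load balancer should send this client elsewhere
        self.connections.add(conn_id)
        return True

    def close(self, conn_id):
        self.connections.discard(conn_id)

    def start_drain(self):
        """Stop taking new connections; existing ones keep working."""
        self.accepting = False

    def force_reconnect(self):
        """Close the stragglers; returns the ids that were told to reconnect."""
        dropped = sorted(self.connections)
        self.connections.clear()
        return dropped

    def safe_to_stop(self):
        return not self.accepting and not self.connections


old_server = DrainingChatServer()
old_server.accept("conn-1")
old_server.accept("conn-2")
old_server.start_drain()
rejected = old_server.accept("conn-3")  # refused during the drain
old_server.close("conn-1")              # this client left on its own
dropped = old_server.force_reconnect()  # grace period over
```

Clients that are forced off reconnect through the load balancer, land on a new server, and catch up via the sync protocol.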

14. Interview Gauntlet

Q1: How do you handle “Read Receipts” in a Group of 500?

  • Answer: We do NOT send a separate event for every read. We aggregate them. The client sends “Read up to ID 100”. The server updates the “Read Watermark” for that user in that group.
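
A watermark needs only one number per user per group. The sketch below assumes an in-memory dict as storage; ReadWatermarks and its methods are hypothetical names.

```python
class ReadWatermarks:
    def __init__(self):
        self._watermark = {}  # (group_id, user_id) -> highest message ID read

    def mark_read(self, group_id, user_id, up_to_id):
        """Handle 'Read up to ID N'; stale or reordered updates never move it back."""
        key = (group_id, user_id)
        self._watermark[key] = max(self._watermark.get(key, 0), up_to_id)

    def read_by(self, group_id, members, msg_id):
        """Members whose watermark has reached msg_id."""
        return [u for u in members if self._watermark.get((group_id, u), 0) >= msg_id]


marks = ReadWatermarks()
marks.mark_read("group-1", "alice", 100)
marks.mark_read("group-1", "alice", 90)  # late update, ignored by max()
marks.mark_read("group-1", "bob", 50)
readers = marks.read_by("group-1", ["alice", "bob", "carol"], msg_id=60)
```

Reading 100 messages costs one update instead of 100 receipt events, which is what keeps a 500-member group manageable.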

Q2: What happens if a user is offline?

  • Answer: The message is stored in the DB (HBase). When the user connects, they send their LastReadID. The server queries HBase for WHERE chat_id = X AND msg_id > LastReadID.

Q3: How do you store images?

  • Answer: Never in the WebSocket. Upload to S3 (HTTP) -> Get URL -> Send URL in WebSocket message.

15. Whiteboard Summary

Real-Time Arch

  • Protocol: WebSockets (Stateful).
  • Discovery: ZooKeeper (User -> Server Map).
  • Sync: Sequence IDs (Logical Clock).

Storage & Ops

  • DB: HBase (Write-Heavy, Range Scan).
  • Presence: Redis (Heartbeat TTL).
  • Media: Upload to S3 over HTTP, send the URL over WebSocket.