Design Slack/Discord (Real-Time Messaging)

[!NOTE] This module explores the core principles of designing a Slack/Discord-style real-time messaging system, deriving solutions from first principles and hardware constraints to build production-ready expertise.

1. What is a Real-Time Messaging System?

Building a chat app for 10 users is easy: INSERT into a database and SELECT * every second. Building Slack (Enterprise) or Discord (Gaming Communities) for 10 Million concurrent users is a Distributed Systems masterpiece.

The challenge is not just storing messages; it’s Synchronization.

Real-Time: When I type “Hello”, 50,000 people in the #general channel must see it in < 50ms.
Presence: Knowing exactly who is “Online”, “Idle”, or “Typing…” among millions of users.
Statefulness: Unlike a REST API, the server must maintain a persistent TCP connection (WebSocket) with the client.

[!TIP] Real-World Examples:

Slack: Workplace communication (High reliability, structured channels).

Discord: Voice/Text for communities (Massive scale, ephemeral voice channels).

WhatsApp: Mobile-first, End-to-End Encryption (different architecture: a persistent connection using a custom XMPP-derived protocol, with push notifications when the app is in the background).

2. Requirements & Goals

2.1 Functional Requirements

1-on-1 & Group Chat: Send/Receive messages instantly.
Channels: Support for large channels (e.g., Discord servers with 500k members).
Presence: Show Online/Offline status in real-time.
History: Infinite scroll of message history.
Multi-Device: Sync state between Phone and Laptop.

2.2 Non-Functional Requirements

Low Latency: Message delivery < 50ms (within the same region).
High Availability: 99.99%. Chat is often business-critical.
Scalability: Handle 10 Million concurrent connections.
Consistency: Messages must appear in the correct order (Total Ordering within a channel).

3. Capacity Estimation

3.1 Traffic Analysis

DAU: 20 Million.
Concurrent Users: 10 Million (Peak).
Messages: 50 msg/user/day → 1 Billion msg/day.
Write QPS: 10⁹ / 86400 ≈ 11,500 msg/sec.
Peak QPS: 5x Average → ~60,000 msg/sec.

3.2 Bandwidth & Storage

Avg Message Size: 100 Bytes.
Ingress Bandwidth: 60k × 100 Bytes = 6 MB/s (Trivial).
Egress Bandwidth (Fanout):
If a user posts to a channel with 10k online users: 100 Bytes × 10,000 = 1 MB for a single message.
This Fanout is the bottleneck.
Storage: 1 Billion msg/day × 100 Bytes = 100 GB/day.
5 Years: 100 GB × 365 × 5 ≈ 180 TB.
Conclusion: We need a sharded NoSQL store (Cassandra/ScyllaDB) for history.
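
The arithmetic in this section can be reproduced in a few lines. This is a minimal sketch that uses only the figures stated above (20M DAU, 50 msg/user/day, 100-byte messages, 5x peak factor); the variable names are illustrative.

```python
# Back-of-envelope capacity numbers; every input is an assumption from the text.
DAU = 20_000_000
MSGS_PER_USER_PER_DAY = 50
AVG_MSG_BYTES = 100
PEAK_FACTOR = 5
SECONDS_PER_DAY = 86_400

msgs_per_day = DAU * MSGS_PER_USER_PER_DAY        # 1 billion messages/day
avg_write_qps = msgs_per_day / SECONDS_PER_DAY    # about 11,574 msg/sec
peak_write_qps = avg_write_qps * PEAK_FACTOR      # about 57,870, rounded to 60k in the text

ingress_bytes_per_sec = 60_000 * AVG_MSG_BYTES    # 6,000,000 B/s = 6 MB/s
fanout_bytes_per_msg = AVG_MSG_BYTES * 10_000     # 1,000,000 B = 1 MB per message

storage_gb_per_day = msgs_per_day * AVG_MSG_BYTES / 1e9   # 100 GB/day
storage_tb_5_years = storage_gb_per_day * 365 * 5 / 1000  # 182.5 TB

print(f"avg write QPS : {avg_write_qps:,.0f}")
print(f"peak write QPS: {peak_write_qps:,.0f}")
print(f"storage, 5 yrs: {storage_tb_5_years:.1f} TB")
```

Ingress stays small even at peak, while a single message to a 10k-member channel already costs 1 MB of egress, which is why the fanout path dominates the design.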

4. System APIs

We use a hybrid approach: REST for actions (Login, Join Channel, Upload File) and WebSockets for real-time events.

| Method | Endpoint | Description |
| --- | --- | --- |
| `POST` | `/v1/login` | Authenticates user, returns `auth_token` and `gateway_url`. |
| `POST` | `/v1/channels/{id}/messages` | Sends a message. Payload: `{ content: "Hello" }` |
| `GET` | `/v1/channels/{id}/history` | Fetches old messages. Params: `before_id=...` |
| `WS` | `/gateway` | WebSocket Handshake. Params: `token=...` |

5. Database Design

5.1 Cassandra (Message History)

We need massive write throughput and range queries (get messages by time).

Partition Key: channel_id (Groups all messages for a channel together).
Clustering Key: message_id (Snowflake ID, time-sorted).

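CREATE TABLE channel_messages (
  channel_id BIGINT,      -- partition key: one partition per channel
  message_id BIGINT,      -- clustering key: Snowflake ID, time-sorted
  user_id BIGINT,
  content TEXT,
  created_at TIMESTAMP,
  PRIMARY KEY (channel_id, message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);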
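
To illustrate how the `before_id` parameter of the history endpoint maps onto this layout, here is a minimal in-memory sketch. The `partition` list and `fetch_history` function are hypothetical stand-ins for a single channel's partition and a range query on `message_id`; they are not a Cassandra client.

```python
from bisect import bisect_left

# Stand-in for one partition: the message_ids of a single channel_id,
# kept in ascending order. (Hypothetical data, for illustration only.)
partition = [101, 105, 109, 120, 134, 150, 171]

def fetch_history(message_ids, before_id=None, limit=3):
    """Return up to `limit` ids older than `before_id`, newest first.

    Mirrors a query of the shape:
      SELECT ... WHERE channel_id = ? AND message_id < ? LIMIT ?
    against a table clustered by message_id in descending order.
    """
    end = len(message_ids) if before_id is None else bisect_left(message_ids, before_id)
    start = max(0, end - limit)
    return message_ids[start:end][::-1]

first_page = fetch_history(partition)                              # newest three
second_page = fetch_history(partition, before_id=first_page[-1])   # next three
third_page = fetch_history(partition, before_id=second_page[-1])   # remainder

print(first_page, second_page, third_page)
```

Because the cursor is the last `message_id` seen rather than an offset, each page is a bounded range read inside one partition, which is what makes infinite scroll cheap.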

5.2 Redis (State & Presence)

User Session: user:{id}:gateway → 10.0.0.5 (Which server holds the TCP connection?)
Presence: user:{id}:status → online (TTL 40s, refreshed by the Gateway roughly every 30s while heartbeats keep arriving).

6. High-Level Architecture

We move from “Request-Response” to a Stateful Gateway Architecture.

[Figure: System Architecture: Real-Time Chat (Stateful Gateway | Redis Pub/Sub Fanout | Cassandra History). User A (sender) holds a WebSocket to Gateway 1; Users B, C, D (receivers) hold WebSockets to Gateway 2. The Chat Service (orchestrator) publishes to Redis Pub/Sub for channel fanout and writes message logs to Cassandra. Gateways register with Service Discovery (ZooKeeper / Etcd). Paths shown: WebSocket path, Pub/Sub fanout, persistence path.]

7. Component Design (Deep Dive)

7.1 Gateway Aggregation

A user might belong to 100 channels. If we subscribe the User’s Gateway connection to 100 Redis channels, Redis will be overwhelmed by the number of subscriptions.

Naive Approach: 10M Users × 100 Channels = 1 Billion Redis subscriptions, one per (user, channel) pair. That is too many for Redis to track and fan out to.
Optimized Approach: The Gateway subscribes to Redis channels, not the user.
If User A (on GW-1) and User B (on GW-1) are both in #general, GW-1 subscribes to #general once.
When GW-1 receives a message for #general, it looks up its local Channel → [Socket] map and fans out locally in memory.
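
The gateway-level subscription can be sketched as follows. This is a minimal in-memory model, not a production gateway: the `Gateway` class, the `subscribe_upstream` hook (standing in for a Redis SUBSCRIBE call), and the `delivered` list are all hypothetical names used for illustration.

```python
from collections import defaultdict

class Gateway:
    """Holds local sockets and subscribes upstream once per channel."""

    def __init__(self, name, subscribe_upstream):
        self.name = name
        # Hypothetical hook; in a real system this would issue a Redis SUBSCRIBE.
        self._subscribe_upstream = subscribe_upstream
        self._sockets_by_channel = defaultdict(set)  # channel -> {socket ids}
        self.delivered = []  # (socket_id, channel, payload), recorded for the demo

    def join(self, socket_id, channel):
        # Subscribe upstream only when the first local member joins.
        if not self._sockets_by_channel[channel]:
            self._subscribe_upstream(channel)
        self._sockets_by_channel[channel].add(socket_id)

    def on_upstream_message(self, channel, payload):
        # Local in-memory fanout: one upstream message, many socket writes.
        for socket_id in sorted(self._sockets_by_channel.get(channel, ())):
            self.delivered.append((socket_id, channel, payload))

upstream_subscriptions = []
gw1 = Gateway("GW-1", upstream_subscriptions.append)
gw1.join("user_a", "#general")
gw1.join("user_b", "#general")
gw1.on_upstream_message("#general", "Hello")

print(upstream_subscriptions)  # one subscription despite two local members
print(gw1.delivered)
```

With this shape, the number of Redis subscriptions scales with gateways × active channels instead of users × channels.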

7.2 Presence (Heartbeats)

Presence is a “Heavy Write” problem. 10M users sending “I’m alive” every 5 seconds = 2M writes/sec.

Optimization: Do not write to DB on every heartbeat.
1. Client: Sends heartbeat to Gateway (WebSocket Ping).
2. Gateway: Holds state in memory. Only updates Redis if status changes or TTL is about to expire (e.g., every 30s).
3. Redis: Keys expire automatically (SETEX user:1:status 40 "online"). If Gateway crashes, key expires, user appears offline.
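
The three steps above can be sketched with an in-memory stand-in for Redis. `FakeRedis` and `PresenceGateway` are hypothetical classes written for this example, and time is passed in explicitly as `now` (in seconds) so the behaviour is deterministic; the 30s refresh interval and 40s TTL follow the numbers in the text.

```python
class FakeRedis:
    """Tiny in-memory stand-in for SETEX/GET with expiry (not a real client)."""

    def __init__(self):
        self._data = {}
        self.writes = 0

    def setex(self, key, ttl, value, now):
        self._data[key] = (value, now + ttl)
        self.writes += 1

    def get(self, key, now):
        value, expires_at = self._data.get(key, (None, 0))
        return value if now < expires_at else None

class PresenceGateway:
    REFRESH_EVERY = 30  # seconds between Redis refreshes for an unchanged status
    TTL = 40            # key lifetime, slightly longer than the refresh interval

    def __init__(self, store):
        self.store = store
        self._last_write = {}  # user_id -> (status, time of last Redis write)

    def heartbeat(self, user_id, status, now):
        last = self._last_write.get(user_id)
        if last and last[0] == status and now - last[1] < self.REFRESH_EVERY:
            return  # absorbed in gateway memory, no Redis write
        self.store.setex(f"user:{user_id}:status", self.TTL, status, now)
        self._last_write[user_id] = (status, now)

store = FakeRedis()
gateway = PresenceGateway(store)
for t in range(0, 61, 5):          # 13 heartbeats over one minute
    gateway.heartbeat(1, "online", now=t)

print(store.writes)                          # Redis writes actually issued
print(store.get("user:1:status", now=60))    # still online
print(store.get("user:1:status", now=101))   # expired after the gateway went silent
```

In this run, 13 heartbeats turn into 3 Redis writes, and the key disappears on its own once refreshes stop, which is how a crashed gateway eventually shows its users as offline.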

8. Data Partitioning & Sharding

8.1 Sharding Messages (Cassandra)

We shard by channel_id.

Pros: All messages for a channel live in the same partition (on the same replica set), so reading recent history is a single-partition sequential read.
Cons: The Celebrity Problem. If #general has 1B messages, the partition gets too big.
Fix: Bucket the partition by time. Partition Key = (channel_id, month_year).
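
A minimal sketch of the time-bucketed key, assuming the bucket is a UTC "YYYY-MM" string derived from the message timestamp. The function names and the string format are illustrative choices, not something the schema above prescribes.

```python
from datetime import datetime, timezone

def partition_key(channel_id, created_at):
    """Return (channel_id, 'YYYY-MM') for a timezone-aware datetime."""
    bucket = created_at.astimezone(timezone.utc).strftime("%Y-%m")
    return (channel_id, bucket)

def buckets_newest_first(oldest, newest):
    """List the monthly buckets a history read may need to visit, newest first."""
    year, month = newest.year, newest.month
    buckets = []
    while (year, month) >= (oldest.year, oldest.month):
        buckets.append(f"{year:04d}-{month:02d}")
        # Step back one month, wrapping from January to the previous December.
        year, month = (year, month - 1) if month > 1 else (year - 1, 12)
    return buckets

sent_at = datetime(2024, 2, 3, 12, 0, tzinfo=timezone.utc)
key = partition_key(42, sent_at)

channel_created = datetime(2023, 11, 20, tzinfo=timezone.utc)
scan_order = buckets_newest_first(channel_created, sent_at)

print(key)
print(scan_order)
```

The trade-off is that scrolling far back now crosses partitions: the reader walks the buckets newest first and stops once it has enough messages.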

8.2 Service Discovery

How does User A know to connect to Gateway-52?

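Static Hashing: hash(user_id) % N_Gateways.
Problem: This is plain modulo hashing, not consistent hashing. When N changes, most users map to a different gateway, so adding or removing a gateway breaks most connections. A consistent-hash ring limits the remapping to roughly 1/N of users, but it still ignores gateway load and health.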
Service Discovery (ZooKeeper/Etcd): Gateways register themselves. The Load Balancer asks ZK for an available node and assigns it to the user.
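
To make the cost of a modulo-based assignment concrete, the sketch below measures how many of 10,000 users change gateway when the cluster grows from 10 to 11 nodes, and compares that with a consistent-hash ring. The `Ring` class, the `gw-N` node names, and the choice of 100 virtual nodes per gateway are assumptions made for this example.

```python
import hashlib
from bisect import bisect_right

def stable_hash(key):
    """Deterministic 64-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

def modulo_assign(user_id, n_gateways):
    return stable_hash(f"user:{user_id}") % n_gateways

class Ring:
    """Consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=100):
        self._points = sorted(
            (stable_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._hashes = [point[0] for point in self._points]

    def assign(self, user_id):
        # First ring point clockwise from the user's hash, wrapping at the end.
        idx = bisect_right(self._hashes, stable_hash(f"user:{user_id}")) % len(self._points)
        return self._points[idx][1]

users = range(10_000)

moved_modulo = sum(modulo_assign(u, 10) != modulo_assign(u, 11) for u in users) / len(users)

old_ring = Ring([f"gw-{i}" for i in range(10)])
new_ring = Ring([f"gw-{i}" for i in range(11)])
moved_ring = sum(old_ring.assign(u) != new_ring.assign(u) for u in users) / len(users)
moved_targets = {new_ring.assign(u) for u in users if old_ring.assign(u) != new_ring.assign(u)}

print(f"modulo: {moved_modulo:.1%} of users remapped")   # expect roughly 10/11
print(f"ring  : {moved_ring:.1%} of users remapped")     # expect roughly 1/11
print(f"ring remaps only onto: {moved_targets}")
```

Even with a ring, a registry-based assignment is often preferred for WebSocket gateways, since it can also account for which nodes are healthy and how loaded they are.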

9. Reliability, Caching, & Load Balancing

9.1 The “Unread Count” badge

Calculating unread counts (SELECT count(*) WHERE id > last_read_id) is expensive.

Optimization: Store an unread_count per (user, channel) in Redis. Increment it when a message arrives in that channel and reset it to 0 when the user opens the channel.
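
A minimal sketch of the counter logic, using a dictionary in place of Redis. `UnreadCounters` and its method names are hypothetical; the comments indicate the Redis commands (INCR, DEL) that would plausibly play the same role.

```python
from collections import defaultdict

class UnreadCounters:
    """In-memory stand-in for Redis counters keyed by (user_id, channel_id)."""

    def __init__(self):
        self._counts = defaultdict(int)

    def on_message(self, channel_id, sender_id, member_ids):
        for user_id in member_ids:
            if user_id != sender_id:                      # the sender has already seen it
                self._counts[(user_id, channel_id)] += 1  # Redis equivalent: INCR

    def on_open(self, user_id, channel_id):
        self._counts.pop((user_id, channel_id), None)     # Redis equivalent: DEL

    def unread(self, user_id, channel_id):
        return self._counts.get((user_id, channel_id), 0)

counters = UnreadCounters()
members = ["alice", "bob", "carol"]
counters.on_message("general", "alice", members)
counters.on_message("general", "carol", members)
counters.on_open("bob", "general")

print(counters.unread("alice", "general"))
print(counters.unread("bob", "general"))
print(counters.unread("carol", "general"))
```

Note that the per-member increment is itself a fanout, so for very large channels a real system might track only a last-read marker per user and compute the badge lazily.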

9.2 Mobile Push Notifications

If the WebSocket is disconnected (App closed), the Gateway cannot push.

Fallback: The Notification Service detects that the user has no live WebSocket connection and sends a push payload to APNs (iOS) or FCM (Android).
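
The fallback decision can be sketched as a small routing function. The `sessions` mapping mirrors the `user:{id}:gateway` lookup from the Redis section; `platform_by_user` and the returned tuples are hypothetical shapes chosen for this example.

```python
def route_delivery(user_id, sessions, platform_by_user):
    """Decide how to reach a user: live WebSocket if one exists, else mobile push."""
    gateway = sessions.get(user_id)  # which gateway holds the connection, if any
    if gateway is not None:
        return ("websocket", gateway)
    platform = platform_by_user.get(user_id)
    if platform == "ios":
        return ("push", "APNs")
    if platform == "android":
        return ("push", "FCM")
    return ("store_only", None)      # no connection and no push target registered

sessions = {"alice": "10.0.0.5"}                      # alice is connected
platforms = {"alice": "ios", "bob": "android"}        # registered mobile platforms

print(route_delivery("alice", sessions, platforms))
print(route_delivery("bob", sessions, platforms))
print(route_delivery("carol", sessions, platforms))
```

In every case the message is still persisted to history, so a user who receives nothing in real time catches up on the next connect.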

10. Interactive Decision Visualizer: Pub/Sub Propagation

Visualize how a single message fans out through Redis to multiple Gateways and Users.

[Interactive simulator: traces a message from Alice through GW 1 and Redis Pub/Sub to GW 2, which delivers it to Bob and Charlie.]

11. Interview Gauntlet

Q1: How do you handle “Typing…” indicators?

Answer: Typing indicators are ephemeral. Do not store them in the DB. Use a lightweight Redis Pub/Sub channel. Throttle on the client so that it sends a signal at most once every 2 seconds while typing, not on every keystroke.
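
The once-every-2-seconds rule behaves like a throttle, which can be sketched in a few lines. `TypingThrottle` is a hypothetical client-side helper; time is passed in as `now` (seconds) so the example is deterministic.

```python
class TypingThrottle:
    """Allow at most one 'typing' signal per `interval` seconds."""

    def __init__(self, interval=2.0):
        self.interval = interval
        self._last_sent = None

    def on_keystroke(self, now):
        """Return True if a 'typing' signal should be sent for this keystroke."""
        if self._last_sent is None or now - self._last_sent >= self.interval:
            self._last_sent = now
            return True
        return False

throttle = TypingThrottle(interval=2.0)
keystroke_times = [i * 0.25 for i in range(21)]   # one key every 250 ms for 5 s
sent_at = [t for t in keystroke_times if throttle.on_keystroke(t)]

print(len(keystroke_times), "keystrokes ->", len(sent_at), "signals at", sent_at)
```

The receiving side would typically clear the indicator on its own if no new signal arrives within a few seconds, so no explicit "stopped typing" event needs to be stored.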

Q2: What happens if a user is in 500 channels? Do they keep 500 WebSocket connections?

Answer: No. One WebSocket connection per device. The Gateway multiplexes messages from all 500 channels down that single pipe.

Q3: How do you sync messages across devices (Phone + Laptop)?

Answer: Each device has a unique device_id. When a message is sent, the server pushes it to every device_id associated with each recipient user_id, and also to the sender's other devices (every device except the one that sent it).
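
A minimal sketch of the device fanout, assuming the sender's other devices should also receive the message so they stay in sync. The `devices_by_user` mapping and the function name are hypothetical.

```python
def devices_to_notify(devices_by_user, recipient_ids, sender_id, sender_device_id):
    """Return (user_id, device_id) pairs that should receive the message.

    Every device of every recipient is included, plus the sender's other
    devices; only the device that sent the message is skipped.
    """
    targets = []
    # dict.fromkeys de-duplicates while keeping order, in case the sender
    # also appears in the recipient list (as in a group channel).
    for user_id in dict.fromkeys([sender_id, *recipient_ids]):
        for device_id in devices_by_user.get(user_id, []):
            if device_id != sender_device_id:
                targets.append((user_id, device_id))
    return targets

devices_by_user = {
    "alice": ["alice-phone", "alice-laptop"],
    "bob": ["bob-phone"],
}
targets = devices_to_notify(devices_by_user, ["bob"], "alice", "alice-phone")
print(targets)
```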

Q4: Why not use HTTP Long Polling?

Answer: Long polling is inefficient for chat because of the header overhead and latency in re-establishing connections. WebSockets are preferred for bi-directional, low-latency comms.

Q5: How do you sort messages if two people send at the exact same millisecond?

Answer: Use Snowflake IDs (Twitter’s ID generator) which are roughly time-ordered. If timestamps are identical, sort by worker_id or sequence_id embedded in the Snowflake.
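
The tie-breaking works because of how the ID is packed. Below is a minimal sketch of the commonly described Snowflake layout (41-bit millisecond timestamp, 10-bit worker id, 12-bit sequence). The `EPOCH_MS` value is a hypothetical custom epoch chosen for this example, and the helper names are illustrative.

```python
EPOCH_MS = 1_600_000_000_000   # hypothetical custom epoch, in Unix milliseconds
WORKER_BITS = 10
SEQUENCE_BITS = 12
WORKER_MASK = (1 << WORKER_BITS) - 1
SEQUENCE_MASK = (1 << SEQUENCE_BITS) - 1

def make_id(timestamp_ms, worker_id, sequence):
    """Pack timestamp, worker id, and sequence into one sortable integer."""
    return (
        ((timestamp_ms - EPOCH_MS) << (WORKER_BITS + SEQUENCE_BITS))
        | ((worker_id & WORKER_MASK) << SEQUENCE_BITS)
        | (sequence & SEQUENCE_MASK)
    )

def split_id(snowflake):
    """Unpack an id back into (timestamp_ms, worker_id, sequence)."""
    timestamp_ms = (snowflake >> (WORKER_BITS + SEQUENCE_BITS)) + EPOCH_MS
    worker_id = (snowflake >> SEQUENCE_BITS) & WORKER_MASK
    sequence = snowflake & SEQUENCE_MASK
    return timestamp_ms, worker_id, sequence

t = 1_700_000_000_000
a = make_id(t, worker_id=3, sequence=0)       # same millisecond, worker 3
b = make_id(t, worker_id=7, sequence=0)       # same millisecond, worker 7
c = make_id(t + 1, worker_id=1, sequence=0)   # one millisecond later

print(sorted([c, b, a]) == [a, b, c])
print(split_id(b))
```

Since the timestamp occupies the high bits, comparing the raw integers sorts by time first and falls back to worker id and sequence for messages created in the same millisecond.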

12. Summary: The Whiteboard Strategy

1. Requirements

Func: Chat, History, Presence.
Scale: 10M Concurrent, < 50ms Latency.

2. Architecture

[Client] <-> [Gateway] <-> [Redis] | [Cassandra] (History)

Gateway: Stateful WebSocket holder.
Redis: Pub/Sub for routing.

3. Data & API

WS /gateway → Connect
Cassandra: (channel_id, message_id)

4. Deep Dives

Fanout: Gateway subscribes, not User.
Presence: Heartbeats to Redis with TTL.