Design an Ad Click Aggregator
[!IMPORTANT] In this lesson, you will master:
- High-Throughput Ingestion: Building intuition behind buffering billions of clicks using Kafka to prevent systemic collapse during traffic spikes.
- Exactly-Once Processing: Understanding how Apache Flink guarantees precision through distributed checkpointing and two-phase commits.
- OLAP Analytics: Why traditional databases fail at scale and how columnar stores like ClickHouse enable real-time dashboarding.
1. Problem Statement
In the world of online advertising, every click counts—literally. An Ad Click Aggregator is a system that processes billions of click events from various websites and mobile apps to provide real-time reports for advertisers and billing services.
The Challenge:
- Scale: Handling billions of clicks per day.
- Precision: We cannot overcharge (double counting) or undercharge (missing clicks).
- Latency: Advertisers want to see their budget consumption in near real-time (< 1 minute).
Real-World Examples:
- Google Ads: Real-time bidding and budget tracking.
- Facebook Ads: Aggregating impressions and clicks for campaign performance.
[!TIP] Analogy: The Toll Booth Imagine a massive highway with 1,000 lanes. Every time a car passes, you need to count it to bill the driver. If you miss a car, you lose money. If you count a car twice because it swerved between lanes, the driver gets angry. This system is the digital version of that toll booth, but for billions of “cars” (clicks).
[!NOTE] War Story: The Super Bowl Ad Blackout How Company X handled a Thundering Herd problem during a massive sports event. Millions of users clicked an interactive ad at the exact same moment, flooding the ingest API. They survived by instantly scaling their Kafka partitions and heavily relying on asynchronous processing to buffer the spike, preventing database collapse.
2. Requirements & Goals (P - Problem & Requirements)
Functional Requirements
- Aggregate Clicks: Count clicks per
ad_idover time windows (e.g., 1 minute, 1 hour). - Filter Fraud: Detect and filter out bot clicks or duplicates.
- Queryable Results: Provide an API for internal dashboards to query aggregated stats.
Non-Functional Requirements
- Exactly-Once Processing: Every click must be counted exactly once, despite failures.
- High Throughput: Support up to 100,000 clicks/sec.
- Low Latency: End-to-end delay from click to dashboard should be < 5 seconds.
- Fault Tolerance: The system must recover from node crashes without data loss.
3. Capacity Estimation (E - Estimation)
Traffic Estimates
Assume 100,000 clicks per second.
- Daily Clicks: 100,000 * 86,400 = 8.64 Billion clicks/day.
Storage Estimates
- Raw Event Size: ~100 bytes.
- Daily Storage (Raw): 8.64 Billion * 100 bytes = 864 GB/day.
- Aggregated Storage: If we aggregate by
(ad_id, minute), and we have 1 million active ads, that is 1 million rows per minute. 1M * 50 bytes = 50 MB/minute = 72 GB/day.
4. Data Model (D - Data Model)
We need to define the schema for the raw events coming in and the aggregated data stored in the OLAP database.
Raw Click Event (Kafka / Flink Ingest)
{
"click_id": "uuid-1234",
"ad_id": "ad-987",
"user_id": "u-456",
"ip_address": "192.168.1.1",
"timestamp": "2023-10-27T10:00:00Z"
}
Aggregated Ad Stats (ClickHouse) We store pre-aggregated windows to reduce database load.
| Column | Type | Description |
|---|---|---|
ad_id |
String | The unique identifier for the ad. |
window_start |
Timestamp | The start of the 1-minute aggregation window. |
click_count |
UInt64 | The number of valid clicks in this window. |
5. High-Level Architecture (A - Architecture)
Interview-Friendly High-Level Diagram
This is the simplified version of the architecture you should draw on the whiteboard.
graph TD
Click([Ad Click]) --> LB[Load Balancer]
subgraph Ingestion
LB --> API[Log Ingestor API]
API --> Kafka[Kafka Queue]
end
subgraph Stream Processing
Kafka --> Flink[Apache Flink]
Flink -- Fraud Lookup --> Redis[(Redis TTL)]
end
subgraph Storage & Analytics
Flink -- Batched Aggregates --> DB[(ClickHouse OLAP)]
Dashboard([Advertiser Dashboard]) --> DB
end
We use a Kappa Architecture (Stream Processing) to handle both real-time counts and re-processing.