Design a Web Crawler
[!NOTE] This module explores the core principles of Design a Web Crawler, deriving solutions from first principles and hardware constraints to build world-class, production-ready expertise.
1. What is a Web Crawler?
A Web Crawler (or Spider) is a bot that systematically browses the World Wide Web, typically for the purpose of Web Indexing (Google Search). It starts with a list of “Seed URLs”, downloads the page, extracts links, and adds them to the queue.
Use Cases
- Search Engine Indexing: Google, Bing.
- Web Archiving: Wayback Machine.
- Web Mining: Financial analysis, Copyright checks, LLM Training Data.
Try it yourself: Open the `robots.txt` of any major site (e.g., `google.com/robots.txt`). See the `Disallow` rules. This is what the crawler must respect.
2. Requirements & Goals
Functional Requirements
- Crawl: Fetch content from a given URL.
- Extract: Parse HTML and find new links (`<a href>`).
- Store: Save the content for the Indexer.
Non-Functional Requirements
- Scalability: The web is huge (billions of pages).
- Politeness: DO NOT DDoS the target servers. Respect `robots.txt`.
- Robustness: Handle malformed HTML, 404s, infinite redirect loops, and spider traps.
- Extensibility: Support new content types (PDF, Images).
3. Capacity Estimation
Let’s design for 1 Billion Pages per Month.
Throughput (QPS)
- 1 Billion / 30 days / 24 hours / 3600 sec ≈ 400 Pages/sec.
- Peak traffic: 2× = 800 QPS.
- This seems low, but fetching a page is slow (DNS + Network + Parsing). We need massive concurrency (thousands of threads).
Storage
- Avg page size: 100 KB (HTML only).
- Storage = 1B × 100 KB = 100 TB / month.
- In 5 years: 6 PB. (We need BigTable / HBase).
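The estimates above can be sanity-checked with a few lines of arithmetic (the page count, average page size, and 2× peak factor are the assumptions stated above, not measurements):

```python
# Back-of-envelope capacity check for 1 billion pages/month.
PAGES_PER_MONTH = 1_000_000_000
SECONDS_PER_MONTH = 30 * 24 * 3600            # ~2.59 million seconds

qps = PAGES_PER_MONTH / SECONDS_PER_MONTH     # average crawl rate
peak_qps = 2 * qps                            # assumed 2x peak factor

AVG_PAGE_KB = 100
monthly_tb = PAGES_PER_MONTH * AVG_PAGE_KB / 1_000_000_000   # KB -> TB
five_year_pb = monthly_tb * 60 / 1000                        # TB -> PB

print(f"{qps:.0f} pages/sec, peak {peak_qps:.0f} QPS")
print(f"{monthly_tb:.0f} TB/month, {five_year_pb:.0f} PB in 5 years")
```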
4. System APIs
The Crawler is an internal system, but the components communicate via RPC (gRPC).
4.1 Frontier Service (Task Queue)
```protobuf
service Frontier {
  // Workers ask for a batch of URLs to crawl
  rpc GetCrawlTasks(WorkerID) returns (stream CrawlTask);

  // Workers report found links
  rpc ReportLinks(ReportRequest) returns (Ack);
}

message CrawlTask {
  string url = 1;
  string domain = 2;
  int32 depth = 3;
}
```
4.2 HTML Fetcher Service
POST /fetch

```json
{
  "url": "https://example.com",
  "timeout": 5000,
  "follow_redirects": true
}
```

Response:

```json
{
  "status": 200,
  "content_type": "text/html",
  "body_hash": "abc12345",
  "content": "<html>...</html>"
}
```
5. Database Design
1. URL Frontier (Queue System)
Stores URLs to be visited. Needs priority and politeness logic.
- Redis / Kafka: Fast queues (Redis is in-memory; Kafka adds a durable log).
- Architecture: Split into “Front Queues” (Priority) and “Back Queues” (Politeness).
2. Content Storage (NoSQL)
Stores the raw HTML.
- BigTable / HBase / Cassandra: Ideal for high write throughput and large data. See Module 15: Deep Dive Data.
- Schema: `RowKey: URL_Hash`, `Col: HTML_Content`, `Col: Metadata`, `Col: SimHash`.
3. Duplicate Detection
We cannot run `SELECT * FROM visited WHERE url = ?` for billions of rows.
- Bloom Filter: Probabilistic data structure to check “Have we seen this URL?”.
6. High-Level Design
Web Crawler Pipeline Architecture.
- Seed URLs: Manually curated list (cnn.com, wikipedia.org).
- URL Frontier: Distributes URLs to workers (Prioritizer & Politeness).
- HTML Downloader: Fetches pages using Async DNS.
- Content Parser: Validates HTML, extracts links.
- Duplicate Eliminator: Checks Bloom Filter and SimHash (Content Dedup).
- Storage: Saves content to BigTable/HBase.
- Loop: New links go back to Frontier.
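The pipeline above can be sketched as a single-threaded loop. A minimal sketch, assuming injected `fetch` and `store` callables, with a plain set standing in for the Bloom Filter (politeness and `robots.txt` checks omitted for brevity):

```python
from collections import deque
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags (the Content Parser step)."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links += [v for k, v in attrs if k == "href" and v]

def crawl(seed_urls, fetch, store, max_pages=1000):
    """Frontier -> Fetch -> Parse -> Store loop.
    `fetch(url) -> html` and `store(url, html)` are injected dependencies."""
    frontier = deque(seed_urls)          # URL Frontier (no politeness here)
    visited = set()                      # stand-in for the Bloom Filter
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue                     # Duplicate Eliminator
        visited.add(url)
        html = fetch(url)
        store(url, html)
        parser = LinkExtractor()
        parser.feed(html)
        frontier.extend(parser.links)    # Loop: new links back to the Frontier
    return visited
```

In the real system each stage is a separate service and the frontier is distributed; the loop shape stays the same.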
7. Component Design (Deep Dive)
The URL Frontier (Priority vs Politeness)
Visualized in the URL Frontier zone of the diagram. This is the most complex part. It has two main buffers:
- Front Queue (Prioritizer):
  - Prioritize URLs based on PageRank or change frequency.
  - High Priority Q, Medium Q, Low Q.
- Back Queue (Politeness Router):
  - The Rule: Do not hit `wikipedia.org` more than once per second.
  - We map `wikipedia.org` to `Queue #5`.
  - A “Queue Router” ensures that `Queue #5` is only consumed by one worker thread, enforcing the delay.
  - This implements a Rate Limiter. See Module 10: Simple Services (Rate Limiter).
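A minimal in-process sketch of the Back Queue idea, using a hypothetical `PolitenessRouter` class and an assumed fixed queue count (production would shard this across the Frontier service):

```python
import hashlib
import time
from collections import defaultdict, deque
from urllib.parse import urlparse

NUM_BACK_QUEUES = 8   # assumption; real systems size this to the worker count

def back_queue_for(url: str) -> int:
    """Stable domain -> queue mapping: one domain always lands in one queue."""
    domain = urlparse(url).netloc
    digest = hashlib.md5(domain.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_BACK_QUEUES

class PolitenessRouter:
    """Each back queue serves one domain group with a minimum delay between hits."""
    def __init__(self, min_delay_sec=1.0):
        self.queues = defaultdict(deque)
        self.next_allowed = defaultdict(float)   # queue id -> earliest fetch time
        self.min_delay = min_delay_sec

    def push(self, url: str):
        self.queues[back_queue_for(url)].append(url)

    def pop_ready(self, now=None):
        """Return a URL whose queue's delay has elapsed, or None."""
        now = time.monotonic() if now is None else now
        for qid, q in self.queues.items():
            if q and now >= self.next_allowed[qid]:
                self.next_allowed[qid] = now + self.min_delay
                return q.popleft()
        return None
```

Because the mapping is stable, all `wikipedia.org` URLs land in the same queue, and the delay on that queue enforces the one-hit-per-second rule.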
Robot Exclusion Standard (Ethics)
Before crawling `example.com`, we fetch `example.com/robots.txt`.
- Rules: `User-agent: *`, `Disallow: /private/`.
- Legal: While not a law, ignoring it gets you IP banned and sued.
- Implementation: We cache the `robots.txt` for every domain in a KV Store (Redis) with a TTL.
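A sketch of that cache using Python's standard `urllib.robotparser`, with the network fetch injected as a callable and an in-process dict standing in for Redis:

```python
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

class RobotsCache:
    """Caches parsed robots.txt per domain with a TTL."""
    def __init__(self, fetch_lines, ttl_sec=3600):
        # fetch_lines(domain) -> list of robots.txt lines; injected so the
        # network layer stays out of this sketch.
        self.fetch_lines = fetch_lines
        self.ttl = ttl_sec
        self.cache = {}   # domain -> (expiry_time, RobotFileParser)

    def allowed(self, url: str, user_agent="*") -> bool:
        domain = urlparse(url).netloc
        entry = self.cache.get(domain)
        if entry is None or time.monotonic() > entry[0]:
            rp = RobotFileParser()
            rp.parse(self.fetch_lines(domain))          # parse the fetched rules
            entry = (time.monotonic() + self.ttl, rp)
            self.cache[domain] = entry                  # refresh the TTL entry
        return entry[1].can_fetch(user_agent, url)
```

After the first fetch, every URL on the same domain is checked from the cache until the TTL expires.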
Bloom Filters (Mathematical Magic)
A Bloom Filter is a bit array of size m with k hash functions.
- Add Item: Hash it k times, set bits to 1.
- Check Item: Hash it k times. If all bits are 1, it might exist. If any bit is 0, it definitely does not exist.
- False Positive: We might think we visited a page when we haven’t. This is acceptable for a crawler (we just miss one update).
- Space Savings: Stores billions of URLs in RAM.
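A minimal Bloom Filter sketch, deriving the k positions by hashing the item with k different salts (a common sizing rule of thumb: ~10 bits per element with k = 7 gives roughly a 1% false-positive rate):

```python
import hashlib

class BloomFilter:
    """Bit array of size m with k hash functions, as described above."""
    def __init__(self, m_bits: int, k_hashes: int):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray(m_bits // 8 + 1)

    def _positions(self, item: str):
        # Derive k bit positions from k salted digests.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # Any 0 bit => definitely not seen; all 1 bits => probably seen.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))
```

Note the one-way contract: a `False` answer is certain, a `True` answer is only probable; that asymmetry is exactly what a crawler can tolerate.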
DNS Resolver Caching (Why Custom?)
Visualized as DNS Resolver in the diagram.
DNS lookups take 10ms - 200ms. If we crawl 100 pages from nytimes.com, we don’t want 100 DNS requests.
- The OS Problem: Standard libc `getaddrinfo` is blocking. If you have 1000 threads and 500 are waiting on DNS, your CPU is idle.
- The Solution: Build a custom Async DNS Resolver over UDP. We fire a UDP packet, register a callback, and move on. No thread blocking.
- Caching: We cache `hostname → IP` in memory.
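A full async UDP resolver is beyond a short example, but the caching layer itself is simple. A sketch with the lookup injected as a callable, so it can be swapped for an async implementation (it defaults to the blocking `socket.gethostbyname` for illustration):

```python
import socket
import time

class DNSCache:
    """In-memory hostname -> IP cache with a TTL."""
    def __init__(self, ttl_sec=300, resolve=None):
        self.ttl = ttl_sec
        self.cache = {}   # hostname -> (expiry_time, ip)
        # `resolve` is injectable for testing; defaults to a real blocking lookup.
        self.resolve = resolve or (lambda h: socket.gethostbyname(h))

    def lookup(self, hostname: str) -> str:
        now = time.monotonic()
        hit = self.cache.get(hostname)
        if hit and now < hit[0]:
            return hit[1]                        # cache hit: no network round trip
        ip = self.resolve(hostname)              # cache miss: resolve and store
        self.cache[hostname] = (now + self.ttl, ip)
        return ip
```

With this in place, crawling 100 pages from one host costs one resolution, not 100.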
Content Deduplication (SimHash)
How do we know if two pages are 99% similar (Near-Duplicate)?
- Exact Match (MD5): Useless. Changing one character (“the” → “teh”) changes the entire hash.
- SimHash Algorithm:
  - Tokenize: Break text into words/features.
  - Hash: Hash each word into 64 bits.
  - Vector Sum: Start with a vector `V = [0, 0, ..., 0]`. For each bit of the word hash: if the bit is 1, add the word's weight to `V[i]`; if 0, subtract it.
  - Fingerprint: If `V[i] > 0`, set fingerprint bit i to 1; else 0.
- Hamming Distance: If the SimHashes of Doc A and Doc B differ by only 2 or 3 bits, they are near-duplicates. We discard the new one.
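The four steps above, as a sketch (using MD5 as an arbitrary 64-bit word hash and term frequency as the weight; both choices are illustrative):

```python
import hashlib
from collections import Counter

def simhash(text: str, bits: int = 64) -> int:
    """SimHash: tokenize, hash each word, weighted vector sum, threshold."""
    v = [0] * bits
    for word, weight in Counter(text.lower().split()).items():
        digest = hashlib.md5(word.encode()).digest()
        h = int.from_bytes(digest[:8], "big")        # 64-bit word hash
        for i in range(bits):
            v[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i in range(bits):
        if v[i] > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

Because similar texts share most weighted features, their vector sums (and hence fingerprints) land close together, unlike MD5, where one changed word flips roughly half the bits.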
8. Requirements Traceability
| Requirement | Design Decision | Justification |
|---|---|---|
| Politeness | Politeness Queue / Router | Ensures we never exceed QPS limits for a target domain. |
| Throughput | Async DNS Resolver | Prevents thread blocking during DNS lookups, increasing concurrency. |
| Scalability | Bloom Filter | Tracks billions of visited URLs in RAM to prevent cycles. |
| Storage | BigTable / HBase | NoSQL Wide Column Store handles massive write throughput and PB of data. |
| Freshness | Priority Queue (PageRank) | Ensures important pages (high PageRank) are crawled more frequently. |
9. Observability & Metrics
We need to ensure the crawler isn’t stuck or banned.
Key Metrics
- Crawl Rate: Pages/sec per worker.
- DNS Latency: Avg time to resolve IP. Spike indicates DNS throttling.
- Error Rate: 4xx/5xx responses. High 403 means we are IP banned (Rotate Proxy).
- Queue Depth: If Front Queue grows but Back Queue is empty, the Router is slow.
Trap Detection
- Path Depth: Alert if `url_depth > 20`.
- Domain Concentration: Alert if 50% of traffic is hitting one domain (Spider Trap).
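Both trap signals reduce to a small check over recent crawl records; a hypothetical `trap_alerts` helper as a sketch (the thresholds are the ones named above):

```python
from collections import Counter
from urllib.parse import urlparse

MAX_DEPTH = 20
DOMAIN_SHARE_THRESHOLD = 0.5

def trap_alerts(crawled, recent_window):
    """crawled: list of (url, depth); recent_window: recently fetched URLs.
    Returns alert strings for the two trap signals described above."""
    alerts = []
    for url, depth in crawled:
        if depth > MAX_DEPTH:
            alerts.append(f"depth>{MAX_DEPTH}: {url}")
    if recent_window:
        counts = Counter(urlparse(u).netloc for u in recent_window)
        domain, hits = counts.most_common(1)[0]
        share = hits / len(recent_window)
        if share >= DOMAIN_SHARE_THRESHOLD:
            alerts.append(f"domain concentration: {domain} at {share:.0%}")
    return alerts
```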
10. Deployment Strategy
Rolling Updates
We have thousands of worker nodes.
- Drain: Stop assigning new tasks to Worker A.
- Wait: Wait for active crawls to finish (graceful shutdown).
- Update: Deploy new binary.
- Resume: Re-register with Frontier.
Resiliency
If a worker crashes:
- The Frontier detects heartbeat failure.
- The URLs assigned to that worker are re-queued to another worker.
11. Interview Gauntlet
Rapid Fire Questions
- How do you handle `robots.txt`? We cache it for every domain. Before crawling, we check the rules. We respect `User-agent: *` / `Disallow: /admin`.
- What if the page renders with JavaScript (React/Angular)? A standard HTTP GET returns near-empty HTML. We need a “Headless Browser” (Chrome/Puppeteer) service to render the DOM. This is 10x slower and more expensive.
- How do you detect infinite loops? Bloom Filters for visited URLs and Max Depth limit.
- Why not use a standard Hash Set instead of Bloom Filter? 1 Billion URLs * 100 bytes = 100 GB RAM. Bloom Filter uses ~1 GB.
- How do you search ‘SimHash’ fingerprints efficiently? We cannot scan all rows. We split the 64-bit hash into 4 16-bit bands and index each; by the Pigeonhole Principle, any fingerprint within 3 bits of the query matches at least one band exactly, so we only compare candidates from these smaller tables.
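The banding trick can be sketched as four hash tables keyed by 16-bit bands. Since 3 bit-flips can touch at most 3 of the 4 bands, any fingerprint within Hamming distance 3 shares at least one exact band with the query:

```python
from collections import defaultdict

BANDS = 4
BAND_BITS = 16   # 4 x 16 = 64-bit SimHash

def bands_of(fingerprint: int):
    """Split a 64-bit fingerprint into 4 16-bit bands."""
    return [(fingerprint >> (i * BAND_BITS)) & 0xFFFF for i in range(BANDS)]

class SimHashIndex:
    """One table per band; only fingerprints sharing a band become candidates."""
    def __init__(self):
        self.tables = [defaultdict(set) for _ in range(BANDS)]

    def add(self, fingerprint: int):
        for i, band in enumerate(bands_of(fingerprint)):
            self.tables[i][band].add(fingerprint)

    def candidates(self, fingerprint: int):
        # Union of all fingerprints matching the query in at least one band;
        # the caller then checks full Hamming distance on this short list.
        out = set()
        for i, band in enumerate(bands_of(fingerprint)):
            out |= self.tables[i][band]
        return out
```

This turns a scan over billions of rows into four exact-match lookups plus a Hamming check over a handful of candidates.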
12. Interactive Decision Visualizer: The Bloom Filter
This simulator demonstrates how a Bloom Filter saves memory.
- Add URL: Hashes the URL to 3 positions and flips bits to 1 (Green).
- Check URL: Hashes the URL. If ALL corresponding bits are 1, it says “Probably Present”.
- Collision: Try adding “google.com” then checking “yahoo.com”. If their bits overlap, you might get a False Positive (rare with large arrays).
13. Summary (Whiteboard)
Scale & Capacity
- 1 Billion Pages/Month
- Politeness is the limiter (not CPU)
- Async DNS is critical
Core Algorithms
- Bloom Filter: Deduplication (RAM efficient)
- SimHash: Content Similarity
- Politeness Router: Queue Mapping
Storage
- BigTable/HBase: Wide Column Store
- Row Key: Reverse URL (`com.google.www`)
- Col: Content, Metadata
Reliability
- Spider Traps: Max Depth check
- Robots.txt: Respect rules
- Dynamic Content: Headless Browser