Design a Web Crawler
[!NOTE] This module explores the core principles of Design a Web Crawler, deriving solutions from first principles and hardware constraints to build world-class, production-ready expertise.
1. What is a Web Crawler?
A Web Crawler (or Spider) is a bot that systematically browses the World Wide Web, typically for the purpose of Web Indexing (Google Search). It starts with a list of “Seed URLs”, downloads the page, extracts links, and adds them to the queue.
Use Cases
- Search Engine Indexing: Google, Bing.
- Web Archiving: Wayback Machine.
- Web Mining: Financial analysis, Copyright checks, LLM Training Data.
Try it yourself: Open the `robots.txt` of any major site (e.g., `google.com/robots.txt`). See the `Disallow` rules. This is what the crawler must respect.
2. Requirements & Goals
Functional Requirements
- Crawl: Fetch content from a given URL.
- Extract: Parse HTML and find new links (`<a href>`).
- Store: Save the content for the Indexer.
Non-Functional Requirements
- Scalability: The web is huge (billions of pages).
- Politeness: DO NOT DDoS the target servers. Respect `robots.txt`.
- Robustness: Handle malformed HTML, 404s, infinite redirect loops, and spider traps.
- Extensibility: Support new content types (PDF, Images).
3. Capacity Estimation
Let’s design for 1 Billion Pages per Month.
Throughput (QPS)
- 1 Billion / 30 days / 24 hours / 3600 sec ≈ 400 Pages/sec.
- Peak traffic: 2× = 800 QPS.
- This seems low, but fetching a page is slow (DNS + Network + Parsing). We need massive concurrency (thousands of threads).
Storage
- Avg page size: 100 KB (HTML only).
- Storage = 1B × 100 KB = 100 TB / month.
- In 5 years: 6 PB. (We need BigTable / HBase).
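The estimates above can be sanity-checked with a few lines of arithmetic (the page count, average page size, and 2× peak factor are the assumptions stated above, not measurements):

```python
# Back-of-envelope capacity check for 1 billion pages/month.
PAGES_PER_MONTH = 1_000_000_000
SECONDS_PER_MONTH = 30 * 24 * 3600            # ~2.59 million seconds

qps = PAGES_PER_MONTH / SECONDS_PER_MONTH     # average crawl rate
peak_qps = 2 * qps                            # assumed 2x peak factor

AVG_PAGE_KB = 100
monthly_tb = PAGES_PER_MONTH * AVG_PAGE_KB / 1_000_000_000   # KB -> TB
five_year_pb = monthly_tb * 60 / 1000                        # TB -> PB

print(f"{qps:.0f} pages/sec, peak {peak_qps:.0f} QPS")
print(f"{monthly_tb:.0f} TB/month, {five_year_pb:.0f} PB in 5 years")
```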
4. System APIs
The Crawler is an internal system, but the components communicate via RPC (gRPC).
4.1 Frontier Service (Task Queue)
```protobuf
service Frontier {
  // Workers ask for a batch of URLs to crawl
  rpc GetCrawlTasks(WorkerID) returns (stream CrawlTask);

  // Workers report found links
  rpc ReportLinks(ReportRequest) returns (Ack);
}

message CrawlTask {
  string url = 1;
  string domain = 2;
  int32 depth = 3;
}
```
4.2 HTML Fetcher Service
POST /fetch

```json
{
  "url": "https://example.com",
  "timeout": 5000,
  "follow_redirects": true
}
```

Response:

```json
{
  "status": 200,
  "content_type": "text/html",
  "body_hash": "abc12345",
  "content": "<html>...</html>"
}
```
5. Database Design
1. URL Frontier (Queue System)
Stores URLs to be visited. Needs priority and politeness logic.
- Redis / Kafka: Fast queues (Redis is in-memory; Kafka adds a durable log).
- Architecture: Split into “Front Queues” (Priority) and “Back Queues” (Politeness).
2. Content Storage (NoSQL)
Stores the raw HTML.
- BigTable / HBase / Cassandra: Ideal for high write throughput and large data. See Module 15: Deep Dive Data.
- Schema: `RowKey: URL_Hash`, `Col: HTML_Content`, `Col: Metadata`, `Col: SimHash`.
3. Duplicate Detection
We cannot run `SELECT * FROM visited WHERE url = ?` for billions of rows.
- Bloom Filter: Probabilistic data structure to check “Have we seen this URL?”.
6. High-Level Design
Web Crawler Pipeline Architecture.
- Seed URLs: Manually curated list (cnn.com, wikipedia.org).
- URL Frontier: Distributes URLs to workers (Prioritizer & Politeness).
- HTML Downloader: Fetches pages using Async DNS.
- Content Parser: Validates HTML, extracts links.
- Duplicate Eliminator: Checks Bloom Filter and SimHash (Content Dedup).
- Storage: Saves content to BigTable/HBase.
- Loop: New links go back to Frontier.
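The pipeline above can be sketched as a single-threaded loop. A minimal sketch, assuming injected `fetch` and `store` callables, with a plain set standing in for the Bloom Filter (politeness and `robots.txt` checks omitted for brevity):

```python
from collections import deque
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags (the Content Parser step)."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links += [v for k, v in attrs if k == "href" and v]

def crawl(seed_urls, fetch, store, max_pages=1000):
    """Frontier -> Fetch -> Parse -> Store loop.
    `fetch(url) -> html` and `store(url, html)` are injected dependencies."""
    frontier = deque(seed_urls)          # URL Frontier (no politeness here)
    visited = set()                      # stand-in for the Bloom Filter
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue                     # Duplicate Eliminator
        visited.add(url)
        html = fetch(url)
        store(url, html)
        parser = LinkExtractor()
        parser.feed(html)
        frontier.extend(parser.links)    # Loop: new links back to the Frontier
    return visited
```

In the real system each stage is a separate service and the frontier is distributed; the loop shape stays the same.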
7. Component Design (Deep Dive)
The URL Frontier (Priority vs Politeness)
Visualized in the URL Frontier zone of the diagram. This is the most complex part. It has two main buffers:
- Front Queue (Prioritizer):
  - Prioritize URLs based on PageRank or change frequency.
  - High Priority Q, Medium Q, Low Q.
- Back Queue (Politeness Router):
  - The Rule: Do not hit `wikipedia.org` more than once per second.
  - We map `wikipedia.org` to `Queue #5`.
  - A “Queue Router” ensures that `Queue #5` is only consumed by one worker thread, enforcing the delay.
  - This implements a Rate Limiter. See Module 10: Simple Services (Rate Limiter).
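A minimal in-process sketch of the Back Queue idea, using a hypothetical `PolitenessRouter` class and an assumed fixed queue count (production would shard this across the Frontier service):

```python
import hashlib
import time
from collections import defaultdict, deque
from urllib.parse import urlparse

NUM_BACK_QUEUES = 8   # assumption; real systems size this to the worker count

def back_queue_for(url: str) -> int:
    """Stable domain -> queue mapping: one domain always lands in one queue."""
    domain = urlparse(url).netloc
    digest = hashlib.md5(domain.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_BACK_QUEUES

class PolitenessRouter:
    """Each back queue serves one domain group with a minimum delay between hits."""
    def __init__(self, min_delay_sec=1.0):
        self.queues = defaultdict(deque)
        self.next_allowed = defaultdict(float)   # queue id -> earliest fetch time
        self.min_delay = min_delay_sec

    def push(self, url: str):
        self.queues[back_queue_for(url)].append(url)

    def pop_ready(self, now=None):
        """Return a URL whose queue's delay has elapsed, or None."""
        now = time.monotonic() if now is None else now
        for qid, q in self.queues.items():
            if q and now >= self.next_allowed[qid]:
                self.next_allowed[qid] = now + self.min_delay
                return q.popleft()
        return None
```

Because the mapping is stable, all `wikipedia.org` URLs land in the same queue, and the delay on that queue enforces the one-hit-per-second rule.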
Robot Exclusion Standard (Ethics)
Before crawling `example.com`, we fetch `example.com/robots.txt`.
- Rules: `User-agent: *`, `Disallow: /private/`.
- Legal: While not a law, ignoring it gets you IP banned and sued.
- Implementation: We cache the `robots.txt` for every domain in a KV Store (Redis) with a TTL.
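A sketch of that cache using Python's standard `urllib.robotparser`, with the network fetch injected as a callable and an in-process dict standing in for Redis:

```python
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

class RobotsCache:
    """Caches parsed robots.txt per domain with a TTL."""
    def __init__(self, fetch_lines, ttl_sec=3600):
        # fetch_lines(domain) -> list of robots.txt lines; injected so the
        # network layer stays out of this sketch.
        self.fetch_lines = fetch_lines
        self.ttl = ttl_sec
        self.cache = {}   # domain -> (expiry_time, RobotFileParser)

    def allowed(self, url: str, user_agent="*") -> bool:
        domain = urlparse(url).netloc
        entry = self.cache.get(domain)
        if entry is None or time.monotonic() > entry[0]:
            rp = RobotFileParser()
            rp.parse(self.fetch_lines(domain))          # parse the fetched rules
            entry = (time.monotonic() + self.ttl, rp)
            self.cache[domain] = entry                  # refresh the TTL entry
        return entry[1].can_fetch(user_agent, url)
```

After the first fetch, every URL on the same domain is checked from the cache until the TTL expires.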
Bloom Filters (Mathematical Magic)
A Bloom Filter is a bit array of size m with k hash functions.
- Add Item: Hash it k times, set bits to 1.
- Check Item: Hash it k times. If all bits are 1, it might exist. If any bit is 0, it definitely does not exist.
- False Positive: We might think we visited a page when we haven’t. This is acceptable for a crawler (we just miss one update).
- Space Savings: Stores billions of URLs in RAM.
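A minimal Bloom Filter sketch, deriving the k positions by hashing the item with k different salts (a common sizing rule of thumb: ~10 bits per element with k = 7 gives roughly a 1% false-positive rate):

```python
import hashlib

class BloomFilter:
    """Bit array of size m with k hash functions, as described above."""
    def __init__(self, m_bits: int, k_hashes: int):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray(m_bits // 8 + 1)

    def _positions(self, item: str):
        # Derive k bit positions from k salted digests.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # Any 0 bit => definitely not seen; all 1 bits => probably seen.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))
```

Note the one-way contract: a `False` answer is certain, a `True` answer is only probable; that asymmetry is exactly what a crawler can tolerate.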
DNS Resolver Caching (Why Custom?)
Visualized as DNS Resolver in the diagram.
DNS lookups take 10ms - 200ms. If we crawl 100 pages from nytimes.com, we don’t want 100 DNS requests.
- The OS Problem: Standard libc `getaddrinfo` is blocking. If you have 1000 threads and 500 are waiting on DNS, your CPU is idle.
- The Solution: Build a custom Async DNS Resolver over UDP. We fire a UDP packet, register a callback, and move on. No thread blocking.
- Caching: We cache `hostname → IP` in memory.
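A full async UDP resolver is beyond a short example, but the caching layer itself is simple. A sketch with the lookup injected as a callable, so it can be swapped for an async implementation (it defaults to the blocking `socket.gethostbyname` for illustration):

```python
import socket
import time

class DNSCache:
    """In-memory hostname -> IP cache with a TTL."""
    def __init__(self, ttl_sec=300, resolve=None):
        self.ttl = ttl_sec
        self.cache = {}   # hostname -> (expiry_time, ip)
        # `resolve` is injectable for testing; defaults to a real blocking lookup.
        self.resolve = resolve or (lambda h: socket.gethostbyname(h))

    def lookup(self, hostname: str) -> str:
        now = time.monotonic()
        hit = self.cache.get(hostname)
        if hit and now < hit[0]:
            return hit[1]                        # cache hit: no network round trip
        ip = self.resolve(hostname)              # cache miss: resolve and store
        self.cache[hostname] = (now + self.ttl, ip)
        return ip
```

With this in place, crawling 100 pages from one host costs one resolution, not 100.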
Content Deduplication (SimHash)
How do we know if two pages are 99% similar (Near-Duplicate)?
- Exact Match (MD5): Useless. Changing one character (“the” → “teh”) changes the entire hash.
- SimHash Algorithm:
  - Tokenize: Break text into words/features.
  - Hash: Hash each word into 64 bits.
  - Vector Sum: Start with a vector `V = [0, 0, ..., 0]`. For each bit of the word hash: if the bit is 1, add the word's weight to `V[i]`; if 0, subtract it.
  - Fingerprint: If `V[i] > 0`, set fingerprint bit i to 1; else 0.
- Hamming Distance: If the SimHashes of Doc A and Doc B differ by only 2 or 3 bits, they are near-duplicates. We discard the new one.
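The four steps above, as a sketch (using MD5 as an arbitrary 64-bit word hash and term frequency as the weight; both choices are illustrative):

```python
import hashlib
from collections import Counter

def simhash(text: str, bits: int = 64) -> int:
    """SimHash: tokenize, hash each word, weighted vector sum, threshold."""
    v = [0] * bits
    for word, weight in Counter(text.lower().split()).items():
        digest = hashlib.md5(word.encode()).digest()
        h = int.from_bytes(digest[:8], "big")        # 64-bit word hash
        for i in range(bits):
            v[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i in range(bits):
        if v[i] > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

Because similar texts share most weighted features, their vector sums (and hence fingerprints) land close together, unlike MD5, where one changed word flips roughly half the bits.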
8. Requirements Traceability
| Requirement | Design Decision | Justification |
|---|---|---|
| Politeness | Politeness Queue / Router | Ensures we never exceed QPS limits for a target domain. |
| Throughput | Async DNS Resolver | Prevents thread blocking during DNS lookups, increasing concurrency. |
| Scalability | Bloom Filter | Tracks billions of visited URLs in RAM to prevent cycles. |
| Storage | BigTable / HBase | NoSQL Wide Column Store handles massive write throughput and PB of data. |
| Freshness | Priority Queue (PageRank) | Ensures important pages (high PageRank) are crawled more frequently. |
9. Observability & Metrics
We need to ensure the crawler isn’t stuck or banned.
Key Metrics
- Crawl Rate: Pages/sec per worker.
- DNS Latency: Avg time to resolve IP. Spike indicates DNS throttling.
- Error Rate: 4xx/5xx responses. High 403 means we are IP banned (Rotate Proxy).
- Queue Depth: If Front Queue grows but Back Queue is empty, the Router is slow.
Trap Detection
- Path Depth: Alert if `url_depth > 20`.
- Domain Concentration: Alert if 50% of traffic is hitting one domain (Spider Trap).
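Both trap signals reduce to a small check over recent crawl records; a hypothetical `trap_alerts` helper as a sketch (the thresholds are the ones named above):

```python
from collections import Counter
from urllib.parse import urlparse

MAX_DEPTH = 20
DOMAIN_SHARE_THRESHOLD = 0.5

def trap_alerts(crawled, recent_window):
    """crawled: list of (url, depth); recent_window: recently fetched URLs.
    Returns alert strings for the two trap signals described above."""
    alerts = []
    for url, depth in crawled:
        if depth > MAX_DEPTH:
            alerts.append(f"depth>{MAX_DEPTH}: {url}")
    if recent_window:
        counts = Counter(urlparse(u).netloc for u in recent_window)
        domain, hits = counts.most_common(1)[0]
        share = hits / len(recent_window)
        if share >= DOMAIN_SHARE_THRESHOLD:
            alerts.append(f"domain concentration: {domain} at {share:.0%}")
    return alerts
```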
10. Deployment Strategy
Rolling Updates
We have thousands of worker nodes.
- Drain: Stop assigning new tasks to Worker A.
- Wait: Wait for active crawls to finish (graceful shutdown).
- Update: Deploy new binary.
- Resume: Re-register with Frontier.
Resiliency
If a worker crashes:
- The Frontier detects heartbeat failure.
- The URLs assigned to that worker are re-queued to another worker.
11. Interview Gauntlet
Rapid Fire Questions
- How do you handle `robots.txt`? We cache it for every domain. Before crawling, we check the rules. We respect `User-agent: *` / `Disallow: /admin`.
- What if the page renders with JavaScript (React/Angular)? A standard HTTP GET returns near-empty HTML. We need a “Headless Browser” (Chrome/Puppeteer) service to render the DOM. This is 10x slower and more expensive.
- How do you detect infinite loops? Bloom Filters for visited URLs and Max Depth limit.
- Why not use a standard Hash Set instead of Bloom Filter? 1 Billion URLs * 100 bytes = 100 GB RAM. Bloom Filter uses ~1 GB.
- How do you search ‘SimHash’ fingerprints efficiently? We cannot scan all rows. We split the 64-bit hash into 4 16-bit bands and index each; by the Pigeonhole Principle, any fingerprint within 3 bits of the query matches at least one band exactly, so we only compare candidates from these smaller tables.
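The banding trick can be sketched as four hash tables keyed by 16-bit bands. Since 3 bit-flips can touch at most 3 of the 4 bands, any fingerprint within Hamming distance 3 shares at least one exact band with the query:

```python
from collections import defaultdict

BANDS = 4
BAND_BITS = 16   # 4 x 16 = 64-bit SimHash

def bands_of(fingerprint: int):
    """Split a 64-bit fingerprint into 4 16-bit bands."""
    return [(fingerprint >> (i * BAND_BITS)) & 0xFFFF for i in range(BANDS)]

class SimHashIndex:
    """One table per band; only fingerprints sharing a band become candidates."""
    def __init__(self):
        self.tables = [defaultdict(set) for _ in range(BANDS)]

    def add(self, fingerprint: int):
        for i, band in enumerate(bands_of(fingerprint)):
            self.tables[i][band].add(fingerprint)

    def candidates(self, fingerprint: int):
        # Union of all fingerprints matching the query in at least one band;
        # the caller then checks full Hamming distance on this short list.
        out = set()
        for i, band in enumerate(bands_of(fingerprint)):
            out |= self.tables[i][band]
        return out
```

This turns a scan over billions of rows into four exact-match lookups plus a Hamming check over a handful of candidates.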
12. Interactive Decision Visualizer: The Bloom Filter
This simulator demonstrates how a Bloom Filter saves memory.
- Add URL: Hashes the URL to 3 positions and flips bits to 1 (Green).
- Check URL: Hashes the URL. If ALL corresponding bits are 1, it says “Probably Present”.
- Collision: Try adding “google.com” then checking “yahoo.com”. If their bits overlap, you might get a False Positive (rare with large arrays).
13. Summary (Whiteboard)
Scale & Capacity
- 1 Billion Pages/Month
- Politeness is the limiter (not CPU)
- Async DNS is critical
Core Algorithms
- Bloom Filter: Deduplication (RAM efficient)
- SimHash: Content Similarity
- Politeness Router: Queue Mapping
Storage
- BigTable/HBase: Wide Column Store
- Row Key: Reverse URL (`com.google.www`)
- Col: Content, Metadata
Reliability
- Spider Traps: Max Depth check
- Robots.txt: Respect rules
- Dynamic Content: Headless Browser