Design a Web Crawler (Googlebot)

1. What is a Web Crawler?

A Web Crawler (or Spider) is a bot that systematically browses the World Wide Web, typically for Web Indexing (e.g., Google Search). It starts with a list of “Seed URLs”, downloads each page, extracts the links on it, and adds them to a queue of URLs to visit.

Use Cases

  • Search Engine Indexing: Google, Bing.
  • Web Archiving: Wayback Machine.
  • Web Mining: Financial analysis, Copyright checks, LLM Training Data.

Try it yourself: Open the robots.txt of any major site (e.g., google.com/robots.txt) and look at the Disallow rules. These are the rules the crawler must respect.
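
For example, Python's standard library can check these rules directly. A minimal sketch (the paths queried are just illustrations; the actual rules change over time):

from urllib import robotparser

# Download and parse the robots.txt of the target host.
rp = robotparser.RobotFileParser()
rp.set_url("https://www.google.com/robots.txt")
rp.read()

# Ask whether our crawler (User-agent "*") may fetch a given path.
print(rp.can_fetch("*", "https://www.google.com/search"))       # typically False (Disallow: /search)
print(rp.can_fetch("*", "https://www.google.com/maps/about"))   # depends on the current rules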


2. Requirements & Goals

Functional Requirements

  • Crawl: Fetch content from a given URL.
  • Extract: Parse HTML and find new links (<a href>).
  • Store: Save the content for the Indexer.

Non-Functional Requirements

  • Scalability: The web is huge (billions of pages, and growing).
  • Politeness: DO NOT DDoS the target servers. Respect robots.txt.
  • Robustness: Handle malformed HTML, 404s, infinite redirect loops, and spider traps.
  • Extensibility: Support new content types (PDF, Images).

3. Capacity Estimation

Let’s design for 1 Billion Pages per Month.

Throughput (QPS)

  • 1 Billion / 30 days / 24 hours / 3600 sec ≈ 400 Pages/sec.
  • Peak traffic: 2× = 800 QPS.
  • This seems low, but fetching a page is slow (DNS + Network + Parsing). We need massive concurrency (thousands of threads).

Storage

  • Avg page size: 100 KB (HTML only).
  • Storage = 1B × 100 KB = 100 TB / month.
  • In 5 years: 6 PB. (We need BigTable / HBase.) A quick back-of-envelope check follows below.
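
The same numbers checked in Python (purely illustrative arithmetic):

PAGES_PER_MONTH = 1_000_000_000
AVG_PAGE_KB = 100

qps = PAGES_PER_MONTH / (30 * 24 * 3600)            # ~386 pages/sec, rounded up to ~400
peak_qps = 2 * qps                                  # ~772 pages/sec, rounded up to ~800
monthly_tb = PAGES_PER_MONTH * AVG_PAGE_KB / 1e9    # 100 TB/month (1 TB = 1e9 KB)
five_year_pb = monthly_tb * 12 * 5 / 1000           # 6 PB over 5 years

print(round(qps), round(peak_qps), monthly_tb, five_year_pb)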

4. System APIs

The Crawler is an internal system, but its components communicate via RPC (gRPC).

4.1 Frontier Service (Task Queue)

A sketch of the gRPC interface; the message fields other than CrawlTask are illustrative.

syntax = "proto3";

service Frontier {
  // Workers ask for a batch of URLs to crawl
  rpc GetCrawlTasks(WorkerID) returns (stream CrawlTask);

  // Workers report found links
  rpc ReportLinks(ReportRequest) returns (Ack);
}

message WorkerID { string worker_id = 1; }

message CrawlTask {
  string url = 1;
  string domain = 2;
  int32 depth = 3;
}

message ReportRequest {
  string worker_id = 1;
  repeated string urls = 2;
}

message Ack { bool accepted = 1; }

4.2 HTML Fetcher Service

POST /fetch
{
  "url": "https://example.com",
  "timeout": 5000,
  "follow_redirects": true
}

Response:

{
  "status": 200,
  "content_type": "text/html",
  "body_hash": "abc12345",
  "content": "<html>...</html>"
}
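
A worker might call this internal endpoint like so; the host name and the use of the requests library are assumptions for illustration, not part of the design above:

import requests

# Hypothetical internal fetcher endpoint; not a public API.
FETCHER_URL = "http://fetcher.internal:8080/fetch"

resp = requests.post(
    FETCHER_URL,
    json={"url": "https://example.com", "timeout": 5000, "follow_redirects": True},
    timeout=10,
)
result = resp.json()
print(result["status"], result["content_type"], result["body_hash"])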

5. Database Design

1. URL Frontier (Queue System)

Stores URLs to be visited. Needs priority and politeness logic.

  • Redis / Kafka: Fast, in-memory queues.
  • Architecture: Split into “Front Queues” (Priority) and “Back Queues” (Politeness).

2. Content Storage (NoSQL)

Stores the raw HTML.

  • BigTable / HBase / Cassandra: Ideal for high write throughput and large data. See Module 15: Deep Dive Data.
  • Schema: RowKey: URL_Hash, Col: HTML_Content, Col: Metadata, Col: SimHash. (A row-key sketch follows below.)
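
The schema above keys rows by a URL hash; the summary at the end also suggests reversing the hostname so that pages from the same domain sort next to each other. A sketch combining both ideas (illustrative, not a prescribed format):

import hashlib
from urllib.parse import urlparse

def row_key(url: str) -> str:
    """Build a BigTable/HBase-style row key: reversed host + hash of the full URL."""
    host = urlparse(url).hostname or ""
    reversed_host = ".".join(reversed(host.split(".")))  # www.google.com -> com.google.www
    url_hash = hashlib.sha1(url.encode("utf-8")).hexdigest()[:16]
    return f"{reversed_host}#{url_hash}"

print(row_key("https://www.google.com/search?q=crawler"))  # com.google.www#<16 hex chars>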

3. Duplicate Detection

We cannot run SELECT * FROM visited WHERE url = ? for billions of rows.

  • Bloom Filter: Probabilistic data structure to check “Have we seen this URL?”.

6. High-Level Design

Web Crawler Pipeline Architecture.

[Architecture diagram: Seed URLs → URL Frontier (Front Queues F1..Fn with a Prioritizer based on PageRank/relevance; Back Queues B1..Bn with a Politeness Router, one queue per host) → Async DNS Resolver with a custom cache → Distributed HTML Downloaders (HTTP client, robots.txt check) → Content Parser (validation, link extraction, SimHash de-dup) → Bloom Filter / URL-seen test → BigTable/HBase (Row: URL hash; Cols: HTML content, metadata). Extracted links feed back into the Frontier.]
  1. Seed URLs: Manually curated list (cnn.com, wikipedia.org).
  2. URL Frontier: Distributes URLs to workers (Prioritizer & Politeness).
  3. HTML Downloader: Fetches pages using Async DNS.
  4. Content Parser: Validates HTML, extracts links.
  5. Duplicate Eliminator: Checks Bloom Filter and SimHash (Content Dedup).
  6. Storage: Saves content to BigTable/HBase.
  7. Loop: New links go back to the Frontier (a simplified worker loop is sketched below).
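
Putting these steps together, a single crawl worker boils down to a loop like the one below. This is a simplified in-memory sketch: fetch_html and extract_links are stand-ins for the HTML Downloader and Content Parser services, and a plain set stands in for the Bloom Filter.

import re
from collections import deque
from urllib.parse import urljoin

MAX_DEPTH = 20

def fetch_html(url):
    """Stand-in for the HTML Downloader service."""
    return '<html><a href="/about">About</a></html>'

def extract_links(base_url, html):
    """Stand-in for the Content Parser (a real parser would not use a regex)."""
    return [urljoin(base_url, href) for href in re.findall(r'href="([^"]+)"', html)]

def crawl(seeds):
    frontier = deque((url, 0) for url in seeds)  # URL Frontier (simplified)
    seen = set()                                 # stands in for the Bloom Filter

    while frontier:
        url, depth = frontier.popleft()
        if url in seen or depth > MAX_DEPTH:     # duplicate / spider-trap guard
            continue
        seen.add(url)

        html = fetch_html(url)                   # 3. HTML Downloader
        # 6. Storage: a real worker writes (row_key(url), html) to BigTable/HBase here.
        for link in extract_links(url, html):    # 4. Content Parser
            if link not in seen:                 # 5. Duplicate Eliminator
                frontier.append((link, depth + 1))  # 7. Feedback loop to the Frontier

crawl(["https://example.com"])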

7. Component Design (Deep Dive)

The URL Frontier (Priority vs Politeness)

Visualized in the URL Frontier zone of the diagram, this is the most complex part of the system. It has two main sets of queues:

  1. Front Queue (Prioritizer):
    • Prioritize URLs based on PageRank or change frequency.
    • High Priority Q, Medium Q, Low Q.
  2. Back Queue (Politeness Router):
    • The Rule: Do not hit wikipedia.org more than once per second.
    • We map wikipedia.org to Queue #5.
    • A “Queue Router” ensures that Queue #5 is only consumed by one worker thread, enforcing the delay.
    • This implements a Rate Limiter. See Module 10: Simple Services (Rate Limiter). A minimal sketch of the mapping follows this list.
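
A minimal sketch of the host-to-queue mapping with a per-host delay. The queue count, the one-second delay, and the hash-based mapping are illustrative; a real frontier keeps an explicit host → queue table as described above.

import hashlib
import time
from collections import deque
from urllib.parse import urlparse

NUM_BACK_QUEUES = 8        # illustrative; production frontiers use far more
MIN_DELAY_SECONDS = 1.0    # politeness: at most one request per host per second

back_queues = [deque() for _ in range(NUM_BACK_QUEUES)]
last_fetch_at = {}         # host -> timestamp of the last request to that host

def queue_for(url):
    """Every URL of the same host always maps to the same back queue."""
    host = urlparse(url).hostname or ""
    return int(hashlib.md5(host.encode()).hexdigest(), 16) % NUM_BACK_QUEUES

def enqueue(url):
    back_queues[queue_for(url)].append(url)

def next_from_queue(i):
    """Hand out a URL only if the host's politeness delay has elapsed."""
    if not back_queues[i]:
        return None
    url = back_queues[i][0]
    host = urlparse(url).hostname or ""
    if time.time() - last_fetch_at.get(host, 0.0) < MIN_DELAY_SECONDS:
        return None        # too soon; the worker should pick a different queue
    last_fetch_at[host] = time.time()
    return back_queues[i].popleft()

enqueue("https://en.wikipedia.org/wiki/Web_crawler")
enqueue("https://en.wikipedia.org/wiki/PageRank")
q = queue_for("https://en.wikipedia.org/")
print(next_from_queue(q))  # first URL is released
print(next_from_queue(q))  # None: the 1-second politeness delay has not elapsed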

Bloom Filters (Mathematical Magic)

A Bloom Filter is a bit array of size m with k hash functions.

  • Add Item: Hash it k times, set bits to 1.
  • Check Item: Hash it k times. If all bits are 1, it might exist. If any bit is 0, it definitely does not exist.
  • False Positive: We might think we visited a page when we haven’t. This is acceptable for a crawler (we just miss one update).
  • Space Savings: Represents membership of billions of visited URLs in roughly a gigabyte of RAM (a minimal implementation is sketched below).
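
A minimal Bloom filter sketch. The bit-array size and hash count here are tiny and purely illustrative; production filters are sized from the expected number of URLs and the target false-positive rate.

import hashlib

class BloomFilter:
    def __init__(self, m_bits=1 << 20, k_hashes=7):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray(m_bits // 8 + 1)

    def _positions(self, item):
        # Derive k bit positions by salting the hash with the function index.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False means "definitely not seen"; True means "probably seen".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

seen = BloomFilter()
seen.add("https://example.com/")
print(seen.might_contain("https://example.com/"))  # True
print(seen.might_contain("https://example.org/"))  # almost certainly False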

DNS Resolver Caching

Visualized as DNS Resolver in the diagram. DNS lookups take 10ms - 200ms. If we crawl 100 pages from nytimes.com, we don’t want 100 DNS requests.

  • Problem: Standard OS DNS (getaddrinfo) is Synchronous and blocks the thread. With thousands of threads, this becomes a major bottleneck.
  • Solution: We build a custom Asynchronous DNS Resolver (using UDP directly) and maintain a custom DNS Cache (Map: hostname -> IP) inside the crawler.
  • We refresh entries periodically based on their TTL (a simplified cache is sketched below).
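
A simplified TTL cache around a blocking lookup, to show the caching idea; a production crawler would issue the actual DNS queries asynchronously as described above. The 300-second TTL is an assumption.

import socket
import time

DNS_TTL_SECONDS = 300       # illustrative refresh interval
_dns_cache = {}             # hostname -> (ip_address, expires_at)

def resolve(hostname):
    """Return a cached IP if it is still fresh, otherwise resolve and cache it."""
    now = time.time()
    cached = _dns_cache.get(hostname)
    if cached and cached[1] > now:
        return cached[0]
    ip = socket.gethostbyname(hostname)  # blocking here; the real resolver is async
    _dns_cache[hostname] = (ip, now + DNS_TTL_SECONDS)
    return ip

print(resolve("example.com"))  # network lookup on the first call
print(resolve("example.com"))  # served from the in-process cache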

Content Deduplication (SimHash)

What if two URLs have different links but identical content?

  • Exact Match: MD5(Content). Fails if one byte differs (e.g., timestamp).
  • Near-Duplicate (SimHash): A fingerprinting algorithm where similar documents have similar hashes (small Hamming Distance). We use it to discard near-duplicate content (a sketch follows below).
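
A 64-bit SimHash sketch over whitespace-separated tokens; a real implementation would weight tokens (e.g., by TF-IDF) and use shingles rather than single words.

import hashlib

def simhash(text, bits=64):
    """Similar documents produce hashes with a small Hamming distance."""
    vector = [0] * bits
    for token in text.split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        for i in range(bits):
            vector[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if vector[i] > 0)

def hamming_distance(a, b):
    return bin(a ^ b).count("1")

doc_a = "breaking news the market rallied today after the announcement"
doc_b = "breaking news the market rallied today after the big announcement"
print(hamming_distance(simhash(doc_a), simhash(doc_b)))  # small value => near-duplicates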

8. Requirements Traceability

Requirement | Design Decision | Justification
Politeness | Politeness Queue / Router | Ensures we never exceed QPS limits for a target domain.
Throughput | Async DNS Resolver | Prevents thread blocking during DNS lookups, increasing concurrency.
Scalability | Bloom Filter | Tracks billions of visited URLs in RAM to prevent cycles.
Storage | BigTable / HBase | NoSQL Wide Column Store handles massive write throughput and PB of data.
Freshness | Priority Queue (PageRank) | Ensures important pages (high PageRank) are crawled more frequently.

9. Observability & Metrics

We need to ensure the crawler isn’t stuck or banned.

Key Metrics

  • Crawl Rate: Pages/sec per worker.
  • DNS Latency: Avg time to resolve IP. Spike indicates DNS throttling.
  • Error Rate: 4xx/5xx responses. High 403 means we are IP banned (Rotate Proxy).
  • Queue Depth: If Front Queue grows but Back Queue is empty, the Router is slow.

Trap Detection

  • Path Depth: Alert if url_depth > 20.
  • Domain Concentration: Alert if 50% of traffic is hitting one domain (Spider Trap).

10. Deployment Strategy

Rolling Updates

We have thousands of worker nodes.

  1. Drain: Stop assigning new tasks to Worker A.
  2. Wait: Wait for active crawls to finish (graceful shutdown).
  3. Update: Deploy new binary.
  4. Resume: Re-register with Frontier.

Resiliency

If a worker crashes:

  • The Frontier detects heartbeat failure.
  • The URLs assigned to that worker are re-queued to another worker.

11. Interview Gauntlet

Rapid Fire Questions

  1. How do you handle ‘robots.txt’? We cache it for every domain. Before crawling, we check the rules. We respect User-agent: * Disallow: /admin.
  2. What if the page renders with JavaScript (React/Angular)? A standard HTTP GET returns nearly empty HTML. We need a “Headless Browser” (Chrome/Puppeteer) service to render the DOM. This is roughly 10× slower and more expensive.
  3. How do you detect infinite loops? Bloom Filters for visited URLs and Max Depth limit.
  4. Why not use a standard Hash Set instead of Bloom Filter? 1 Billion URLs * 100 bytes = 100 GB RAM. Bloom Filter uses ~1 GB.
  5. How do you look up ‘SimHash’ matches efficiently? We cannot scan all rows. We split the 64-bit hash into four 16-bit blocks, each indexed in its own table; any hash within Hamming distance 3 shares at least one block exactly (Pigeonhole Principle), so we only compare against candidates from those smaller tables (see the sketch below).
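
A sketch of the block-index lookup from Question 5: store each 64-bit SimHash under its four 16-bit blocks, then compare the query only against candidates that share at least one block exactly. Table layout and constants are illustrative.

from collections import defaultdict

NUM_BLOCKS = 4
BLOCK_BITS = 16
MAX_DISTANCE = 3

# One table per 16-bit block: block_value -> set of stored hashes.
tables = [defaultdict(set) for _ in range(NUM_BLOCKS)]

def blocks(h):
    return [(h >> (i * BLOCK_BITS)) & 0xFFFF for i in range(NUM_BLOCKS)]

def index(h):
    for i, b in enumerate(blocks(h)):
        tables[i][b].add(h)

def near_duplicates(h):
    candidates = set()
    for i, b in enumerate(blocks(h)):
        candidates |= tables[i][b]  # pigeonhole: any close hash matches some block exactly
    return {c for c in candidates if bin(c ^ h).count("1") <= MAX_DISTANCE}

index(0x1234_5678_9ABC_DEF0)
print(near_duplicates(0x1234_5678_9ABC_DEF1))  # differs by 1 bit -> found
print(near_duplicates(0xFFFF_FFFF_FFFF_FFFF))  # far away -> empty set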

12. Interactive Decision Visualizer: Frontier & Deduplication

See how the URL Frontier processes incoming links.

  1. Bloom Filter: Checks if URL was already visited (Red flash).
  2. Router: Sorts URLs into queues based on Domain.
  3. Politeness: Ensures we wait before hitting the same domain again (Yellow wait state).

[Interactive widget: extracted links flow through the Bloom Filter 🔍 and Router into per-domain politeness queues with a 1 s delay (e.g., google.com, wiki.org), with counters for total processed URLs and duplicates dropped.]

13. Summary (Whiteboard)

Scale & Capacity

  • 1 Billion Pages/Month
  • Politeness is the limiter (not CPU)
  • Async DNS is critical

Core Algorithms

  • Bloom Filter: Deduplication (RAM efficient)
  • SimHash: Content Similarity
  • Politeness Router: Queue Mapping

Storage

  • BigTable/HBase: Wide Column Store
  • Row Key: Reverse URL (`com.google.www`)
  • Col: Content, Metadata

Reliability

  • Spider Traps: Max Depth check
  • Robots.txt: Respect rules
  • Dynamic Content: Headless Browser