Distributed Scheduling Systems

[!NOTE] This module explores the core principles of Distributed Scheduling Systems, deriving solutions from first principles and hardware constraints to build world-class, production-ready expertise.

1. The Hook: “Run this at 9:00 AM”

Single server: cron. Easy. Cluster of 100 servers:

  • If zero servers run it → Outage.
  • If two servers run it → Double Billing (Disaster).
  • If the server crashes during execution → Job lost.

2. The Architecture

Core Structure: A Distributed Priority Queue. Nodes: Workers that poll for jobs.

  1. Storage: DB table [Job<sub>ID</sub>, Next<sub>Run_Time</sub>, Status].
  2. The Poller: Query SELECT * FROM Jobs WHERE Next<sub>Run_Time</sub> < NOW() AND Status = 'PENDING'.
  3. The Lock: How to ensure only ONE worker grabs it?
    • UPDATE Jobs SET Status='RUNNING', Owner='Worker1' WHERE ID=X AND Status='PENDING'.
    • Atomic CAS (Check-And-Set).

3. Handling Failure: Leases

What if Worker1 grabs the job and dies? The job stays ‘RUNNING’ forever. Solution: Leases (Timeouts).

  • Worker1 sets Lease<sub>Exp</sub> = NOW() + 5min.
  • Worker must “heartbeat” to extend lease.
  • If lease expires, another worker can steal the job.

4. Implementation: Delay Queue (Java)

// Using Java's DelayQueue
class ScheduledJob implements Delayed {
  long executeTime;

  public long getDelay(TimeUnit unit) {
    return unit.convert(executeTime - System.currentTimeMillis(), TimeUnit.MILLISECONDS);
  }

  public int compareTo(Delayed other) {
    return Long.compare(this.executeTime, ((ScheduledJob)other).executeTime);
  }
}

// Worker Loop
DelayQueue<ScheduledJob> queue = new DelayQueue<>();
while (true) {
  ScheduledJob job = queue.take(); // Blocks until executeTime!
  run(job);
}

5. Deep Dive Strategy Lab: Scheduling

Intuition Through Analogy

The Restaurant Ticket Wheel. The Chef (storage) holds the tickets sorted by time. Cooks (workers) grab a ticket. If a Cook walks out the back door with a ticket and doesn’t return (crash), the Manager (Lease Monitor) re-prints the ticket after 10 mins for another cook.

Mathematical Anchor: Jitter

If 10,000 jobs are scheduled for 9:00:00 AM, and 1000 workers wake up to grab them, you ddos your DB. Solution: Add Jitter (Randomness). RunTime = 9:00 AM + random(0, 60s). This spreads the load uniformally over the minute.