Distributed Scheduling Systems
[!NOTE] This module explores the core principles of Distributed Scheduling Systems, deriving solutions from first principles and hardware constraints to build world-class, production-ready expertise.
1. The Hook: “Run this at 9:00 AM”
Single server: cron. Easy.
Cluster of 100 servers:
- If zero servers run it → Outage.
- If two servers run it → Double Billing (Disaster).
- If the server crashes during execution → Job lost.
2. The Architecture
Core Structure: A Distributed Priority Queue. Nodes: Workers that poll for jobs.
- Storage: DB table
[Job<sub>ID</sub>, Next<sub>Run_Time</sub>, Status]. - The Poller: Query
SELECT * FROM Jobs WHERE Next<sub>Run_Time</sub> < NOW() AND Status = 'PENDING'. - The Lock: How to ensure only ONE worker grabs it?
UPDATE Jobs SET Status='RUNNING', Owner='Worker1' WHERE ID=X AND Status='PENDING'.- Atomic CAS (Check-And-Set).
3. Handling Failure: Leases
What if Worker1 grabs the job and dies? The job stays ‘RUNNING’ forever. Solution: Leases (Timeouts).
- Worker1 sets
Lease<sub>Exp</sub> = NOW() + 5min. - Worker must “heartbeat” to extend lease.
- If lease expires, another worker can steal the job.
4. Implementation: Delay Queue (Java)
// Using Java's DelayQueue
class ScheduledJob implements Delayed {
long executeTime;
public long getDelay(TimeUnit unit) {
return unit.convert(executeTime - System.currentTimeMillis(), TimeUnit.MILLISECONDS);
}
public int compareTo(Delayed other) {
return Long.compare(this.executeTime, ((ScheduledJob)other).executeTime);
}
}
// Worker Loop
DelayQueue<ScheduledJob> queue = new DelayQueue<>();
while (true) {
ScheduledJob job = queue.take(); // Blocks until executeTime!
run(job);
}
5. Deep Dive Strategy Lab: Scheduling
Intuition Through Analogy
The Restaurant Ticket Wheel. The Chef (storage) holds the tickets sorted by time. Cooks (workers) grab a ticket. If a Cook walks out the back door with a ticket and doesn’t return (crash), the Manager (Lease Monitor) re-prints the ticket after 10 mins for another cook.
Mathematical Anchor: Jitter
If 10,000 jobs are scheduled for 9:00:00 AM, and 1000 workers wake up to grab them, you ddos your DB.
Solution: Add Jitter (Randomness).
RunTime = 9:00 AM + random(0, 60s).
This spreads the load uniformally over the minute.