Disk Scheduling: The Elevator Problem

[!NOTE] The Mechanical Limit: In a Hard Disk Drive (HDD), the read/write head must physically move to the correct track (seek) and then wait for the target sector to spin under it (rotation). Minimizing this physical movement is the primary performance goal for HDDs.


1. Anatomy of Latency

When an application asks for data, the time taken is: Latency = Seek Time + Rotational Latency + Transfer Time

  1. Seek Time: Moving the arm. (Slowest part: ~3-10 ms.)
  2. Rotational Latency: Waiting for the disk to spin. (Average: ~4.2 ms at 7,200 RPM, i.e. half a rotation.)
  3. Transfer Time: Actually reading the bits. (Fast, typically well under 1 ms for a small read.)

2. Scheduling Algorithms

The OS acts like an Elevator Dispatcher. If requests come for floors 10, 50, and 5, it shouldn’t go 10 → 50 → 5.

A. FCFS (First-Come, First-Served)

  • Process requests in order.
  • Problem: “Wild Swings” (1 → 100 → 2 → 99). Terrible seek time.

B. SSTF (Shortest Seek Time First)

  • Go to the closest request next.
  • Problem: Starvation. If requests keep hitting tracks 10-20, a request for track 100 will never be served.

C. SCAN (The Elevator)

  • Move continuously from 0 → 100, servicing requests. Then reverse 100 → 0.
  • Problem: Middle tracks get serviced twice as often (going up and down).

D. C-SCAN (Circular SCAN)

  • Move 0 → 100. Then immediately jump back to 0 without servicing.
  • Pros: Uniform wait time.
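The differences between these algorithms show up clearly if you total the head movement for the same request list. A minimal sketch, assuming the head starts at track 50, tracks run 0-100, and SCAN sweeps to the end of the disk before reversing (the request list and function names are illustrative):

```go
package main

import "fmt"

func abs(x int) int {
	if x < 0 {
		return -x
	}
	return x
}

// fcfs services requests strictly in arrival order.
func fcfs(head int, reqs []int) int {
	total := 0
	for _, r := range reqs {
		total += abs(r - head)
		head = r
	}
	return total
}

// sstf always picks the pending request closest to the head.
func sstf(head int, reqs []int) int {
	pending := append([]int(nil), reqs...)
	total := 0
	for len(pending) > 0 {
		best := 0
		for i, r := range pending {
			if abs(r-head) < abs(pending[best]-head) {
				best = i
			}
		}
		total += abs(pending[best] - head)
		head = pending[best]
		pending = append(pending[:best], pending[best+1:]...)
	}
	return total
}

// scan sweeps up to maxTrack, then reverses down to the lowest request.
func scan(head, maxTrack int, reqs []int) int {
	lowest := head
	for _, r := range reqs {
		if r < lowest {
			lowest = r
		}
	}
	return (maxTrack - head) + (maxTrack - lowest)
}

func main() {
	head := 50
	reqs := []int{95, 10, 60, 20, 80}
	fmt.Println("FCFS:", fcfs(head, reqs)) // 280 tracks
	fmt.Println("SSTF:", sstf(head, reqs)) // 130 tracks
	fmt.Println("SCAN:", scan(head, 100, reqs)) // 140 tracks
}
```

FCFS pays dearly for its wild swings; SSTF wins on this particular workload but has no defense against starvation, while SCAN stays close behind with predictable behavior.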

3. Interactive: The Elevator Visualizer

Be the scheduler. Add requests and see how the head moves.


4. The Modern Era: SSDs and NVMe

Wait… SSDs don’t have moving heads. Why do we still have schedulers?

Actually, for SSDs we often disable complex scheduling entirely (using the noop scheduler in the legacy block layer, or none under blk-mq). Why?

  1. Seek is Free: Reading address 0 or 1000000 takes the same time.
  2. Internal Parallelism: SSDs have multiple NAND chips. We want to send requests in parallel, not serialize them into a SCAN line.

Multi-Queue Block Layer (blk-mq)

Modern Linux uses blk-mq. Instead of one global request queue (which requires a global lock), there is one software queue per CPU core.

  • Requests are pushed to the software queue.
  • Batched and sent to the NVMe hardware queues.
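The per-core-queue idea can be sketched as a toy model: each "CPU" submits to its own queue with no shared lock, and a worker drains each queue into a common hardware dispatch queue in batches. All names here are illustrative; this is not the kernel API.

```go
package main

import (
	"fmt"
	"sync"
)

type Request struct {
	CPU   int
	Block int
}

// runModel submits perCPU requests on each of numCPUs per-core software
// queues, batches them toward a hardware queue, and returns the total
// number of requests dispatched.
func runModel(numCPUs, perCPU, batchSize int) int {
	hardQueue := make(chan []Request, numCPUs*perCPU) // "hardware" queue
	var wg sync.WaitGroup

	for cpu := 0; cpu < numCPUs; cpu++ {
		// One software queue per CPU: submissions on different
		// cores never contend on a single global lock.
		soft := make(chan Request, perCPU)
		for i := 0; i < perCPU; i++ {
			soft <- Request{CPU: cpu, Block: cpu*100 + i}
		}
		close(soft)

		wg.Add(1)
		go func(q chan Request) {
			defer wg.Done()
			var batch []Request
			for req := range q {
				batch = append(batch, req)
				if len(batch) == batchSize {
					hardQueue <- batch // full batch to hardware
					batch = nil
				}
			}
			if len(batch) > 0 {
				hardQueue <- batch // flush the remainder
			}
		}(soft)
	}

	wg.Wait()
	close(hardQueue)

	total := 0
	for batch := range hardQueue {
		total += len(batch)
	}
	return total
}

func main() {
	fmt.Println("requests dispatched:", runModel(4, 3, 2)) // prints 12
}
```

The real blk-mq is far more involved (tagging, per-hardware-context mapping, plugging), but the shape is the same: lock-free-ish submission per core, batched dispatch toward the device.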

io_uring: Asynchronous I/O Done Right

Standard read() blocks the calling thread, and Linux's older native aio interface was limited (it only worked reliably with unbuffered O_DIRECT I/O). io_uring (introduced in Linux 5.1) uses two ring buffers in memory shared between the application and the kernel:

  1. Submission Queue (SQ): The app pushes requests.
  2. Completion Queue (CQ): The kernel pushes results.

In polling mode, this allows I/O with zero syscall overhead: both sides just read and write shared memory.
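The two-ring idea is easy to model in miniature: fixed-size circular buffers where submitting is just "write an entry, bump a tail index." This is a conceptual sketch, not the real io_uring API; the type and method names are invented for illustration.

```go
package main

import "fmt"

const ringSize = 8 // power of two, so index & (ringSize-1) wraps cheaply

// SQE is a submission queue entry; CQE is a completion queue entry.
type SQE struct {
	Op    string
	Block int
}
type CQE struct {
	Block  int
	Result int
}

type Ring struct {
	sq             [ringSize]SQE
	cq             [ringSize]CQE
	sqHead, sqTail uint32
	cqHead, cqTail uint32
}

// Submit pushes a request with no syscall: a write plus a tail bump.
func (r *Ring) Submit(e SQE) bool {
	if r.sqTail-r.sqHead == ringSize {
		return false // submission ring is full
	}
	r.sq[r.sqTail&(ringSize-1)] = e
	r.sqTail++
	return true
}

// kernelPoll plays the kernel's role: drain the SQ, post completions.
func (r *Ring) kernelPoll() {
	for r.sqHead != r.sqTail {
		e := r.sq[r.sqHead&(ringSize-1)]
		r.sqHead++
		// "Perform" the I/O; fake result = 512 bytes read.
		r.cq[r.cqTail&(ringSize-1)] = CQE{Block: e.Block, Result: 512}
		r.cqTail++
	}
}

// Reap pops one completion if any is available.
func (r *Ring) Reap() (CQE, bool) {
	if r.cqHead == r.cqTail {
		return CQE{}, false
	}
	c := r.cq[r.cqHead&(ringSize-1)]
	r.cqHead++
	return c, true
}

func main() {
	var ring Ring
	ring.Submit(SQE{Op: "read", Block: 10})
	ring.Submit(SQE{Op: "read", Block: 99})
	ring.kernelPoll()
	for c, ok := ring.Reap(); ok; c, ok = ring.Reap() {
		fmt.Printf("block %d -> %d bytes\n", c.Block, c.Result)
	}
}
```

In the real interface the rings live in memory mmap'd between the process and the kernel, which is what makes the no-syscall fast path possible.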

5. Code Example: Request Merging Simulation (Go)

The OS tries to merge adjacent requests to reduce overhead. Read sectors 1-2 and 3-4? Merge into Read 1-4.

```go
package main

import (
	"fmt"
	"sort"
)

type IORequest struct {
	StartBlock int
	Count      int
}

func main() {
	// Incoming requests: [10..12], [100..101], [12..14]
	queue := []IORequest{
		{10, 2},  // Blocks 10, 11
		{100, 1}, // Block 100
		{12, 2},  // Blocks 12, 13
	}

	fmt.Println("Before Merge:", queue)
	merged := mergeRequests(queue)
	fmt.Println("After Merge: ", merged)
}

func mergeRequests(reqs []IORequest) []IORequest {
	if len(reqs) == 0 {
		return nil
	}

	// 1. Sort by StartBlock
	sort.Slice(reqs, func(i, j int) bool {
		return reqs[i].StartBlock < reqs[j].StartBlock
	})

	var result []IORequest
	current := reqs[0]

	for i := 1; i < len(reqs); i++ {
		next := reqs[i]
		// Check if next starts where current ends
		if current.StartBlock+current.Count == next.StartBlock {
			// Merge!
			current.Count += next.Count
		} else {
			result = append(result, current)
			current = next
		}
	}
	result = append(result, current)
	return result
}
```