Distributed OS
[!NOTE] This module covers the core principles of distributed operating systems, deriving each mechanism from first principles and the constraints of real networks.
1. The Network is the Computer
A Distributed Operating System manages a collection of independent computers and makes them appear to the users as a single coherent system.
The key challenge is Transparency: the user should not know (or care) which machine executes their code or stores their files.
Fallacies of Distributed Computing
- The network is reliable. (It fails often).
- Latency is zero. (Speed of light is finite).
- Bandwidth is infinite. (Congestion happens).
- The network is secure. (Packets can be sniffed).
2. Ordering Events (Logical Clocks)
In a single OS, we use the CPU clock to order events. In a distributed system, physical clocks drift. You cannot trust time.now().
Lamport Logical Clocks
Instead of absolute time, we use a counter to determine Causality (the Happens-Before relation →).
- Every process maintains a counter C.
- On a local event: C = C + 1.
- On sending a message: include C in the message.
- On receiving a message: C = max(C_local, C_received) + 1.
3. Interactive: Lamport Clock Simulator
Send messages between nodes and see how the logical clock synchronizes.
4. Remote Procedure Calls (RPC)
RPC is the fundamental building block of distributed systems (used in gRPC, Thrift, Avro). It allows a program to call a function on another computer as if it were a local function.
The Flow
- Client Stub: Packs arguments into a message (Marshalling).
- Network: Transmits the message (TCP/UDP).
- Server Stub: Unpacks arguments (Unmarshalling).
- Execution: Server runs the function.
- Return: Result is marshalled back to client.
Code Example: Building a Simple RPC
package main

import (
    "fmt"
    "log"
    "net"
    "net/http"
    "net/rpc"
)

// 1. Define the Arguments
type Args struct {
    A, B int
}

// 2. Define the Service
type Arith int

func (t *Arith) Multiply(args *Args, reply *int) error {
    *reply = args.A * args.B
    return nil
}

func main() {
    // SERVER: register the service and serve RPC over HTTP.
    arith := new(Arith)
    rpc.Register(arith)
    rpc.HandleHTTP()
    l, err := net.Listen("tcp", ":1234")
    if err != nil {
        log.Fatal("listen error:", err)
    }
    go http.Serve(l, nil)

    // CLIENT: connect to the server started above.
    client, err := rpc.DialHTTP("tcp", "localhost:1234")
    if err != nil {
        log.Fatal("dial error:", err)
    }
    args := &Args{7, 8}
    var reply int
    // Remote Call: looks local, but crosses the network.
    err = client.Call("Arith.Multiply", args, &reply)
    if err != nil {
        fmt.Println("RPC error:", err)
    } else {
        fmt.Printf("Result: %d * %d = %d\n", args.A, args.B, reply)
    }
}
// Java uses RMI (Remote Method Invocation) or gRPC.
// This is a conceptual example of the "Stub" pattern.
import java.rmi.Naming;
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.server.UnicastRemoteObject;

// 1. The Interface (Contract)
interface Calculator extends Remote {
    int add(int a, int b) throws RemoteException;
}

// 2. The Server Implementation
class CalculatorImpl extends UnicastRemoteObject implements Calculator {
    protected CalculatorImpl() throws RemoteException { super(); }
    public int add(int a, int b) { return a + b; }
    public static void main(String[] args) {
        try {
            // Start an RMI registry on the default port, then bind the stub.
            LocateRegistry.createRegistry(1099);
            Naming.rebind("CalcService", new CalculatorImpl());
            System.out.println("Server Ready");
        } catch (Exception e) { e.printStackTrace(); }
    }
}

// 3. The Client
class Client {
    public static void main(String[] args) {
        try {
            // Lookup the Stub (Proxy)
            Calculator stub = (Calculator) Naming.lookup("rmi://localhost/CalcService");
            // Call looks local, but goes over the network!
            int result = stub.add(5, 10);
            System.out.println("Result: " + result);
        } catch (Exception e) { e.printStackTrace(); }
    }
}
5. Consensus (Paxos & Raft)
How do multiple nodes agree on a single value (e.g., who is the Leader)?
- Split Brain: If the network is cut in half, both sides might think they are the leader.
- Quorum: To make a decision, you need agreement from N/2 + 1 nodes (a majority).
- Raft: A consensus algorithm designed to be understandable. It uses Leader Election and Log Replication.
[!WARNING] CAP Theorem: In a distributed system, you can only guarantee two of the three:
- Consistency (Everyone sees the same data).
- Availability (System always responds).
- Partition Tolerance (System survives network cuts).
Since networks always have partitions (P), you must choose between CP (Consistency) or AP (Availability).
6. Summary
- Transparency: Hiding the network from the user.
- Logical Clocks: Ordering events without relying on physical time.
- RPC: Calling functions across the network.
- Consensus: The hard problem of agreement in an unreliable world.