Distributed OS
[!NOTE] This module covers the core principles of distributed operating systems, deriving each mechanism from first principles and the constraints of real networks.
1. The Network is the Computer
A Distributed Operating System manages a collection of independent computers and makes them appear to the users as a single coherent system.
The key challenge is Transparency: the user should not know (or care) which machine executes their code or stores their files.
Fallacies of Distributed Computing
- The network is reliable. (It fails often).
- Latency is zero. (Speed of light is finite).
- Bandwidth is infinite. (Congestion happens).
- The network is secure. (Packets can be sniffed).
2. Ordering Events (Logical Clocks)
In a single OS, we use the CPU clock to order events. In a distributed system, physical clocks drift. You cannot trust time.now().
Lamport Logical Clocks
Instead of absolute time, we use a counter to determine Causality (the Happens-Before relation →).
- Every process maintains a counter C.
- On a local event: C = C + 1.
- On sending a message: include C in the message.
- On receiving a message: C = max(C_local, C_received) + 1.
3. Interactive: Lamport Clock Simulator
Send messages between nodes and see how the logical clock synchronizes.
4. Remote Procedure Calls (RPC)
RPC is the fundamental building block of distributed systems (used in gRPC, Thrift, Avro). It allows a program to call a function on another computer as if it were a local function.
The Flow
- Client Stub: Packs arguments into a message (Marshalling).
- Network: Transmits the message (TCP/UDP).
- Server Stub: Unpacks arguments (Unmarshalling).
- Execution: Server runs the function.
- Return: Result is marshalled back to client.
Code Example: Building a Simple RPC
package main

import (
    "fmt"
    "log"
    "net"
    "net/http"
    "net/rpc"
)

// 1. Define the Arguments
type Args struct {
    A, B int
}

// 2. Define the Service
type Arith int

func (t *Arith) Multiply(args *Args, reply *int) error {
    *reply = args.A * args.B
    return nil
}

func main() {
    // SERVER: register the service and serve RPC over HTTP.
    arith := new(Arith)
    rpc.Register(arith)
    rpc.HandleHTTP()
    l, err := net.Listen("tcp", ":1234")
    if err != nil {
        log.Fatal("listen error:", err)
    }
    go http.Serve(l, nil)

    // CLIENT: connect to the server started above.
    client, err := rpc.DialHTTP("tcp", "localhost:1234")
    if err != nil {
        log.Fatal("dial error:", err)
    }
    args := &Args{7, 8}
    var reply int
    // Remote Call: looks local, but crosses the network.
    err = client.Call("Arith.Multiply", args, &reply)
    if err != nil {
        fmt.Println("RPC error:", err)
    } else {
        fmt.Printf("Result: %d * %d = %d\n", args.A, args.B, reply)
    }
}
// Java uses RMI (Remote Method Invocation) or gRPC.
// This is a conceptual example of the "Stub" pattern.
import java.rmi.Naming;
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.server.UnicastRemoteObject;

// 1. The Interface (Contract)
interface Calculator extends Remote {
    int add(int a, int b) throws RemoteException;
}

// 2. The Server Implementation
class CalculatorImpl extends UnicastRemoteObject implements Calculator {
    protected CalculatorImpl() throws RemoteException { super(); }
    public int add(int a, int b) { return a + b; }
    public static void main(String[] args) {
        try {
            // Start an RMI registry on the default port, then bind the stub.
            LocateRegistry.createRegistry(1099);
            Naming.rebind("CalcService", new CalculatorImpl());
            System.out.println("Server Ready");
        } catch (Exception e) { e.printStackTrace(); }
    }
}

// 3. The Client
class Client {
    public static void main(String[] args) {
        try {
            // Lookup the Stub (Proxy)
            Calculator stub = (Calculator) Naming.lookup("rmi://localhost/CalcService");
            // Call looks local, but goes over the network!
            int result = stub.add(5, 10);
            System.out.println("Result: " + result);
        } catch (Exception e) { e.printStackTrace(); }
    }
}
5. Consensus (Paxos & Raft)
How do multiple nodes agree on a single value (e.g., who is the Leader)?
- Split Brain: If the network is cut in half, both sides might think they are the leader.
- Quorum: To make a decision, you need agreement from N/2 + 1 nodes (a majority).
- Raft: A consensus algorithm designed to be understandable. It uses Leader Election and Log Replication.
[!WARNING] CAP Theorem: In a distributed system, you can only guarantee two of the three:
- Consistency (Everyone sees the same data).
- Availability (System always responds).
- Partition Tolerance (System survives network cuts).
Since networks always have partitions (P), you must choose between CP (Consistency) or AP (Availability).
6. Summary
- Transparency: Hiding the network from the user.
- Logical Clocks: Ordering events without relying on physical time.
- RPC: Calling functions across the network.
- Consensus: The hard problem of agreement in an unreliable world.