Containers

[!NOTE] This module explores the core principles of Containers, deriving solutions from first principles and hardware constraints to build world-class, production-ready expertise.

1. Isolation Without Emulation

A Virtual Machine (VM) virtualizes the hardware. A hypervisor presents virtual CPUs, memory, and devices, and each VM boots its own full Guest Kernel on top of them. This is heavy: typically GBs of RAM and tens of seconds to minutes to boot.

A Container virtualizes the Operating System. All containers share the single Host Kernel, so there is no second kernel to boot. This is lightweight: typically MBs of overhead and a start time well under a second.

To the kernel, a container is just a normal Linux process that has been:

  1. Lied to about what it can see (Namespaces).
  2. Limited in what it can use (Cgroups).
  3. Tricked into thinking it has its own filesystem (UnionFS).

2. Namespaces (The Lie)

Namespaces partition global system resources. Every process belongs to one namespace of each type and sees only that namespace's instance of the resource, so from the inside it appears to have the system to itself.

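| Namespace | What it Isolates | Result |
|-----------|------------------|--------|
| PID | Process IDs | The container's first process sees itself as PID 1. It cannot see host processes. |
| NET | Network stack | The container gets its own interfaces (e.g., eth0), IP addresses, routing table, and loopback. |
| MNT | Mount points | The container has its own /, /proc, /tmp. |
| UTS | Hostname and domain name | The container can have a different hostname than the host. |
| IPC | System V IPC and POSIX message queues | Cannot attach to shared memory segments or message queues of other containers. |
| USER | User and group IDs | Root (UID 0) inside the container maps to an unprivileged UID outside (e.g., 100000). IDs with no mapping appear as nobody (UID 65534). |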
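
One way to observe namespace membership from user space: on Linux, every process exposes its namespaces as symbolic links under /proc/<pid>/ns, with targets such as pid:[4026531836]. Two processes share a namespace exactly when the corresponding link targets match. The sketch below is a minimal illustration, not how any particular runtime does it; the helper name parseNsLink is made up for this example, and the program needs a Linux /proc to print anything useful.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// parseNsLink splits a namespace link target such as "pid:[4026531836]"
// into its type ("pid") and inode number (4026531836).
// The function name is illustrative, not part of any library.
func parseNsLink(target string) (string, uint64, error) {
	kind, rest, ok := strings.Cut(target, ":[")
	if !ok || !strings.HasSuffix(rest, "]") {
		return "", 0, fmt.Errorf("unexpected namespace link %q", target)
	}
	inode, err := strconv.ParseUint(strings.TrimSuffix(rest, "]"), 10, 64)
	if err != nil {
		return "", 0, fmt.Errorf("bad inode in %q: %w", target, err)
	}
	return kind, inode, nil
}

func main() {
	const nsDir = "/proc/self/ns"
	entries, err := os.ReadDir(nsDir)
	if err != nil {
		fmt.Println("cannot read", nsDir, "(Linux only):", err)
		return
	}
	for _, e := range entries {
		target, err := os.Readlink(filepath.Join(nsDir, e.Name()))
		if err != nil {
			continue // skip entries we are not allowed to read
		}
		kind, inode, err := parseNsLink(target)
		if err != nil {
			continue
		}
		fmt.Printf("%-20s type=%-8s inode=%d\n", e.Name(), kind, inode)
	}
}
```

Running this once on the host and once inside a container should show different inode numbers for each namespace type the container has unshared.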

3. Cgroups (The Limit)

If Namespaces are about visibility, Control Groups (Cgroups) are about resource limits and accounting. Without Cgroups, a single container could consume all of the CPU or RAM, starving other workloads or causing the Host Kernel's OOM killer to kill critical system services (like SSH).
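
To see which cgroup a process is accounted to, Linux exposes /proc/<pid>/cgroup. On a cgroup v2 system the relevant entry has the form 0::<path>. The sketch below parses that entry; the helper name unifiedCgroupPath is made up for this example, and /sys/fs/cgroup is assumed to be the cgroup v2 mount point, which is conventional but not guaranteed.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// unifiedCgroupPath scans the contents of a /proc/<pid>/cgroup file and
// returns the cgroup v2 path, taken from the entry of the form "0::<path>".
// Entries for cgroup v1 hierarchies (non-zero ID, named controllers) are skipped.
func unifiedCgroupPath(contents string) (string, bool) {
	sc := bufio.NewScanner(strings.NewReader(contents))
	for sc.Scan() {
		parts := strings.SplitN(sc.Text(), ":", 3)
		if len(parts) == 3 && parts[0] == "0" && parts[1] == "" {
			return parts[2], true
		}
	}
	return "", false
}

func main() {
	data, err := os.ReadFile("/proc/self/cgroup")
	if err != nil {
		fmt.Println("cannot read /proc/self/cgroup (Linux only):", err)
		return
	}
	path, ok := unifiedCgroupPath(string(data))
	if !ok {
		fmt.Println("no cgroup v2 entry found")
		return
	}
	fmt.Println("cgroup v2 path:", path)
	// Assumption: cgroup v2 is mounted at /sys/fs/cgroup.
	fmt.Println("control files under: /sys/fs/cgroup" + path)
}
```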

Cgroups v2 Features:

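  • cpu.max: Limit CPU time as a quota per period, both in microseconds (e.g., "50000 100000" is 50% of one core).
  • memory.max: Hard limit on RAM usage (e.g., 512MB). If the cgroup reaches the limit and the kernel cannot reclaim enough memory, the OOM killer kills a process inside the container.
  • io.max: Throttle disk read/write bandwidth and IOPS per block device.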

4. UnionFS / OverlayFS (The Trick)

Containers start quickly because they don’t copy the image's filesystem for each container. They rely on Copy-on-Write (CoW).

An image is composed of read-only layers. When a container starts, OverlayFS adds a thin Upper Layer (Read-Write) on top.
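
The layer stack is passed to the kernel as mount options: lowerdir lists the read-only layers separated by colons with the topmost layer first, upperdir is the writable layer, and workdir is a scratch directory on the same filesystem as upperdir. The sketch below only builds that option string; the helper name overlayOptions and all paths are made up for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// overlayOptions builds the option string for an overlay mount.
// layers lists the read-only image layers from bottom (base) to top.
// OverlayFS expects lowerdir ordered top-first, so the order is reversed.
func overlayOptions(layers []string, upper, work string) string {
	topFirst := make([]string, len(layers))
	for i, layer := range layers {
		topFirst[len(layers)-1-i] = layer
	}
	return fmt.Sprintf("lowerdir=%s,upperdir=%s,workdir=%s",
		strings.Join(topFirst, ":"), upper, work)
}

func main() {
	// Hypothetical layer and container directories.
	opts := overlayOptions(
		[]string{"/layers/base", "/layers/app"},
		"/ctr/upper", "/ctr/work",
	)
	fmt.Println(opts)
	// The string would be used as, for example:
	//   mount -t overlay overlay -o <opts> /ctr/merged
}
```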

  • Read: Look in Upper. If missing, look in Lower (Image).
  • Write: On the first write, copy the file from Lower to Upper (a "copy-up"), then modify the copy (CoW). The Lower layer is never changed.
  • Delete: Create a “Whiteout” file in Upper that hides the file in Lower.
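
The three rules above can be modeled in a few lines. The sketch below is a toy in-memory simulation using maps, not the kernel's implementation; the type overlay and its methods are invented for this example.

```go
package main

import "fmt"

// overlay models a merged view: a read-only lower layer plus a writable upper layer.
type overlay struct {
	lower    map[string]string // image content, never modified
	upper    map[string]string // this container's private changes
	whiteout map[string]bool   // paths deleted in this container
}

func newOverlay(lower map[string]string) *overlay {
	return &overlay{lower: lower, upper: map[string]string{}, whiteout: map[string]bool{}}
}

// Read looks in upper first, then falls back to lower, unless a whiteout hides the path.
func (o *overlay) Read(path string) (string, bool) {
	if o.whiteout[path] {
		return "", false
	}
	if v, ok := o.upper[path]; ok {
		return v, true
	}
	v, ok := o.lower[path]
	return v, ok
}

// Append copies the file up from lower on first write, then modifies the copy.
func (o *overlay) Append(path, data string) {
	if _, inUpper := o.upper[path]; !inUpper && !o.whiteout[path] {
		o.upper[path] = o.lower[path] // copy-up (empty if the file is new)
	}
	delete(o.whiteout, path)
	o.upper[path] += data
}

// Delete removes any upper copy and records a whiteout if lower has the file.
func (o *overlay) Delete(path string) {
	delete(o.upper, path)
	if _, inLower := o.lower[path]; inLower {
		o.whiteout[path] = true
	}
}

func main() {
	image := map[string]string{"/etc/motd": "hello\n"}
	c := newOverlay(image)

	c.Append("/etc/motd", "world\n")
	merged, _ := c.Read("/etc/motd")
	fmt.Printf("merged view: %q, image layer: %q\n", merged, image["/etc/motd"])

	c.Delete("/etc/motd")
	_, found := c.Read("/etc/motd")
	fmt.Printf("visible after delete: %v, still in image: %q\n", found, image["/etc/motd"])
}
```

Because the lower map is never written to, any number of overlay values can share one image, which is the property that lets many containers start from the same layers.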

5. Code Example: Creating a Container

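How does Docker actually create a container? Docker hands the job to a low-level runtime (runc), which ultimately asks the kernel for new namespaces through the clone() and unshare() syscalls with specific flags. The program below makes the same kind of clone() call directly.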

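Go

```go
// Linux only; must be run as root.
// usage: go run main.go run /bin/bash
package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

func main() {
	if len(os.Args) < 3 {
		fmt.Fprintln(os.Stderr, "usage: main run <command> [args...]")
		os.Exit(1)
	}
	switch os.Args[1] {
	case "run":
		run()
	case "child":
		child()
	default:
		panic("bad command")
	}
}

func run() {
	// 1. Prepare the command to re-execute this binary as "child"
	cmd := exec.Command("/proc/self/exe", append([]string{"child"}, os.Args[2:]...)...)

	// 2. Set the isolation flags (Namespaces)
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWUTS | // own hostname
			syscall.CLONE_NEWPID | // own PID numbering
			syscall.CLONE_NEWNS, // own mount table
		// Makes Go remount / as private in the child, so mounts made
		// there do not propagate back to the host.
		Unshareflags: syscall.CLONE_NEWNS,
	}

	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr

	// 3. Start the isolated process and wait for it to exit
	if err := cmd.Run(); err != nil {
		fmt.Println("Error:", err)
		os.Exit(1)
	}
}

func child() {
	// We are now inside the new namespaces, so os.Getpid() reports 1.
	fmt.Printf("Running %v as PID %d\n", os.Args[2:], os.Getpid())

	// 4. Set Hostname (Proof of UTS Namespace)
	must(syscall.Sethostname([]byte("container-demo")))

	// 5. Chroot (The Jail). The path is a placeholder for an unpacked root filesystem.
	must(syscall.Chroot("/path/to/rootfs"))
	must(syscall.Chdir("/"))

	// 6. Mount proc (Proof of PID Namespace isolation)
	must(syscall.Mount("proc", "/proc", "proc", 0, ""))
	defer syscall.Unmount("/proc", 0)

	// 7. Execute the user command (/bin/bash)
	cmd := exec.Command(os.Args[2], os.Args[3:]...)
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		fmt.Println("Error:", err)
	}
}

// must aborts the child if a setup step fails, instead of continuing half-isolated.
func must(err error) {
	if err != nil {
		panic(err)
	}
}
```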
Java

```java
// Java has no standard API for raw syscalls like clone(CLONE_NEWPID).
// It relies on JNI or external tools (runc, docker).
// Here we delegate the namespace setup to the Linux "unshare" utility
// and only manage the resulting process. Requires Linux and root.

import java.io.IOException;

public class ContainerRuntime {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Wrap a command that does the namespace work for us:
        // "unshare" creates the namespaces, then runs the given program.

        ProcessBuilder pb = new ProcessBuilder(
            "unshare", // Linux utility
            "--pid",   // New PID NS
            "--net",   // New Network NS
            "--mount", // New Mount NS
            "--fork",  // Fork so the program runs as a child inside the new PID NS
            "python3", "app.py" // The command to run inside
        );

        // Connect the child's stdin/stdout/stderr to ours
        pb.inheritIO();

        System.out.println("Starting containerized process...");
        Process p = pb.start();

        int exitCode = p.waitFor();
        System.out.println("Container exited with code " + exitCode);
    }
}
```

[!IMPORTANT] Security Warning: Standard containers are not hard security boundaries. Because every container shares the Host Kernel, a single kernel vulnerability can allow a container to escape to the host. For hard isolation, use Sandboxed Containers like gVisor (User-space kernel) or Kata Containers (MicroVMs).


6. Summary

  • Namespaces: Provide Isolation (What you can see).
  • Cgroups: Provide Limitation (What you can use).
  • OverlayFS: Provides Efficiency (Copy-on-Write).
  • OCI Runtime Spec: The standard for how a container is configured and run. runc is the reference implementation that interacts with the kernel to spawn containers.