Containers
[!NOTE] This module explores the core principles of Containers, deriving solutions from first principles and hardware constraints to build world-class, production-ready expertise.
1. Isolation Without Emulation
A Virtual Machine (VM) virtualizes the Hardware. A hypervisor presents virtual CPUs, memory, and devices, and each VM boots its own full Guest Kernel on top of them. This is heavy (GBs of RAM, tens of seconds to minutes to boot).
A Container virtualizes the Operating System. All containers share the single Host Kernel. This is lightweight (MBs of RAM, typically well under a second to start).
To the kernel, a container is just a normal Linux process that has been:
- Lied to about what it can see (Namespaces).
- Limited in what it can use (Cgroups).
- Tricked into thinking it has its own filesystem (UnionFS).
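
One way to check the "just a normal process" claim is to look at the namespace handles the kernel exposes for every process under /proc/<pid>/ns. The sketch below lists them for the current process. It assumes Linux with /proc mounted, and the helper name `parseNsLink` is made up for this example.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// parseNsLink splits a namespace link target such as "pid:[4026531836]"
// into its type ("pid") and identifier ("4026531836").
func parseNsLink(target string) (nsType, id string, ok bool) {
	name, rest, found := strings.Cut(target, ":[")
	if !found || name == "" || !strings.HasSuffix(rest, "]") {
		return "", "", false
	}
	return name, strings.TrimSuffix(rest, "]"), true
}

func main() {
	// Each entry in /proc/self/ns is a symlink naming one namespace of this process.
	links, err := filepath.Glob("/proc/self/ns/*")
	if err != nil || len(links) == 0 {
		fmt.Println("no namespace entries found (is this Linux with /proc mounted?)")
		return
	}
	for _, link := range links {
		target, err := os.Readlink(link)
		if err != nil {
			continue // skip entries we are not allowed to read
		}
		if nsType, id, ok := parseNsLink(target); ok {
			fmt.Printf("%-18s %s\n", nsType, id)
		}
	}
}
```

Two processes that print the same identifier for a given type share that namespace. Running this on the host and again inside a container should show different identifiers for the namespaces the container runtime created.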
2. Namespaces (The Lie)
Namespaces partition global system resources. A process inside a namespace sees only the resources that belong to that namespace, so from its point of view the rest of the system does not exist.
| Namespace | What it Isolates | Result |
|---|---|---|
| PID | Process IDs | The container sees itself as PID 1. It cannot see host processes. |
| NET | Network Stack | The container gets its own eth0, IP, and Loopback. |
| MNT | Mount Points | The container has its own /, /proc, /tmp. |
| UTS | Hostname | The container can have a different hostname than the host. |
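| IPC | System V IPC, POSIX message queues | The container cannot reach the shared memory segments, semaphores, or message queues of other containers. |
| USER | User and Group IDs | Root (UID 0) inside the container maps to an unprivileged UID on the host (e.g., 100000). IDs without a mapping appear as nobody (UID 65534). |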
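
For the USER row, the kernel defines the translation between container and host IDs in a per-process uid_map file, where each line reads "inside outside length". The sketch below parses that format and translates a few IDs. The mapping `0 100000 65536` is a hypothetical example, and the type and function names are made up for illustration.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// idRange is one uid_map line: `length` consecutive IDs starting at `inside`
// in the namespace correspond to IDs starting at `outside` on the host.
type idRange struct {
	inside, outside, length int
}

// parseIDMap parses uid_map-style text ("inside outside length" per line).
func parseIDMap(text string) ([]idRange, error) {
	var ranges []idRange
	for _, line := range strings.Split(strings.TrimSpace(text), "\n") {
		fields := strings.Fields(line)
		if len(fields) != 3 {
			return nil, fmt.Errorf("malformed map line %q", line)
		}
		var nums [3]int
		for i, f := range fields {
			n, err := strconv.Atoi(f)
			if err != nil {
				return nil, err
			}
			nums[i] = n
		}
		ranges = append(ranges, idRange{nums[0], nums[1], nums[2]})
	}
	return ranges, nil
}

// hostID translates an ID seen inside the namespace to the host's view.
// The second result is false when the ID has no mapping.
func hostID(ranges []idRange, inside int) (int, bool) {
	for _, r := range ranges {
		if inside >= r.inside && inside < r.inside+r.length {
			return r.outside + (inside - r.inside), true
		}
	}
	return 0, false
}

func main() {
	// Hypothetical mapping: 65536 container IDs starting at host UID 100000.
	ranges, err := parseIDMap("0 100000 65536")
	if err != nil {
		panic(err)
	}
	for _, uid := range []int{0, 1000, 70000} {
		if host, ok := hostID(ranges, uid); ok {
			fmt.Printf("container UID %d -> host UID %d\n", uid, host)
		} else {
			fmt.Printf("container UID %d -> unmapped\n", uid)
		}
	}
}
```

Under this mapping, root inside the container is UID 100000 on the host, an account with no special privileges there. An ID outside the mapped range has no host equivalent, which is when the kernel reports the overflow ID 65534.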
3. Cgroups (The Limit)
If Namespaces are about visibility, Control Groups (Cgroups) are about resource accounting and limits. Without Cgroups, a single container could consume all of the host's CPU or RAM, starving critical system services (like SSH) or pushing the Host Kernel's OOM killer into terminating them.
Cgroups v2 Features:
- cpu.max: Limit CPU time (e.g., 50% of one core).
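- memory.max: Limit RAM usage (e.g., 512MB). If usage reaches the limit and memory cannot be reclaimed → OOM Kill of a process inside the container's cgroup.
- io.max: Throttle Disk read/write IOPS and bandwidth, per block device.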
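
The three control files above take plain-text values, so applying a limit amounts to writing a short string into a file in the cgroup's directory. The sketch below formats those strings and writes them. To stay runnable without root it writes into a temporary stand-in directory, not a real cgroup; the device numbers `8:0`, the 1000 IOPS figures, and all function names are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"math"
	"os"
	"path/filepath"
)

// cpuMax formats a cpu.max value: "<quota> <period>", both in microseconds.
// cores=0.5 with a 100000us period means 50% of one core.
func cpuMax(cores float64, periodUs int) string {
	quota := int(math.Round(cores * float64(periodUs)))
	return fmt.Sprintf("%d %d", quota, periodUs)
}

// memoryMax formats a memory.max value in bytes from a size in MiB.
func memoryMax(mib int) string {
	return fmt.Sprintf("%d", mib*1024*1024)
}

// ioMax formats one io.max line for the block device major:minor.
func ioMax(major, minor, readIOPS, writeIOPS int) string {
	return fmt.Sprintf("%d:%d riops=%d wiops=%d", major, minor, readIOPS, writeIOPS)
}

// writeLimit writes one control value into the given cgroup directory.
func writeLimit(dir, file, value string) error {
	return os.WriteFile(filepath.Join(dir, file), []byte(value+"\n"), 0o644)
}

func main() {
	// Stand-in directory so the sketch runs anywhere. On a real host this would
	// be a cgroup v2 directory such as /sys/fs/cgroup/demo (hypothetical name).
	dir, err := os.MkdirTemp("", "cgroup-demo")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	limits := [][2]string{
		{"cpu.max", cpuMax(0.5, 100000)},
		{"memory.max", memoryMax(512)},
		{"io.max", ioMax(8, 0, 1000, 1000)},
	}
	for _, l := range limits {
		if err := writeLimit(dir, l[0], l[1]); err != nil {
			panic(err)
		}
		fmt.Printf("%-10s <- %s\n", l[0], l[1])
	}
}
```

On a real system a process joins the group when its PID is written to the cgroup.procs file in the same directory. That step is left out here because it needs a real cgroup and the right privileges.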
4. UnionFS / OverlayFS (The Trick)
Containers start instantly because they don’t copy the whole OS. They rely on Copy-on-Write (CoW).
An image is composed of read-only layers. When a container starts, OverlayFS adds a thin Upper Layer (Read-Write) on top.
- Read: Look in Upper. If missing, look in Lower (Image).
- Write: On the first write to a file that exists only in Lower, copy the whole file to Upper, then modify the copy (CoW). Lower is never changed.
- Delete: Create a “Whiteout” file in Upper that hides the file in Lower.
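
The three rules above can be sketched as a small in-memory model. This simulates the lookup logic only and is not how the kernel implements OverlayFS: layers are plain maps from path to contents, and every type and method name is invented for the example.

```go
package main

import "fmt"

// overlay models one merged view: a read-only lower layer (the image)
// and a writable upper layer (the container). Values are file contents.
type overlay struct {
	lower    map[string]string
	upper    map[string]string
	whiteout map[string]bool // lower-layer paths hidden by a delete
}

func newOverlay(lower map[string]string) *overlay {
	return &overlay{lower: lower, upper: map[string]string{}, whiteout: map[string]bool{}}
}

// read honors whiteouts first, then looks in upper, then falls back to lower.
func (o *overlay) read(path string) (string, bool) {
	if o.whiteout[path] {
		return "", false
	}
	if data, ok := o.upper[path]; ok {
		return data, true
	}
	data, ok := o.lower[path]
	return data, ok
}

// appendTo modifies a file. If the file is not yet in upper, it is first
// copied up from lower (or created empty), so lower is never touched.
func (o *overlay) appendTo(path, extra string) {
	if _, inUpper := o.upper[path]; !inUpper {
		base := ""
		if !o.whiteout[path] {
			base = o.lower[path] // copy-up; "" when the file is new
		}
		o.upper[path] = base
		delete(o.whiteout, path)
	}
	o.upper[path] += extra
}

// remove drops any upper copy and, if lower has the file, records a whiteout.
func (o *overlay) remove(path string) {
	delete(o.upper, path)
	if _, inLower := o.lower[path]; inLower {
		o.whiteout[path] = true
	}
}

func main() {
	image := map[string]string{"/etc/motd": "hello\n", "/bin/tool": "v1"}
	fs := newOverlay(image)

	fs.appendTo("/etc/motd", "from container\n")
	fs.remove("/bin/tool")

	motd, _ := fs.read("/etc/motd")
	_, toolVisible := fs.read("/bin/tool")
	fmt.Printf("merged /etc/motd: %q\n", motd)
	fmt.Printf("image  /etc/motd: %q\n", image["/etc/motd"])
	fmt.Printf("/bin/tool visible: %v\n", toolVisible)
}
```

Because the image map is never written to, any number of `overlay` values can share it. This is the same reason many containers can start from one image without copying it.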
5. Code Example: Creating a Container
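How does Docker actually create a container? Under the hood, its low-level runtime (runc) asks the kernel for new namespaces through syscalls such as clone() and unshare() with specific flags. The Go program below does the same in miniature.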
package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

// Usage (Linux only, requires root): go run main.go run /bin/bash
func main() {
	if len(os.Args) < 3 {
		fmt.Fprintln(os.Stderr, "usage: main run <command> [args...]")
		os.Exit(1)
	}
	switch os.Args[1] {
	case "run":
		run()
	case "child":
		child()
	default:
		panic("bad command")
	}
}

func run() {
	// 1. Prepare the command to re-execute this binary as "child"
	cmd := exec.Command("/proc/self/exe", append([]string{"child"}, os.Args[2:]...)...)
	// 2. Set the isolation flags (Namespaces)
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWUTS |
			syscall.CLONE_NEWPID |
			syscall.CLONE_NEWNS,
	}
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	// 3. Start the isolated process and wait for it to exit
	if err := cmd.Run(); err != nil {
		fmt.Println("Error:", err)
		os.Exit(1)
	}
}

func child() {
	// We are now inside the new namespaces, so Getpid reports 1.
	fmt.Printf("Running %v as PID %d\n", os.Args[2:], os.Getpid())
	// 4. Set Hostname (Proof of UTS Namespace)
	must(syscall.Sethostname([]byte("container-demo")))
	// 5. Chroot (The Jail). Replace the placeholder with an unpacked root filesystem.
	must(syscall.Chroot("/path/to/rootfs"))
	must(syscall.Chdir("/"))
	// 6. Mount proc (Proof of PID Namespace isolation)
	must(syscall.Mount("proc", "proc", "proc", 0, ""))
	defer syscall.Unmount("proc", 0)
	// 7. Execute the user command (/bin/bash)
	cmd := exec.Command(os.Args[2], os.Args[3:]...)
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		fmt.Println("Error:", err)
	}
}

// must aborts on a failed setup step instead of continuing without isolation.
func must(err error) {
	if err != nil {
		panic(err)
	}
}

// Java cannot make raw syscalls like clone(CLONE_NEWPID) directly.
// It relies on JNI or external tools (runc, docker).
// However, we can simulate the process creation logic.
import java.io.IOException;

public class ContainerRuntime {
    public static void main(String[] args) throws IOException, InterruptedException {
        // In Java, we typically wrap a command that does the namespace work
        // e.g., calling "unshare" (a linux util to create namespaces)
        ProcessBuilder pb = new ProcessBuilder(
            "unshare",          // Linux utility
            "--pid",            // New PID NS
            "--net",            // New Network NS
            "--mount",          // New Mount NS
            "--fork",           // Fork the process
            "python3", "app.py" // The command to run inside
        );
        // Inherit IO
        pb.inheritIO();
        System.out.println("Starting containerized process...");
        Process p = pb.start();
        int exitCode = p.waitFor();
        System.out.println("Container exited with code " + exitCode);
    }
}
[!IMPORTANT] Security Warning: Standard containers are not hard security boundaries. Because every container shares the Host Kernel, a single kernel vulnerability can let a container escape to the host. For hard isolation, use Sandboxed Containers like gVisor (User-space kernel) or Kata Containers (MicroVMs).
6. Summary
- Namespaces: Provide Isolation (What you can see).
- Cgroups: Provide Limitation (What you can use).
- OverlayFS: Provides Efficiency (Copy-on-Write).
- OCI Runtime: The standard spec. runc is the reference implementation that interacts with the kernel to spawn containers.