Virtualization

[!NOTE] This module explores the core principles of Virtualization, deriving solutions from first principles and hardware constraints to build world-class, production-ready expertise.

1. The Machine in the Machine

Virtualization is the art of deceiving an Operating System into believing it owns the hardware, when in reality, it is just a guest in a hotel managed by the Hypervisor (Virtual Machine Monitor - VMM).

It is the foundation of Infrastructure-as-a-Service clouds (AWS EC2, Google Compute Engine): efficient virtualization is what lets a provider safely share one physical server among many tenants.

The Problem: Privilege

An OS kernel is designed to run in Ring 0 (Kernel Mode) with full control over the CPU.

  • It expects to manipulate the Page Tables (CR3 register).
  • It expects to handle Interrupts.
  • It expects to execute privileged instructions (HLT, LGDT).

If you run multiple OSes on one CPU, they can’t all have real Ring 0 control. If Guest A were allowed to disable interrupts on the physical CPU, it could stall Guest B and the Host.
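
The classic answer to this problem is trap-and-emulate: the guest runs deprivileged, a privileged instruction traps to the VMM, and the VMM applies its effect to that guest's private virtual state only. The sketch below is a toy model of that idea; the `Guest` and `VMM` types and the `Trap` method are hypothetical names, not taken from any real hypervisor.

```go
package main

import "fmt"

// Guest holds the virtual CPU state the VMM keeps for one guest.
// (Illustrative type, not from a real hypervisor.)
type Guest struct {
    Name              string
    InterruptsEnabled bool // virtual interrupt flag, private to this guest
}

// VMM owns the real hardware state.
type VMM struct {
    HostInterruptsEnabled bool
}

// Trap is called when a deprivileged guest attempts a privileged
// instruction. The effect is applied to the guest's virtual state only.
func (v *VMM) Trap(g *Guest, instr string) string {
    switch instr {
    case "CLI": // guest asks to disable interrupts
        g.InterruptsEnabled = false
        return "emulated"
    case "STI": // guest asks to enable interrupts
        g.InterruptsEnabled = true
        return "emulated"
    default:
        return "unsupported"
    }
}

func main() {
    vmm := &VMM{HostInterruptsEnabled: true}
    a := &Guest{Name: "A", InterruptsEnabled: true}
    b := &Guest{Name: "B", InterruptsEnabled: true}

    vmm.Trap(a, "CLI") // Guest A "disables interrupts"

    // Only Guest A's virtual flag changed; Guest B and the host are unaffected.
    fmt.Println("A:", a.InterruptsEnabled, "B:", b.InterruptsEnabled,
        "host:", vmm.HostInterruptsEnabled)
}
```

Guest A sees interrupts as disabled, while Guest B and the physical CPU keep running normally.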


2. Types of Hypervisors

Type 1: Bare Metal (Native)

The Hypervisor is the Operating System. It boots directly on the hardware.

  • Examples: VMware ESXi, Xen, Microsoft Hyper-V.
  • Performance: High. The VMM controls the hardware directly, with no Host OS in the I/O path.
  • Use Case: Enterprise Data Centers, Cloud Providers.

Type 2: Hosted

The Hypervisor runs as a software application on top of a standard Host OS.

  • Examples: VirtualBox, VMware Workstation, QEMU.
  • Performance: Lower. I/O requests must pass through the Guest OS → VMM → Host OS → Hardware.
  • Use Case: Developers testing code, running Linux on Windows.

[!NOTE] KVM (Kernel-based Virtual Machine) is a hybrid. It turns the Linux Kernel into a Type 1 Hypervisor using a kernel module, allowing it to spawn VMs as regular processes.


3. Hardware Assist (Intel VT-x / AMD-V)

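In the old days (pre-2005), virtualization was done via Binary Translation (rewriting Guest OS code on the fly to replace privileged instructions). Rewriting was necessary because some sensitive x86 instructions, such as POPF, do not trap when run outside Ring 0; they silently behave differently, so the Hypervisor could not simply catch them. This was slow and complex.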

Modern CPUs solve this with Hardware Assist.

  • VMX Root Mode: Where the Hypervisor runs (Ring 0).
  • VMX Non-Root Mode: Where the Guest OS runs.

Crucially, the Guest OS really does run in Ring 0, but it is the Ring 0 of Non-Root Mode: a constrained “Guest Ring 0” in which the CPU intercepts sensitive operations on behalf of the Hypervisor.
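
On Linux, a quick way to see whether a CPU offers hardware assist is to look for the `vmx` (Intel VT-x) or `svm` (AMD-V) flag in `/proc/cpuinfo`. The sketch below assumes a Linux host; `hwVirtSupport` is a hypothetical helper name, and it only reports what the kernel exposes (the feature can still be disabled in firmware).

```go
package main

import (
    "fmt"
    "os"
    "strings"
)

// hwVirtSupport scans /proc/cpuinfo-style text and reports which
// hardware-assist flag appears on a "flags" line, if any.
func hwVirtSupport(cpuinfo string) string {
    for _, line := range strings.Split(cpuinfo, "\n") {
        if !strings.HasPrefix(line, "flags") {
            continue
        }
        for _, f := range strings.Fields(line) {
            switch f {
            case "vmx":
                return "Intel VT-x"
            case "svm":
                return "AMD-V"
            }
        }
    }
    return "none"
}

func main() {
    data, err := os.ReadFile("/proc/cpuinfo")
    if err != nil {
        fmt.Println("cannot read /proc/cpuinfo:", err)
        return
    }
    fmt.Println("Hardware assist:", hwVirtSupport(string(data)))
}
```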

The VM Exit

When a Guest does something that must be intercepted (such as executing CPUID or accessing an emulated device's I/O port), the CPU triggers a VM Exit. Without EPT (Section 5), Guest page-table changes are intercepted as well.

  1. CPU pauses the Guest (Non-Root Mode).
  2. Context switches to the Hypervisor (Root Mode).
  3. Hypervisor handles the request (emulates the hardware).
  4. Hypervisor executes VMRESUME to switch back to the Guest (VM Entry).
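
The four steps above can be modeled in a few lines. This is a toy sketch, not real VMX code: the `VMCS` struct here is a heavily simplified stand-in for the per-VCPU control structure, and `vmExit`, `handleExit`, and `vmResume` are hypothetical helper names. It uses CPUID as the trapping instruction because CPUID always causes a VM Exit in Non-Root Mode.

```go
package main

import "fmt"

// Regs is a tiny stand-in for CPU register state.
type Regs struct{ RIP, RAX uint64 }

// VMCS mimics the idea of the per-VCPU control structure: it stores the
// guest state on exit and records why the exit happened. (Simplified.)
type VMCS struct {
    GuestState Regs
    ExitReason string
}

// vmExit models steps 1-2: pause the guest, save its registers,
// and record the exit reason for the hypervisor.
func vmExit(vmcs *VMCS, cpu Regs, reason string) {
    vmcs.GuestState = cpu
    vmcs.ExitReason = reason
}

// handleExit models step 3: emulate the instruction on the saved state.
func handleExit(vmcs *VMCS) {
    switch vmcs.ExitReason {
    case "CPUID":
        vmcs.GuestState.RAX = 0x1234 // hypothetical value the VMM reports
        vmcs.GuestState.RIP += 2     // skip the 2-byte CPUID instruction
    }
}

// vmResume models step 4: reload the guest state onto the CPU.
func vmResume(vmcs *VMCS) Regs { return vmcs.GuestState }

func main() {
    vmcs := &VMCS{}
    cpu := Regs{RIP: 0x1000} // guest is about to execute CPUID

    vmExit(vmcs, cpu, "CPUID")
    handleExit(vmcs)
    cpu = vmResume(vmcs)

    fmt.Printf("resumed at RIP=%#x RAX=%#x\n", cpu.RIP, cpu.RAX)
    // prints: resumed at RIP=0x1002 RAX=0x1234
}
```

Advancing RIP in the handler matters: without it, the guest would re-execute the same instruction and exit again forever.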

4. Ring Transition Diagram

The transition between Guest Mode and Host Mode (Root):

  • VMX Root Mode (Host): Hypervisor (KVM). Handles exits, emulates I/O.
  • VMX Non-Root Mode: Guest OS. Executes instructions, thinks it's Ring 0.
  • VM Entry → switches from Root to Non-Root; ← VM Exit switches back.

5. Memory Virtualization (EPT / SLAT)

The Guest OS thinks it manages physical memory using its own Page Tables.

  • Guest Virtual Address (GVA) → Guest Physical Address (GPA).

But the Hardware uses Host Physical Addresses (HPA). Before hardware support, the Hypervisor had to maintain Shadow Page Tables (mapping GVA → HPA directly) in software. Every time the Guest changed its page table, the Hypervisor had to trap and update the shadow table. This was incredibly expensive.
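
A shadow page table is just the composition of two mappings, rebuilt whenever the Guest edits its own table. The sketch below models page tables as flat maps of page numbers, which is an assumption made for brevity (real tables are multi-level trees); `PageMap` and `buildShadow` are hypothetical names.

```go
package main

import "fmt"

// PageMap maps a page number in one address space to a page number in another.
type PageMap map[uint64]uint64

// buildShadow composes the guest's table (GVA -> GPA) with the
// hypervisor's own mapping (GPA -> HPA) into one GVA -> HPA table
// that the hardware MMU can use directly.
func buildShadow(guestPT, gpaToHPA PageMap) PageMap {
    shadow := PageMap{}
    for gva, gpa := range guestPT {
        if hpa, ok := gpaToHPA[gpa]; ok {
            shadow[gva] = hpa
        }
    }
    return shadow
}

func main() {
    guestPT := PageMap{0x10: 0x1, 0x11: 0x2}             // maintained by the guest
    gpaToHPA := PageMap{0x1: 0x80, 0x2: 0x81, 0x3: 0x95} // maintained by the hypervisor

    shadow := buildShadow(guestPT, gpaToHPA)
    fmt.Printf("GVA page 0x10 -> HPA page %#x\n", shadow[0x10]) // 0x80

    // The guest remaps a page. The write must trap so the hypervisor
    // can bring the shadow table back in sync.
    guestPT[0x10] = 0x3
    shadow = buildShadow(guestPT, gpaToHPA)
    fmt.Printf("GVA page 0x10 -> HPA page %#x\n", shadow[0x10]) // 0x95
}
```

Every guest page-table write triggers the resync path, which is where the cost comes from.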

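Solution: Extended Page Tables (EPT) / SLAT

With Second Level Address Translation (Intel EPT, AMD NPT), the CPU hardware walks two layers of page tables:

  1. Guest CR3: GVA → GPA.
  2. Host EPT Pointer: GPA → HPA.

This eliminates the need for Shadow Page Tables: the Guest can edit its own page tables and handle its own page faults without a VM Exit. Only a miss in the second layer (an EPT violation) exits to the Hypervisor.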
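
The two-layer walk can be sketched as two chained lookups. As a simplification, each layer is a flat map from page number to frame number with 4 KiB pages (real hardware walks multi-level trees for both layers); `translate` and `nestedTranslate` are hypothetical names.

```go
package main

import "fmt"

const pageShift = 12 // 4 KiB pages

// translate looks up one layer: page number -> frame number,
// keeping the offset within the page unchanged.
func translate(table map[uint64]uint64, addr uint64) (uint64, bool) {
    frame, ok := table[addr>>pageShift]
    if !ok {
        return 0, false
    }
    offset := addr & ((1 << pageShift) - 1)
    return (frame << pageShift) | offset, true
}

// nestedTranslate performs GVA -> GPA (guest table) then GPA -> HPA (EPT).
// The status string says which layer, if any, missed.
func nestedTranslate(guestPT, ept map[uint64]uint64, gva uint64) (uint64, string) {
    gpa, ok := translate(guestPT, gva)
    if !ok {
        return 0, "guest page fault" // delivered to the guest, no VM Exit
    }
    hpa, ok := translate(ept, gpa)
    if !ok {
        return 0, "EPT violation" // VM Exit to the hypervisor
    }
    return hpa, "ok"
}

func main() {
    guestPT := map[uint64]uint64{0x400: 0x10} // owned by the guest
    ept := map[uint64]uint64{0x10: 0x7f}      // owned by the hypervisor

    hpa, status := nestedTranslate(guestPT, ept, 0x400abc)
    fmt.Printf("%s: HPA=%#x\n", status, hpa) // ok: HPA=0x7fabc
}
```

The guest can change `guestPT` freely; the hypervisor only gets involved when the second lookup misses.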

6. Code Example: The Hypervisor Loop

How does a Hypervisor actually work in code? It’s essentially an infinite loop that runs the CPU until it exits.

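Go
package main

import (
    "fmt"
    "log"
    "syscall"
    "unsafe"
)

// ioctl request numbers and exit reasons from <linux/kvm.h>.
const (
    KVM_CREATE_VM          = 0xAE01
    KVM_GET_VCPU_MMAP_SIZE = 0xAE04
    KVM_CREATE_VCPU        = 0xAE41
    KVM_RUN                = 0xAE80

    KVM_EXIT_IO  = 2
    KVM_EXIT_HLT = 5
)

// ioctl wraps the raw syscall and turns a non-zero errno into an error.
func ioctl(fd, req, arg uintptr) (uintptr, error) {
    r, _, errno := syscall.Syscall(syscall.SYS_IOCTL, fd, req, arg)
    if errno != 0 {
        return 0, errno
    }
    return r, nil
}

// Conceptual KVM interaction in Go (Linux only).
// A real VMM must also load guest code and set registers before KVM_RUN.
func main() {
    // 1. Open KVM device
    kvm, err := syscall.Open("/dev/kvm", syscall.O_RDWR, 0)
    if err != nil {
        log.Fatalf("open /dev/kvm: %v", err)
    }

    // 2. Create a Virtual Machine
    vmFd, err := ioctl(uintptr(kvm), KVM_CREATE_VM, 0)
    if err != nil {
        log.Fatalf("KVM_CREATE_VM: %v", err)
    }

    // 3. Create a VCPU (Virtual CPU)
    vcpuFd, err := ioctl(vmFd, KVM_CREATE_VCPU, 0)
    if err != nil {
        log.Fatalf("KVM_CREATE_VCPU: %v", err)
    }

    // 4. Map memory for the VM (Guest RAM)
    // ... mmap() + KVM_SET_USER_MEMORY_REGION logic here ...

    // Map the kvm_run struct that the kernel shares with us for this VCPU.
    size, err := ioctl(uintptr(kvm), KVM_GET_VCPU_MMAP_SIZE, 0)
    if err != nil {
        log.Fatalf("KVM_GET_VCPU_MMAP_SIZE: %v", err)
    }
    run, err := syscall.Mmap(int(vcpuFd), 0, int(size),
        syscall.PROT_READ|syscall.PROT_WRITE, syscall.MAP_SHARED)
    if err != nil {
        log.Fatalf("mmap kvm_run: %v", err)
    }

    fmt.Println("Starting VCPU Loop...")

    for {
        // 5. Run the VCPU (Enters VMX Non-Root Mode)
        // This blocks until a VM Exit occurs
        if _, err := ioctl(vcpuFd, KVM_RUN, 0); err != nil {
            log.Fatalf("KVM_RUN: %v", err)
        }

        // 6. Handle VM Exit: exit_reason is a uint32 at byte offset 8 of kvm_run
        reason := *(*uint32)(unsafe.Pointer(&run[8]))
        switch reason {
        case KVM_EXIT_IO:
            fmt.Println("VM Exit: I/O (emulate the device here), resuming...")
        case KVM_EXIT_HLT:
            fmt.Println("VM Exit: guest halted")
            return
        default:
            log.Fatalf("unhandled exit reason %d", reason)
        }
    }
}

Java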
import java.util.List;

public class HypervisorSimulation {

    // Simulating the CPU State
    static class VCPU {
        int[] registers = new int[4]; // EAX, EBX, etc.
        boolean running = true;
        int pc = 0; // index of the next instruction to fetch
        final List<Instruction> program;

        VCPU(List<Instruction> program) { this.program = program; }

        // Runs guest code until an instruction needs the Hypervisor.
        void run() {
            while (running) {
                // Fetch decode execute...
                Instruction instr = fetch();

                if (instr.isPrivileged()) {
                    // TRAP! Return control to Hypervisor
                    throw new VmExitException(instr.op);
                }
                execute(instr);
            }
        }

        // Mock methods
        Instruction fetch() { return program.get(pc++); }
        void execute(Instruction i) { registers[0]++; } // counts executed instructions
    }

    public static void main(String[] args) {
        // A tiny guest program; it must end with HLT so the loop terminates.
        VCPU guestCpu = new VCPU(List.of(
            new Instruction("MOV"), new Instruction("OUT"),
            new Instruction("MOV"), new Instruction("HLT")));
        System.out.println("Hypervisor: Starting Guest...");

        while (guestCpu.running) {
            try {
                // "VM Entry" - Switch to Guest Mode
                guestCpu.run();
            } catch (VmExitException e) {
                // "VM Exit" - Handle the Trap
                System.out.println("VM EXIT: " + e.getMessage());
                handlePrivileged(guestCpu, e.getMessage());
            }
        }
    }

    static void handlePrivileged(VCPU cpu, String op) {
        // Emulate the instruction or terminate
        if (op.equals("HLT")) {
            System.out.println("Hypervisor: Guest halted, stopping VCPU.");
            cpu.running = false;
        } else {
            System.out.println("Hypervisor: Emulating " + op + "...");
        }
    }
}

class VmExitException extends RuntimeException {
    public VmExitException(String reason) { super(reason); }
}

class Instruction {
    final String op;
    public Instruction(String op) { this.op = op; }
    // Mock: HLT and port I/O (OUT) are the instructions that must trap
    public boolean isPrivileged() { return op.equals("HLT") || op.equals("OUT"); }
}

[!TIP] Virtio: Instead of emulating a real network card (Intel e1000) which is slow (lots of VM Exits for every packet), we use Paravirtualization. The Guest OS knows it’s a VM and uses a special “Virtio” driver to talk directly to the Hypervisor using shared memory rings, bypassing expensive emulation.
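
The benefit of a shared ring is batching: the Guest queues many packets in shared memory and notifies the Hypervisor once. The sketch below is a deliberately simplified model, not the real Virtio layout (actual split virtqueues use a descriptor table plus separate available and used rings); `Ring`, `Push`, and `Drain` are hypothetical names.

```go
package main

import "fmt"

// Ring is a simplified queue living in memory shared by guest and host.
type Ring struct {
    buf        [8][]byte
    head, tail int // guest writes at head, host reads from tail
}

// Push is the guest side: place a packet in shared memory.
// This is a plain memory write, so it causes no VM Exit.
func (r *Ring) Push(pkt []byte) bool {
    if r.head-r.tail == len(r.buf) {
        return false // ring full
    }
    r.buf[r.head%len(r.buf)] = pkt
    r.head++
    return true
}

// Drain is the host side: called once per guest notification ("kick"),
// it consumes every packet queued so far.
func (r *Ring) Drain() [][]byte {
    var out [][]byte
    for r.tail < r.head {
        out = append(out, r.buf[r.tail%len(r.buf)])
        r.tail++
    }
    return out
}

func main() {
    ring := &Ring{}
    exits := 0

    for i := 0; i < 5; i++ {
        ring.Push([]byte{byte(i)}) // no exit per packet
    }
    exits++ // one notification for the whole batch
    pkts := ring.Drain()

    fmt.Printf("%d packets delivered with %d VM Exit\n", len(pkts), exits)
}
```

An emulated e1000 would instead take several exits per packet, one for each device register access the driver performs.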


7. Summary

  • Type 1 vs Type 2: Bare metal vs Hosted.
  • Hardware Assist: VMX Root (Hypervisor) and VMX Non-Root (Guest) modes eliminate binary translation.
  • VM Exit: The mechanism for the CPU to trap to the Hypervisor when the Guest tries to touch hardware.
  • EPT: Hardware-accelerated memory translation (GVA → GPA → HPA).