Load Balancing

Load balancing is the entryway to scalable distributed systems. At its core, it answers a simple question: How do we distribute traffic across multiple servers to ensure reliability and performance?

In this post, we’ll move beyond the basics (“just round robin it”) and explore the architecture of modern load balancers, from high-performance L4 switching to context-aware L7 proxies.


1. Visualizing Round Robin

Before diving into code, let’s visualize the simplest strategy. Round Robin works like dealing cards to players: each server receives the next request in turn, and the dealer loops back to the start.

[Diagram: a traffic source dealing requests to servers S1, S2, and S3 in rotation]
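
Below is a minimal Go sketch of that rotation (an illustration, not a production balancer): an atomic counter modulo the backend count picks the next server, which also makes it safe for concurrent callers.

package main

import (
	"fmt"
	"sync/atomic"
)

// RoundRobin cycles through a fixed list of backends.
type RoundRobin struct {
	backends []string
	counter  uint64
}

// Next returns the next backend in rotation; safe for concurrent callers.
func (rr *RoundRobin) Next() string {
	n := atomic.AddUint64(&rr.counter, 1)
	return rr.backends[(n-1)%uint64(len(rr.backends))]
}

func main() {
	rr := &RoundRobin{backends: []string{"S1", "S2", "S3"}}
	for i := 0; i < 6; i++ {
		fmt.Print(rr.Next(), " ") // S1 S2 S3 S1 S2 S3
	}
}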

2. The Landscape: L4 vs L7

The single most important distinction is the layer at which balancing happens.

Layer 4 (Transport Layer)

  • Scope: Connection-level (IP + Port).
  • Performance: Extremely fast (often implemented in kernel/hardware).
  • Connection Draining: Abrupt. When an L4 LB removes a backend, the TCP connection is severed. This can cause errors for long-lived connections (DBs, WebSockets).
  • Examples: IPVS, Maglev (Google), Katran (Meta).

Layer 7 (Application Layer)

  • Scope: Request-level (HTTP Headers, URI, Cookies).
  • Flexibility: Context-aware routing (e.g., /api/v2 -> Canary Cluster; see the sketch after this list).
  • Connection Draining: Graceful. The LB can stop sending new requests to a backend but allow in-flight requests to complete.
  • Examples: Nginx, HAProxy, Envoy.
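
To make “context-aware routing” concrete, here is a toy L7 router built on Go’s standard-library httputil.ReverseProxy. The backend addresses and the /api/v2 canary rule are hypothetical:

package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"strings"
)

func main() {
	// Hypothetical backends: a stable cluster and a canary cluster.
	stable, _ := url.Parse("http://stable.internal:8080")
	canary, _ := url.Parse("http://canary.internal:8080")

	stableProxy := httputil.NewSingleHostReverseProxy(stable)
	canaryProxy := httputil.NewSingleHostReverseProxy(canary)

	// The L7 decision: inspect the request path before choosing a backend.
	router := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if strings.HasPrefix(r.URL.Path, "/api/v2") {
			canaryProxy.ServeHTTP(w, r) // /api/v2 -> Canary Cluster
			return
		}
		stableProxy.ServeHTTP(w, r)
	})

	log.Fatal(http.ListenAndServe(":8000", router))
}

Because the proxy terminates the client’s connection and sees whole HTTP requests, it can also do the graceful draining described above: stop routing new requests to a backend while letting in-flight ones finish.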

3. L4 Data Plane Modes

A Staff engineer doesn’t just ask “L4 vs L7?”; they ask “Which L4 mode?” How the packet actually gets to the server (and back) determines your throughput ceiling.

| Mode | Short for | Return Path | Staff Insight |
|------|-----------|-------------|---------------|
| NAT | Network Address Translation | Through LB | Standard. Easiest to debug, but the LB is a bottleneck. |
| DSR | Direct Server Return | Directly to client | High performance. The server responds directly to the user, bypassing the LB on the return path. Essential for video/media. |
| TUN | IP Tunneling | Directly to client | Used for cross-site/cross-DC balancing. The LB wraps the packet in another IP header. |

[!TIP] DSR (Direct Server Return) is the secret to Google/Netflix efficiency. Since response traffic is usually 10x-100x larger than request traffic, removing the LB from the response path allows a single LB to handle massive traffic.
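
As a back-of-envelope example, assume a 100:1 response-to-request byte ratio and a 10 Gbps NIC on the LB (both numbers illustrative):

  • NAT mode: the LB carries both directions, so its NIC caps total client-facing throughput at roughly 10 Gbps.
  • DSR mode: the LB carries only the request direction (~1% of the bytes), so the same NIC can front on the order of 1 Tbps of response traffic served directly by the backends.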


4. BGP Anycast: Edge Balancing

How does the user find your Load Balancer in the first place?

Anycast allows multiple machines (in different cities) to share the same IP address. The internet’s routing protocol (BGP) automatically sends the user to the “closest” instance.

  • Benefit: Natural geo-routing and DDoS protection.
  • Constraint: It is connectionless at the IP level. If the network topology changes mid-request, a TCP connection might “flicker” to a different city and break.

5. Algorithms & Implementation

Before picking an algorithm, it helps to see where each layer sits in the full request path: the L4 LB balances TCP connections, while the L7 proxy balances individual HTTP requests across services.

graph TD
    User[User Request]
    subgraph L4_Layer [L4 LB: Connection Scope]
        L4[L4 Maglev]
    end
    subgraph L7_Layer [L7 LB: Request Scope]
        Proxy1[Envoy Sidecar]
    end
    subgraph App_Layer [App Services]
        S1[Service A]
        S2[Service B]
    end
    User -- TCP Conn --> L4
    L4 -- TCP Conn --> Proxy1
    Proxy1 -- HTTP Req 1 --> S1
    Proxy1 -- HTTP Req 2 --> S2

Weighted Round Robin (WRR)

Used when servers have different capacities (e.g., Server A is a large instance, Server B is small).

[!WARNING] Memory Warning: The “Flattened List” implementation below is great for understanding but memory-inefficient for large weights (O(sum of weights)). In production, Nginx uses Smooth Weighted Round Robin to avoid “bursty” traffic patterns.

Implementation (Java)

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

class Server {
    String ip;
    int weight;

    public Server(String ip, int weight) {
        this.ip = ip;
        this.weight = weight;
    }
}

public class WeightedRoundRobin {
    private final List<String> distributionList; 
    private final AtomicInteger currentIndex;

    public WeightedRoundRobin(List<Server> servers) {
        this.distributionList = new ArrayList<>();
        this.currentIndex = new AtomicInteger(0);

        // O(Sum of Weights) space constraint
        for (Server s : servers) {
            for (int i = 0; i < s.weight; i++) {
                distributionList.add(s.ip);
            }
        }
    }

    public String getServer() {
        if (distributionList.isEmpty()) {
            return null; // no servers configured
        }
        // floorMod keeps the index non-negative even after the counter overflows
        int index = Math.floorMod(currentIndex.getAndIncrement(), distributionList.size());
        return distributionList.get(index);
    }
}

Implementation (Go)

package main

import (
	"fmt"
	"sync/atomic"
)

type Server struct {
	IP     string
	Weight int
}

type WeightedRR struct {
	peers   []string
	counter uint64
}

func NewWeightedRR(servers []Server) *WeightedRR {
	peers := []string{}
	// Flatten weights
	for _, s := range servers {
		for i := 0; i < s.Weight; i++ {
			peers = append(peers, s.IP)
		}
	}
	return &WeightedRR{
		peers: peers,
	}
}

func (lb *WeightedRR) Next() string {
	if len(lb.peers) == 0 {
		return ""
	}
	// Atomically increment, then convert to a 0-based index.
	current := atomic.AddUint64(&lb.counter, 1)
	index := (current - 1) % uint64(len(lb.peers))
	return lb.peers[index]
}

func main() {
	// Server A (weight 3) should receive 3x the traffic of server B (weight 1).
	lb := NewWeightedRR([]Server{
		{IP: "10.0.0.1", Weight: 3},
		{IP: "10.0.0.2", Weight: 1},
	})
	for i := 0; i < 8; i++ {
		// Prints 10.0.0.1 three times, then 10.0.0.2 — the bursty
		// pattern the warning above describes.
		fmt.Println(lb.Next())
	}
}
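
For contrast with the flattened list above, here is a sketch of the Smooth Weighted Round Robin mentioned in the warning: the Nginx-style algorithm that keeps O(n) state and interleaves picks instead of bursting them. This is a simplified illustration of the algorithm, not Nginx’s actual code, and it is not synchronized (a production version would guard Next with a mutex).

package main

import "fmt"

// SmoothServer tracks the per-server state used by smooth WRR:
// O(n) space instead of O(sum of weights).
type SmoothServer struct {
	IP            string
	Weight        int // configured weight
	currentWeight int // moving score, adjusted on every pick
}

type SmoothWRR struct {
	servers []*SmoothServer
	total   int // sum of configured weights
}

func NewSmoothWRR(servers []*SmoothServer) *SmoothWRR {
	total := 0
	for _, s := range servers {
		total += s.Weight
	}
	return &SmoothWRR{servers: servers, total: total}
}

// Next: every server's score grows by its weight each round, the highest
// score wins, and the winner is pushed back down by the total weight.
// This interleaves picks (A B A C A for weights 3/1/1) instead of
// bursting them (A A A B C).
func (lb *SmoothWRR) Next() *SmoothServer {
	if len(lb.servers) == 0 {
		return nil
	}
	var best *SmoothServer
	for _, s := range lb.servers {
		s.currentWeight += s.Weight
		if best == nil || s.currentWeight > best.currentWeight {
			best = s
		}
	}
	best.currentWeight -= lb.total
	return best
}

func main() {
	lb := NewSmoothWRR([]*SmoothServer{
		{IP: "A", Weight: 3},
		{IP: "B", Weight: 1},
		{IP: "C", Weight: 1},
	})
	for i := 0; i < 5; i++ {
		fmt.Print(lb.Next().IP, " ") // A B A C A
	}
}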

6. Staff-Level Deep Dive: eBPF & XDP

Why do Cloudflare and Meta build their own LBs? Because iptables is too slow at scale.

eBPF (Extended Berkeley Packet Filter) allows running sandboxed programs inside the Linux kernel. XDP (eXpress Data Path) lets you drop or forward packets right at the NIC driver level, bypassing the heavy OS networking stack (no sk_buff allocation at all).

This lets a load balancer running on standard server hardware handle millions of packets per second with minimal CPU usage.
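
To make that concrete, here is a sketch of the user-space side: loading a separately compiled XDP object and attaching it to a NIC with the cilium/ebpf Go library. The object file xdp_lb.o, the program name xdp_balancer, and the interface eth0 are all placeholders, and the packet-forwarding logic itself lives in the C/eBPF program, which is not shown here.

package main

import (
	"log"
	"net"

	"github.com/cilium/ebpf"
	"github.com/cilium/ebpf/link"
)

func main() {
	// Load the compiled eBPF object (placeholder file name).
	coll, err := ebpf.LoadCollection("xdp_lb.o")
	if err != nil {
		log.Fatalf("loading eBPF collection: %v", err)
	}
	defer coll.Close()

	iface, err := net.InterfaceByName("eth0")
	if err != nil {
		log.Fatalf("looking up interface: %v", err)
	}

	// Attach at the XDP hook: the program runs before the kernel
	// allocates an sk_buff, which is where the speedup comes from.
	l, err := link.AttachXDP(link.XDPOptions{
		Program:   coll.Programs["xdp_balancer"],
		Interface: iface.Index,
	})
	if err != nil {
		log.Fatalf("attaching XDP program: %v", err)
	}
	defer l.Close()

	log.Println("XDP program attached")
	select {} // block until killed; the link detaches when the process exits
}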