Observability 101: The Vital Signs

Imagine you are a doctor. A patient walks in and says, “I don’t feel good.”

To help them, you need data:

The Pulse: Is it beating at a normal speed? (Metrics)
The History: What did they eat yesterday? (Logs)
The X-Ray: Exactly where is the pain coming from? (Traces)

In System Design, we call this Observability. It’s how we look inside a complex machine to see why it’s behaving badly.

1. Metrics: The Pulse of the System

Metrics are numbers that tell you How Much or How Fast.

CPU Usage: How hard is the computer working? (0% to 100%).
Error Rate: What fraction of requests end in an “Error 500” page?
Latency: How many milliseconds does it take to load the homepage?

[!TIP] Metrics are for Dashboards. You look at a graph of the “Pulse” to see if something is wrong right now. If the line suddenly spikes, the patient is in trouble.
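To make the idea concrete, here is a minimal sketch of how two of these numbers could be computed from raw request data. The sample list and the function names (`error_rate`, `average_latency_ms`) are made up for illustration; a real system would normally rely on a metrics library rather than hand-rolled code like this.

```python
from statistics import mean

# Hypothetical sample: (HTTP status code, latency in milliseconds) per request.
requests = [
    (200, 120), (200, 95), (500, 480), (200, 110), (200, 105),
    (200, 130), (500, 510), (200, 90), (200, 100), (200, 115),
]

def error_rate(samples):
    """Fraction of requests that returned a 5xx (server error) status."""
    errors = sum(1 for status, _ in samples if status >= 500)
    return errors / len(samples)

def average_latency_ms(samples):
    """Mean latency across all requests, in milliseconds."""
    return mean(latency for _, latency in samples)

print(f"Error rate: {error_rate(requests):.0%}")                  # Error rate: 20%
print(f"Average latency: {average_latency_ms(requests):.1f} ms")  # Average latency: 185.5 ms
```

Notice that the output is just two numbers. That is what makes metrics cheap to store and easy to graph, and also why they cannot tell you which user was affected.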

2. Logs: The Patient History

Logs are a timestamped diary of the events your application chooses to write down.

[12:01:05] User #42 logged in.
[12:01:10] Database query: SELECT * FROM users took 500ms.
[12:01:15] ERROR: Failed to process payment for User #42.

If the “Pulse” (Metrics) tells you the patient is sick, the Logs tell you exactly what they were doing at the moment they felt the pain.
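As a rough sketch, lines like the ones above could be produced with Python's standard `logging` module. The logger name `pizza_shop` and the in-memory stream are assumptions made so the example is self-contained; a real application would usually write to a file or a log collector.

```python
import io
import logging

def build_logger(stream):
    """Create a logger that writes timestamped lines to the given stream."""
    logger = logging.getLogger("pizza_shop")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()   # avoid duplicate lines if called twice
    logger.propagate = False  # keep output out of the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("[%(asctime)s] %(levelname)s: %(message)s",
                          datefmt="%H:%M:%S")
    )
    logger.addHandler(handler)
    return logger

stream = io.StringIO()  # stands in for a log file
log = build_logger(stream)

log.info("User #%d logged in.", 42)
log.info("Database query: %s took %dms.", "SELECT * FROM users", 500)
log.error("Failed to process payment for User #%d.", 42)

lines = stream.getvalue().splitlines()
# Keep only the error lines, the way you might when investigating an incident.
errors = [line for line in lines if " ERROR: " in line]
print("\n".join(errors))
```

The timestamps will show whatever time the code runs, so they are not reproduced here. Filtering down to the `ERROR` lines is the simplest form of what log search tools do at scale.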

3. Traces: The X-Ray

In modern systems, a single request (like “Buy a Pizza”) might travel through 10 different servers.

A Trace follows that request from start to finish, usually by tagging it with an ID that every server passes along.

Handoff 1: Website -> Load Balancer (10ms)
Handoff 2: Load Balancer -> Inventory App (50ms)
Handoff 3: Inventory App -> Database (500ms) <-- Found the problem!

Traces show you exactly where the “bottleneck” is in a long chain of events.
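The handoffs above can be sketched as a small data structure. This is a simplified illustration, not how any particular tracing tool stores its data: the `Span` class, the `find_bottleneck` helper, and the ID `trace-001` are all hypothetical, and the sketch assumes the handoffs happen one after another, so their times can simply be added up.

```python
from dataclasses import dataclass

@dataclass
class Span:
    """One handoff in a trace: who called whom, and how long it took."""
    source: str
    target: str
    duration_ms: int

# The three handoffs from the example, sharing one hypothetical trace ID.
trace_id = "trace-001"
spans = [
    Span("Website", "Load Balancer", 10),
    Span("Load Balancer", "Inventory App", 50),
    Span("Inventory App", "Database", 500),
]

def find_bottleneck(spans):
    """Return the span with the longest duration."""
    return max(spans, key=lambda span: span.duration_ms)

slowest = find_bottleneck(spans)
total_ms = sum(span.duration_ms for span in spans)

print(f"[{trace_id}] total: {total_ms}ms")
print(f"[{trace_id}] slowest: {slowest.source} -> {slowest.target} "
      f"({slowest.duration_ms}ms)")
# [trace-001] total: 560ms
# [trace-001] slowest: Inventory App -> Database (500ms)
```

The shared ID is what ties the handoffs together, so the slow step can be pinned to one request instead of guessed at from averages.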

4. Why Does This Matter?

Without Observability, you are “flying blind.” When a user says, “The site is slow,” you have no way to know if it’s the network, the database, or a bug in your code.

As we scale to millions of users, we can’t look at one server at a time. We need aggregate “Vital Signs” to keep the system healthy.

Beginner’s Checklist

Do I understand that Metrics are for numbers/graphs?
Do I understand that Logs are for specific events/errors?
Do I understand that Traces follow a journey across many servers?

In the next chapter, we’ll look at Advanced Observability, where we learn how to handle too much data using Sampling.