Metrics: Counters, Gauges & Histograms

Part 5 of the OpenTelemetry Masterclass

Traces tell you what happened to a single request. Metrics tell you what is happening across your entire system right now.

If Tracing is the microscope, Metrics are the cockpit dashboard. In this module, we will implement the three core metric instruments—Counters, Gauges, and Histograms—in both Java and Go, export them to Prometheus, and visualize them in Grafana.

1. The Three Pillars of Metrics

Before writing code, let’s visualize how these instruments behave.

[Interactive demo: a live metric instrument simulator — a Counter ("Total Requests", only increasing), a Gauge ("Active Connections", fluctuating), and a Histogram ("Latency Distribution" with p50/p90/p99 markers).]

2. Counters: Counting Events

A Counter is a cumulative metric: a single monotonically increasing value that can only go up, resetting to zero only when the process restarts.

[!TIP] Use Counters for values that accumulate over time: requests served, tasks completed, errors occurred.

Implementation

Java:

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.LongCounter;
import io.opentelemetry.api.metrics.Meter;

// 1. Get the Meter
Meter meter = GlobalOpenTelemetry.getMeter("order-service");

// 2. Create the Counter
LongCounter orderCounter = meter.counterBuilder("orders.created")
  .setDescription("Total number of orders created")
  .setUnit("1")
  .build();

// 3. Record data
public void createOrder(Order order) {
  // ... logic ...
  orderCounter.add(1, Attributes.of(
    AttributeKey.stringKey("type"), order.getType()
  ));
}
```
Go:

```go
import (
  "context"

  "go.opentelemetry.io/otel"
  "go.opentelemetry.io/otel/attribute"
  "go.opentelemetry.io/otel/metric"
)

// 1. Get the Meter
meter := otel.Meter("order-service")

// 2. Create the Counter (check err in real code)
orderCounter, err := meter.Int64Counter(
  "orders.created",
  metric.WithDescription("Total number of orders created"),
  metric.WithUnit("1"),
)

// 3. Record data
func CreateOrder(ctx context.Context, order Order) {
  // ... logic ...
  orderCounter.Add(ctx, 1, metric.WithAttributes(
    attribute.String("type", order.Type),
  ))
}
```

3. Gauges: Measuring State

A Gauge measures a value that can go up and down. It captures the current state.

[!NOTE] Unlike counters, gauges are not rate-aggregated. You usually want the last value or an average over time.
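The "last value" semantics can be sketched without the SDK. This toy `gauge` type is an illustration, not an OTel API: each collection cycle polls a callback, and the new reading simply replaces the old one.

```go
package main

import "fmt"

// gauge models last-value aggregation: each collection cycle invokes
// the callback, and the previous reading is replaced, never summed.
type gauge struct {
	callback func() int64
	last     int64
}

func (g *gauge) collect() { g.last = g.callback() }

func main() {
	active := int64(0)
	g := &gauge{callback: func() int64 { return active }}

	active = 40 // connections opened between scrapes
	g.collect()
	active = 12 // most of them closed again
	g.collect()

	fmt.Println(g.last) // prints 12: only the current state survives
}
```

Contrast this with a counter, where both intermediate values would have contributed to the running total.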

Implementation

Java:

```java
// Asynchronous Gauge (recommended)
// The callback is invoked whenever metrics are collected
meter.gaugeBuilder("db.connections.active")
  .setDescription("Current active database connections")
  .setUnit("1")
  .buildWithCallback(measurement -> {
    measurement.record(dbPool.getActiveCount());
  });
```
Go:

```go
// Asynchronous Gauge
// In Go, the value is reported from a callback via the Int64Observer
_, err := meter.Int64ObservableGauge(
  "db.connections.active",
  metric.WithDescription("Current active database connections"),
  metric.WithUnit("1"),
  metric.WithInt64Callback(func(ctx context.Context, o metric.Int64Observer) error {
    val := dbPool.GetActiveCount()
    o.Observe(int64(val))
    return nil
  }),
)
```

4. Histograms: Analyzing Distributions

A Histogram aggregates values into “buckets,” which makes it possible to calculate percentiles (p50, p95, p99). This is critical for analyzing latency.

[!IMPORTANT] Histogram vs Summary: OpenTelemetry focuses on Histograms because they can be aggregated across multiple instances (e.g., 5 pods serving traffic). Summaries cannot be mathematically merged.
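The aggregation argument can be seen in plain Go (no OTel types; the bucket bounds and counts are invented for illustration): two pods' histograms merge by summing bucket counts element-wise, while two pre-computed p99 values cannot be combined into a true fleet-wide p99.

```go
package main

import "fmt"

// Histogram holds cumulative bucket counts: Counts[i] is the number of
// observations <= Bounds[i].
type Histogram struct {
	Bounds []float64 // upper bounds in ms (illustrative values)
	Counts []uint64
}

// Merge sums bucket counts element-wise -- valid whenever both
// histograms share the same bucket boundaries.
func Merge(a, b Histogram) Histogram {
	out := Histogram{Bounds: a.Bounds, Counts: make([]uint64, len(a.Counts))}
	for i := range a.Counts {
		out.Counts[i] = a.Counts[i] + b.Counts[i]
	}
	return out
}

func main() {
	bounds := []float64{10, 50, 100}
	pod1 := Histogram{Bounds: bounds, Counts: []uint64{5, 20, 30}}
	pod2 := Histogram{Bounds: bounds, Counts: []uint64{10, 15, 25}}
	fleet := Merge(pod1, pod2)
	fmt.Println(fleet.Counts) // prints [15 35 55]
	// A pre-computed p99 from each pod has no such merge rule:
	// averaging two p99 values does not yield the fleet p99.
}
```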

Implementation

Java:

```java
// Create Histogram
DoubleHistogram latencyHistogram = meter.histogramBuilder("http.server.duration")
  .setDescription("Incoming request duration")
  .setUnit("ms")
  .build();

// Record Value
long startTime = System.currentTimeMillis();
try {
  processRequest();
} finally {
  double duration = System.currentTimeMillis() - startTime;
  latencyHistogram.record(duration, Attributes.of(
    AttributeKey.stringKey("route"), "/api/checkout"
  ));
}
```
Go:

```go
// Create Histogram (check err in real code)
latencyHistogram, err := meter.Float64Histogram(
  "http.server.duration",
  metric.WithDescription("Incoming request duration"),
  metric.WithUnit("ms"),
)

// Record Value (inside a handler where ctx is in scope)
start := time.Now()
defer func() {
  duration := float64(time.Since(start).Milliseconds())
  latencyHistogram.Record(ctx, duration, metric.WithAttributes(
    attribute.String("route", "/api/checkout"),
  ))
}()
ProcessRequest()
```

5. The Cardinality Explosion

This is the number one mistake developers make with metrics.

Cardinality refers to the number of unique combinations of metric names and attribute values.

[!WARNING] Never use high-cardinality data as attributes. If you add user_id (1 million users) to a metric, you create 1 million unique time series. This will crash your Prometheus server and cost a fortune in cloud monitoring.

The Golden Rule: Attributes should be bounded (enums, status codes, regions). Logs/Traces are for unbounded data (IDs, exact errors).

| Attribute | Cardinality | Safety |
| --- | --- | --- |
| http.status_code | Low (~10–50) | ✅ Safe |
| region | Low (~20) | ✅ Safe |
| order_type | Low (~5) | ✅ Safe |
| customer_id | High (millions) | ⚠️ DANGER |
| order_id | High (billions) | ⚠️ DANGER |
| error_message | Infinite | ⚠️ DANGER |
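The explosion is simple multiplication (the attribute sizes below are the illustrative numbers from the table): every unique combination of attribute values becomes its own time series.

```go
package main

import "fmt"

// seriesCount multiplies the number of distinct values per attribute:
// one time series exists per unique combination.
func seriesCount(distinctValues ...int) int {
	n := 1
	for _, v := range distinctValues {
		n *= v
	}
	return n
}

func main() {
	// Bounded attributes: status_code (~10) x region (~20) x order_type (~5)
	fmt.Println(seriesCount(10, 20, 5)) // prints 1000 -- fine
	// Add customer_id (1,000,000 users) and the same metric explodes:
	fmt.Println(seriesCount(10, 20, 5, 1_000_000)) // prints 1000000000
}
```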

6. Exporting to Prometheus

Metrics are useless if they stay in your application. We need to export them.

Data Pipeline

```mermaid
graph LR
  subgraph App[Application]
    Code[Instrumentation Code] --> SDK[OTel SDK]
    SDK --> Exp[Prometheus Exporter]
  end
  Exp -- "Scrape (HTTP /metrics)" --> Prom[Prometheus Server]
  Prom -- "Query (PromQL)" --> Graf[Grafana]
  style Prom fill:#e6522c,stroke:#333,color:white
  style Graf fill:#f46800,stroke:#333,color:white
```

Configuration (prometheus.yml)

Ensure your app exposes a metrics endpoint (the Prometheus exporter typically serves /metrics on port 9464).

```yaml
# prometheus.yml
scrape_configs:
  - job_name: 'payment-service'
    scrape_interval: 10s
    static_configs:
      - targets: ['payment-service:9464']
```

7. Visualizing in Grafana

Once data is in Prometheus, use PromQL to visualize it.

Common Queries

1. Request Rate (Requests per second)

```promql
sum(rate(http_server_duration_count[5m])) by (route)
```

Calculates the per-second rate averaged over 5 minutes, grouped by route.
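What `rate()` computes can be approximated in a few lines of Go (a simplification: real PromQL also extrapolates to the window boundaries and uses more than two samples): divide the counter's increase by the elapsed seconds, treating any decrease as a counter reset.

```go
package main

import "fmt"

// ratePerSecond approximates PromQL rate() from two samples of a
// cumulative counter. A decrease means the process restarted, so the
// counter is assumed to have started again from zero.
func ratePerSecond(v1, v2, seconds float64) float64 {
	increase := v2 - v1
	if increase < 0 { // counter reset: v2 already counts from zero
		increase = v2
	}
	return increase / seconds
}

func main() {
	// 5m window: counter went from 12000 to 15000 requests
	fmt.Println(ratePerSecond(12000, 15000, 300)) // prints 10 (req/s)
	// After a restart the raw delta is negative; reset handling kicks in
	fmt.Println(ratePerSecond(12000, 600, 300)) // prints 2 (req/s)
}
```

This reset handling is why raw counter values are rarely graphed directly: the rate, not the absolute total, is what carries meaning.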

2. 99th Percentile Latency (p99)

```promql
histogram_quantile(0.99, sum(rate(http_server_duration_bucket[5m])) by (le, route))
```

Approximates the p99 latency using histogram buckets.
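The approximation `histogram_quantile` performs can be sketched in Go (simplified: the real function handles several extra edge cases, but the core idea is the same): find the bucket containing the target rank and interpolate linearly inside it. The bucket bounds and counts below are made-up numbers.

```go
package main

import "fmt"

// quantileFromBuckets estimates quantile q from cumulative bucket
// counts, interpolating linearly inside the bucket where the target
// rank falls -- the same idea as PromQL's histogram_quantile.
func quantileFromBuckets(q float64, bounds []float64, cumCounts []uint64) float64 {
	total := float64(cumCounts[len(cumCounts)-1])
	rank := q * total
	lowerBound, lowerCount := 0.0, 0.0
	for i, c := range cumCounts {
		if float64(c) >= rank {
			// Linear interpolation within (lowerBound, bounds[i]]
			frac := (rank - lowerCount) / (float64(c) - lowerCount)
			return lowerBound + (bounds[i]-lowerBound)*frac
		}
		lowerBound, lowerCount = bounds[i], float64(c)
	}
	return bounds[len(bounds)-1]
}

func main() {
	bounds := []float64{50, 100, 250, 500}   // "le" upper bounds, in ms
	counts := []uint64{600, 900, 990, 1000}  // cumulative counts
	fmt.Println(quantileFromBuckets(0.99, bounds, counts)) // ≈ 250 ms
}
```

Note the estimate is only as good as the bucket layout: the p99 here could be anywhere between 100 ms and 250 ms, and interpolation picks a point inside that bucket.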

3. Error Ratio

```promql
  sum(rate(http_server_duration_count{status_code=~"5.*"}[5m]))
/
  sum(rate(http_server_duration_count[5m]))
```

Divides the rate of 5xx responses by the total rate to get the error ratio (multiply by 100 for a percentage).


8. Summary

| Instrument | Behavior | Use Case |
| --- | --- | --- |
| Counter | Only goes up (monotonic) | Total requests, errors, tasks completed |
| Gauge | Goes up and down | Memory usage, queue size, thread count |
| Histogram | Buckets values | Latency, request size, response size |

In the next module, we will explore Structured Logging and how to correlate logs with the traces and metrics we’ve built.