Module Review: Sampling Strategies

Key Takeaways

  • Cost vs Visibility: Observability is a trade-off. Sampling 100% of traces at scale is often prohibitively expensive in storage and network transfer.
  • Head Sampling: Happens at the origin (SDK/Agent). It’s cheap and efficient but naive: the decision is made before the outcome of the request is known, so you might drop important error traces.
  • Tail Sampling: Happens at the collector. It allows “smart” decisions (keep all errors, keep slow requests) but is resource-intensive (memory/CPU).
  • Architecture: Tail sampling across multiple collector instances requires a 2-Tier Collector Architecture. A Load Balancing Exporter (Tier 1) ensures all spans for a TraceID reach the same Sampler (Tier 2).
  • Consistency: ParentBased sampling ensures that if a parent service samples a trace, all downstream children also sample it (and if the parent drops it, the children drop it too), preserving trace completeness.

Flashcards

Head Sampling

Where does the sampling decision happen?

At the Source (SDK/Agent)

The decision is made when the span is created, before anything is exported. This saves network bandwidth and processing power but risks dropping interesting traces.
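As a rough illustration of how a ratio-based head sampler can decide at the source, the sketch below compares the low 64 bits of the trace ID against a bound derived from the ratio. The function name `should_sample` is hypothetical, and this mirrors the general idea rather than the exact algorithm of any particular SDK.

```python
import secrets

TRACE_ID_MASK = (1 << 64) - 1  # assumption: only the low 64 bits of the trace ID are used


def should_sample(trace_id: int, ratio: float) -> bool:
    """Deterministic decision: the same trace ID always gets the same answer."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0.0 and 1.0")
    bound = round(ratio * (1 << 64))
    return (trace_id & TRACE_ID_MASK) < bound


# Usage: roughly 10% of random trace IDs should be kept.
trace_ids = [secrets.randbits(128) for _ in range(10_000)]
kept = sum(should_sample(t, 0.1) for t in trace_ids)
print(f"kept {kept} of {len(trace_ids)} traces")
```

Because the decision depends only on the trace ID and the ratio, no span data has to be collected or sent before the trace is kept or dropped.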

Tail Sampling Challenge

What architectural component is required for Tail Sampling to work correctly in a distributed cluster?

Load Balancing Exporter

It ensures that all spans belonging to the same TraceID are routed to the same collector instance, allowing a complete decision to be made.
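The routing idea can be sketched with a small consistent-hash ring keyed on the TraceID. The class `TraceRouter`, the backend names, and the choice of SHA-256 are illustrative assumptions, not the exporter's actual implementation.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a 64-bit integer position on the ring."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class TraceRouter:
    """Hypothetical Tier 1 router: one TraceID always maps to one backend."""

    def __init__(self, backends, vnodes=100):
        if not backends:
            raise ValueError("at least one backend is required")
        # Each backend gets several virtual nodes to spread load more evenly.
        self._ring = sorted(
            (_hash(f"{backend}#{i}"), backend)
            for backend in backends
            for i in range(vnodes)
        )
        self._keys = [position for position, _ in self._ring]

    def backend_for(self, trace_id: str) -> str:
        # Pick the first virtual node clockwise from the trace's position.
        idx = bisect.bisect_right(self._keys, _hash(trace_id)) % len(self._ring)
        return self._ring[idx][1]


# Usage: every span of a trace carries the same TraceID, so it lands on the same sampler.
router = TraceRouter(["sampler-0:4317", "sampler-1:4317", "sampler-2:4317"])
print(router.backend_for("4bf92f3577b34da6a3ce929d0e0e4736"))
```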

ParentBased Sampling

Why is `parentbased_traceidratio` preferred over simple `traceidratio`?

Trace Integrity

It respects the sampling decision of the upstream caller. If Service A samples a request, Service B will also sample it, ensuring you don't get "broken" partial traces.
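A minimal sketch of the parent-based rule, assuming the upstream decision arrives as a simple flag (`None` meaning there is no parent). The function name `parent_based_decision` is hypothetical, and the ratio fallback is a simplified stand-in for a real ratio sampler.

```python
from typing import Optional


def parent_based_decision(
    parent_sampled: Optional[bool], trace_id: int, ratio: float
) -> bool:
    """Follow the caller's decision if there is one; otherwise sample by ratio."""
    if parent_sampled is not None:
        return parent_sampled  # child never overrides the upstream decision
    # Root span: fall back to a deterministic ratio check on the trace ID.
    bound = round(ratio * (1 << 64))
    return (trace_id & ((1 << 64) - 1)) < bound


# Usage: Service B receives a request that Service A already sampled.
print(parent_based_decision(parent_sampled=True, trace_id=42, ratio=0.0))
```

With a plain ratio sampler, each service would decide independently, so a trace kept by Service A could lose its Service B spans.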

Sampling Policies

Name three common policies used in Tail Sampling.

Latency, Error, Probabilistic

Keep traces longer than X ms (Latency), keep traces with status=ERROR (Error), and keep a random % of the rest (Probabilistic).
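The three policies can be combined as "keep if any policy matches". The sketch below is a simplified model, not the collector's code: the `Span` type and `keep_trace` function are hypothetical, and the longest span duration is used as a stand-in for the duration of the whole trace.

```python
import random
from dataclasses import dataclass


@dataclass
class Span:
    duration_ms: float
    status: str  # "OK", "ERROR", or "UNSET"


def keep_trace(spans, threshold_ms=1000.0, sampling_percentage=1.0, rng=random.random):
    """Return True if any of the three policies decides to keep the trace."""
    # Error policy: any span with status ERROR keeps the whole trace.
    if any(span.status == "ERROR" for span in spans):
        return True
    # Latency policy (assumption: longest span approximates trace duration).
    if max((span.duration_ms for span in spans), default=0.0) > threshold_ms:
        return True
    # Probabilistic policy: keep a random percentage of everything else.
    return rng() * 100 < sampling_percentage


# Usage: a fast, successful trace is kept only by the probabilistic policy.
print(keep_trace([Span(12.0, "OK"), Span(3.5, "OK")]))
```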

Cost Driver

What is the primary driver of Observability costs?

Volume (Data Ingestion)

Storage and network transfer costs scale linearly with the number of spans/logs generated. Sampling is the primary lever to control this.
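To make the linear relationship concrete, here is a back-of-the-envelope estimate. The traffic figures (10,000 spans per second, 500 bytes per span) are made-up inputs for illustration, and the function name `daily_ingest_gb` is hypothetical.

```python
SECONDS_PER_DAY = 86_400


def daily_ingest_gb(spans_per_sec: float, avg_span_bytes: float, sampling_ratio: float) -> float:
    """Estimated gigabytes of span data ingested per day after sampling."""
    bytes_per_day = spans_per_sec * SECONDS_PER_DAY * avg_span_bytes * sampling_ratio
    return bytes_per_day / 1e9


# Usage: compare 100% sampling with 10% sampling for the same traffic.
for ratio in (1.0, 0.1):
    print(f"ratio={ratio}: {daily_ingest_gb(10_000, 500, ratio):.1f} GB/day")
```

Under these assumed inputs, dropping from 100% to 10% sampling cuts ingest from about 432 GB/day to about 43.2 GB/day.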

Remote Sampling

What is the main benefit of using Remote Sampling (Jaeger Style)?

Dynamic Configuration

It allows the SDK to poll the Collector for sampling rates, enabling you to change rates per-service without redeploying the application.
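The sketch below shows how per-service rates could be resolved from a strategies document that the SDK fetches periodically. The JSON layout is modeled loosely on Jaeger's sampling strategies file; the service names, rates, and the helper `rate_for` are hypothetical.

```python
import json

# Assumed strategies document, as the SDK might receive it when polling.
STRATEGIES_JSON = """
{
  "service_strategies": [
    {"service": "checkout", "type": "probabilistic", "param": 0.5},
    {"service": "search", "type": "probabilistic", "param": 0.05}
  ],
  "default_strategy": {"type": "probabilistic", "param": 0.01}
}
"""


def rate_for(service: str, strategies: dict) -> float:
    """Return the probabilistic rate for a service, or the default rate."""
    for entry in strategies.get("service_strategies", []):
        if entry["service"] == service and entry["type"] == "probabilistic":
            return float(entry["param"])
    return float(strategies["default_strategy"]["param"])


# Usage: editing the document on the server changes rates without a redeploy.
strategies = json.loads(STRATEGIES_JSON)
print(rate_for("checkout", strategies), rate_for("unknown-service", strategies))
```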

Cheat Sheet

Decision Matrix: Which Strategy?

Need             | Strategy            | Trade-off
Lowest Cost      | Head Sampling (1%)  | Misses rare errors
Keep All Errors  | Tail Sampling       | Higher RAM/CPU usage
Slow Requests    | Tail Sampling       | Requires trace buffering
Dynamic Control  | Remote Sampling     | Complex setup (Jaeger)

Head Sampling (Java Agent)

# Sample 10% of new traces; otherwise follow the parent's decision
OTEL_TRACES_SAMPLER=parentbased_traceidratio
OTEL_TRACES_SAMPLER_ARG=0.1

# Alternative: sample 100% of traces
OTEL_TRACES_SAMPLER=always_on

Tail Sampling (Collector Config)

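processors:
  tail_sampling:
    decision_wait: 10s                  # how long to buffer a trace before deciding
    num_traces: 50000                   # maximum traces held in memory
    expected_new_traces_per_sec: 1000
    policies:
      # Keep every trace that contains an error
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      # Keep traces slower than 1000 ms
      - name: slow
        type: latency
        latency: {threshold_ms: 1000}
      # Keep 1% of everything else
      - name: random
        type: probabilistic
        probabilistic: {sampling_percentage: 1}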
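As a rough sanity check on the buffer settings above, the number of traces in flight can be approximated as the new-trace rate multiplied by the decision wait. This is a simplified estimate under that assumption, and the helper names are hypothetical. Real memory use also depends on spans per trace and span size.

```python
def buffered_traces(new_traces_per_sec: int, decision_wait_s: int) -> int:
    """Approximate number of traces waiting for a sampling decision."""
    return new_traces_per_sec * decision_wait_s


def buffer_is_large_enough(num_traces: int, new_traces_per_sec: int, decision_wait_s: int) -> bool:
    """True if num_traces covers the estimated in-flight traces."""
    return buffered_traces(new_traces_per_sec, decision_wait_s) <= num_traces


# Usage with the values from the config: 1000 traces/s, 10 s wait, 50000 slots.
print(buffered_traces(1000, 10), buffer_is_large_enough(50_000, 1000, 10))
```

With the configured values, about 10,000 traces are in flight at once, which leaves headroom under the 50,000 limit for traffic bursts.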

Next Steps

Now that you’ve covered Sampling, you have the building blocks of a complete, production-ready Observability stack.