Module Review: Sampling Strategies
Key Takeaways
- Cost vs Visibility: Observability is a trade-off. 100% sampling at scale is often cost-prohibitive due to storage and network costs.
- Head Sampling: Happens at the origin (SDK/Agent). It’s cheap and efficient but naive: the decision is made before the request’s outcome is known, so you might drop important error traces.
- Tail Sampling: Happens at the collector. It allows “smart” decisions (keep all errors, keep slow requests) but is resource-intensive (memory/CPU).
- Architecture: Tail sampling across multiple collector instances requires a 2-Tier Collector Architecture. A Load Balancing Exporter (Tier 1) routes all spans for a TraceID to the same Sampler (Tier 2).
- Consistency: Parent-based sampling (`parentbased_traceidratio`) ensures that if a parent service samples a trace, all downstream children also sample it, preserving trace completeness.
Flashcards
Head Sampling
Where does the sampling decision happen?
At the Source (SDK/Agent)
The decision is made before the span is even exported. This saves network bandwidth and processing power but risks dropping interesting traces.
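To make the "decision at the source" idea concrete, here is a minimal sketch of a ratio-based head sampler. It assumes the decision is derived from the low 64 bits of the trace ID, which is a common approach but not necessarily what every SDK does; `should_sample` and `new_trace_id` are hypothetical names used only for illustration.

```python
import random

TRACE_ID_LIMIT = (1 << 64) - 1  # assumption: only the low 64 bits are compared


def should_sample(trace_id: int, ratio: float) -> bool:
    """Decide from the trace ID alone, before any span is exported."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    bound = round(ratio * (TRACE_ID_LIMIT + 1))
    return (trace_id & TRACE_ID_LIMIT) < bound


def new_trace_id(rng: random.Random) -> int:
    """Generate a random 128-bit trace ID (illustrative helper)."""
    return rng.getrandbits(128)


if __name__ == "__main__":
    rng = random.Random(42)
    ids = [new_trace_id(rng) for _ in range(100_000)]
    kept = sum(should_sample(t, 0.1) for t in ids)
    print(f"kept {kept} of {len(ids)} traces at ratio 0.1")
```

Because the decision depends only on the trace ID, it is the same every time for a given trace, and it cannot take errors or latency into account, which is exactly the limitation described above.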
Tail Sampling Challenge
What architectural component is required for Tail Sampling to work correctly in a distributed cluster?
Load Balancing Exporter
It ensures that all spans belonging to the same TraceID are routed to the same collector instance, allowing a complete decision to be made.
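The routing idea can be sketched as follows. This is a simplified illustration that uses hash-modulo-N over a fixed backend list; the real exporter's hashing scheme and membership handling may differ. The backend names and `pick_backend` are hypothetical.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Sequence, Set


def pick_backend(trace_id: str, backends: Sequence[str]) -> str:
    """Map a trace ID to one backend using a stable hash."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]


if __name__ == "__main__":
    # Hypothetical Tier 2 sampler instances.
    backends = ["sampler-0:4317", "sampler-1:4317", "sampler-2:4317"]
    # (trace_id, span_name) pairs arriving at Tier 1 in arbitrary order.
    spans = [
        ("trace-a", "GET /checkout"),
        ("trace-b", "GET /cart"),
        ("trace-a", "SELECT orders"),
        ("trace-b", "redis GET"),
        ("trace-a", "POST /payment"),
    ]
    seen: Dict[str, Set[str]] = defaultdict(set)
    for trace_id, _name in spans:
        seen[trace_id].add(pick_backend(trace_id, backends))
    for trace_id, targets in seen.items():
        print(trace_id, "->", sorted(targets))
```

Each trace ID maps to exactly one sampler, so that sampler sees the whole trace before it decides.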
ParentBased Sampling
Why is `parentbased_traceidratio` preferred over simple `traceidratio`?
Trace Integrity
It respects the sampling decision of the upstream caller. If Service A samples a request, Service B will also sample it, so you don't get "broken" partial traces.
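A minimal sketch of the parent-based rule, assuming the caller's decision arrives as a boolean "sampled" flag in the propagated context. `ParentContext`, `ratio_decision`, and `parent_based_decision` are hypothetical names, not SDK APIs.

```python
from dataclasses import dataclass
from typing import Optional

TRACE_ID_LIMIT = (1 << 64) - 1  # assumption: ratio check uses the low 64 bits


@dataclass(frozen=True)
class ParentContext:
    """Hypothetical stand-in for the span context propagated by the caller."""
    trace_id: int
    sampled: bool


def ratio_decision(trace_id: int, ratio: float) -> bool:
    """Plain ratio sampling, used only when there is no parent."""
    return (trace_id & TRACE_ID_LIMIT) < round(ratio * (TRACE_ID_LIMIT + 1))


def parent_based_decision(
    trace_id: int, parent: Optional[ParentContext], ratio: float
) -> bool:
    """Follow the parent's decision if there is one; otherwise apply the ratio."""
    if parent is not None:
        return parent.sampled
    return ratio_decision(trace_id, ratio)


if __name__ == "__main__":
    trace_id = 0x0123456789ABCDEF0123456789ABCDEF
    # Service A is the root and decides by ratio.
    a = parent_based_decision(trace_id, None, 0.5)
    # Service B has a different local ratio but still follows A.
    b = parent_based_decision(trace_id, ParentContext(trace_id, a), 0.0)
    print("service A sampled:", a, "| service B sampled:", b)
```

The local ratio only matters for root spans; every downstream service inherits the root's decision, which keeps the trace complete.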
Sampling Policies
Name three common policies used in Tail Sampling.
Latency, Error, Probabilistic
Keep traces longer than X ms (Latency), keep traces with status=ERROR (Error), and keep a random % of the rest (Probabilistic).
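The three policies can be sketched as a single decision function that runs once a trace is fully buffered. This is an illustration under assumptions: the `Span` shape and `keep_trace` are hypothetical, and a real collector may evaluate and combine policies differently (for example, with a hash-based rather than random probabilistic choice).

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    """Hypothetical minimal span: timing plus a status string."""
    name: str
    start_ms: float
    end_ms: float
    status: str = "OK"  # "OK" or "ERROR"


def keep_trace(
    spans: List[Span],
    threshold_ms: float = 1000.0,
    sampling_percentage: float = 1.0,
    rng: Optional[random.Random] = None,
) -> Optional[str]:
    """Return the name of the first policy that keeps the trace, else None."""
    if not spans:
        return None
    if any(s.status == "ERROR" for s in spans):
        return "errors"
    duration = max(s.end_ms for s in spans) - min(s.start_ms for s in spans)
    if duration > threshold_ms:
        return "slow"
    rng = rng or random.Random()
    if rng.random() * 100 < sampling_percentage:
        return "random"
    return None


if __name__ == "__main__":
    failed = [Span("GET /pay", 0, 120), Span("charge card", 10, 110, "ERROR")]
    slow = [Span("GET /report", 0, 2500), Span("SELECT big_table", 5, 2400)]
    print("failed trace kept by:", keep_trace(failed))
    print("slow trace kept by:", keep_trace(slow))
```

Note that the function needs every span of the trace before it can answer, which is why tail sampling has to buffer traces in memory.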
Cost Driver
What is the primary driver of Observability costs?
Volume (Data Ingestion)
Storage and network transfer costs scale roughly linearly with the number of spans/logs generated. Sampling is the primary lever to control this.
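A back-of-the-envelope sketch of that scaling. All traffic figures below are hypothetical and chosen only to show the arithmetic, and `daily_ingest_gb` is an illustrative helper, not part of any tool.

```python
def daily_ingest_gb(
    requests_per_sec: float,
    spans_per_request: float,
    bytes_per_span: float,
    sample_ratio: float,
) -> float:
    """Estimate span data ingested per day, in GB (10^9 bytes)."""
    spans_per_day = requests_per_sec * spans_per_request * 86_400
    return spans_per_day * sample_ratio * bytes_per_span / 1e9


if __name__ == "__main__":
    # Hypothetical workload: 1,000 req/s, 10 spans per request, 500 bytes per span.
    for ratio in (1.0, 0.1, 0.01):
        gb = daily_ingest_gb(1_000, 10, 500, ratio)
        print(f"sample ratio {ratio:>4}: about {gb:.2f} GB/day")
```

Under these assumed numbers, full sampling comes to about 432 GB per day, and each tenfold cut in the sample ratio cuts ingestion by the same factor.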
Remote Sampling
What is the main benefit of using Remote Sampling (Jaeger Style)?
Dynamic Configuration
It allows the SDK to poll the Collector for sampling rates, enabling you to change rates per-service without redeploying the application.
Cheat Sheet
Decision Matrix: Which Strategy?
| Need | Strategy | Trade-off |
|---|---|---|
| Lowest Cost | Head Sampling (1%) | Misses rare errors |
| Keep All Errors | Tail Sampling | Higher RAM/CPU usage |
| Slow Requests | Tail Sampling | Requires trace buffering |
| Dynamic Control | Remote Sampling | Complex setup (Jaeger) |
Head Sampling (Java Agent)
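```shell
# Sample 10% of root traces; child spans follow the parent's decision
OTEL_TRACES_SAMPLER=parentbased_traceidratio
OTEL_TRACES_SAMPLER_ARG=0.1

# Alternative: sample everything (100%)
OTEL_TRACES_SAMPLER=always_on
```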
Tail Sampling (Collector Config)
```yaml
processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 50000
    expected_new_traces_per_sec: 1000
    policies:
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow
        type: latency
        latency: {threshold_ms: 1000}
      - name: random
        type: probabilistic
        probabilistic: {sampling_percentage: 1}
```
Next Steps
Now that you’ve mastered Sampling, you have a complete, production-ready Observability stack.
- Review: Go back to Module 02: Zero to Tracing to refresh the basics.
- Glossary: Check the OpenTelemetry Glossary for definitions.