
OpenTelemetry in Production: Phased Rollout

Complete guide to deploying OpenTelemetry in production with a phased approach: start with distributed tracing, add metrics, integrate logs. Includes sampling strategies, cost optimization, and Kubernetes patterns.

Published: January 2, 2025

Updated: January 2, 2025

Why OpenTelemetry (And Why Now)

Before OpenTelemetry, observability was vendor lock-in hell. Want distributed tracing? Pick Jaeger, Zipkin, or Datadog—and instrument your entire codebase with their SDK. Switch vendors? Re-instrument everything.

OpenTelemetry (OTEL) solves this: One SDK, vendor-neutral telemetry. Instrument once, export to any backend (Jaeger, Tempo, Prometheus, Datadog, New Relic, Honeycomb). It's the "POSIX for observability."
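In practice, swapping backends means changing exporter configuration, not your instrumentation. A minimal sketch (the endpoint is a placeholder; hosted vendors typically also require an auth header):

# Instrument once; only the OTLP exporter's destination changes between backends.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Point at self-hosted Tempo today, or any other OTLP-compatible endpoint tomorrow.
exporter = OTLPSpanExporter(endpoint="tempo.observability.svc:4317", insecure=True)

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)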

Phased Rollout Strategy
  • Phase 1 (Weeks 1-2): Traces only - instrument critical paths, set up Tempo/Jaeger
  • Phase 2 (Weeks 3-4): Add metrics - correlate with traces, optimize sampling
  • Phase 3 (Weeks 5-6): Integrate logs - structured logging with trace context
  • Ongoing: Cost optimization, custom spans, advanced sampling

Phase 1: Distributed Tracing (The Foundation)

Step 1: Deploy OTEL Collector in Kubernetes

# Install OTEL Collector as DaemonSet
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm install otel-collector open-telemetry/opentelemetry-collector \
  --namespace observability --create-namespace \
  --set mode=daemonset \
  --values otel-values.yaml

# otel-values.yaml
config:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
        http:
          endpoint: 0.0.0.0:4318
  processors:
    batch:
      timeout: 10s
      send_batch_size: 1024
    # Tail-based sampling (keep only interesting traces)
    tail_sampling:
      policies:
        - name: errors
          type: status_code
          status_code:
            status_codes: [ERROR]
        - name: slow-requests
          type: latency
          latency:
            threshold_ms: 1000
        - name: sample-fast-requests
          type: probabilistic
          probabilistic:
            sampling_percentage: 10  # Keep 10% of fast requests
  exporters:
    otlp:
      endpoint: tempo.observability.svc:4317
      tls:
        insecure: true
    prometheus:
      endpoint: "0.0.0.0:8889"
  service:
    pipelines:
      traces:
        receivers: [otlp]
        processors: [batch, tail_sampling]
        exporters: [otlp]
      metrics:
        receivers: [otlp]
        processors: [batch]
        exporters: [prometheus]

Step 2: Instrument Your Application (Python Example)

# requirements.txt
opentelemetry-api
opentelemetry-sdk
opentelemetry-instrumentation-flask
opentelemetry-instrumentation-requests
opentelemetry-instrumentation-sqlalchemy
opentelemetry-exporter-otlp

# app.py
from flask import Flask, request, jsonify

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor

# Initialize tracing
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

# Export to OTEL Collector
otlp_exporter = OTLPSpanExporter(
    endpoint="otel-collector.observability.svc:4317",
    insecure=True
)
span_processor = BatchSpanProcessor(otlp_exporter)
trace.get_tracer_provider().add_span_processor(span_processor)

# Auto-instrument Flask, outgoing HTTP calls, and SQLAlchemy
app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)
RequestsInstrumentor().instrument()
SQLAlchemyInstrumentor().instrument(engine=db.engine)  # `db` is your SQLAlchemy instance

# Manual instrumentation for critical sections
@app.route('/api/order')
def create_order():
    order_data = request.get_json()
    with tracer.start_as_current_span("create_order") as span:
        span.set_attribute("user.id", request.user_id)  # set by your auth middleware
        span.set_attribute("order.amount", order_data['amount'])

        # This call automatically becomes a child span
        result = process_payment(order_data)

        span.set_attribute("payment.status", result.status)
        return jsonify(result)

def process_payment(data):
    with tracer.start_as_current_span("process_payment"):
        # Payment logic here
        pass

Step 3: Deploy Tempo for Trace Storage

# Install Grafana Tempo (cost-effective trace storage)
helm repo add grafana https://grafana.github.io/helm-charts
helm install tempo grafana/tempo \
  --namespace observability \
  --set tempo.storage.trace.backend=s3 \
  --set tempo.storage.trace.s3.bucket=tempo-traces \
  --set tempo.storage.trace.s3.region=us-east-1

# Tempo stores traces in S3 ($0.023/GB vs $1/GB for APM vendors)

Phase 2: Add Metrics (Weeks 3-4)

Once traces are working, add metrics. OTEL can export metrics to Prometheus while maintaining correlation with traces.

# Add metrics instrumentation
import time

from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.exporter.prometheus import PrometheusMetricReader

# Initialize metrics (exposed to Prometheus via the reader)
prometheus_reader = PrometheusMetricReader()
meter_provider = MeterProvider(metric_readers=[prometheus_reader])
metrics.set_meter_provider(meter_provider)
meter = metrics.get_meter(__name__)

# Create custom metrics
order_counter = meter.create_counter(
    "orders_total",
    description="Total number of orders",
    unit="1"
)
order_duration = meter.create_histogram(
    "order_duration_seconds",
    description="Order processing duration",
    unit="s"
)

# Use metrics in code
@app.route('/api/order')
def create_order():
    start = time.time()
    with tracer.start_as_current_span("create_order") as span:
        result = process_order(data)

        # Record metrics with trace context
        order_counter.add(1, {"status": result.status, "product": data['product']})
        order_duration.record(time.time() - start, {"status": result.status})

        return jsonify(result)

Phase 3: Integrate Logs (Weeks 5-6)

The final piece: structured logs with trace correlation. Now you can jump from a trace to related logs instantly.

# Structured logging with trace context
import logging

from opentelemetry import trace
from opentelemetry.sdk._logs import LoggerProvider, LoggingHandler
from opentelemetry.sdk._logs.export import BatchLogRecordProcessor
from opentelemetry.exporter.otlp.proto.grpc._log_exporter import OTLPLogExporter

# Initialize logging
logger_provider = LoggerProvider()
logger_provider.add_log_record_processor(
    BatchLogRecordProcessor(OTLPLogExporter(endpoint="otel-collector:4317"))
)

# Configure Python logging
handler = LoggingHandler(logger_provider=logger_provider)
logging.basicConfig(level=logging.INFO, handlers=[handler])
logger = logging.getLogger(__name__)

# Logs automatically include trace context
@app.route('/api/order')
def create_order():
    with tracer.start_as_current_span("create_order") as span:
        trace_id = span.get_span_context().trace_id

        # This log will be correlated with the trace
        logger.info(
            "Processing order",
            extra={
                "user_id": request.user_id,
                "order_amount": data['amount'],
                "trace_id": trace_id  # Key for correlation!
            }
        )

        result = process_payment(data)

        if result.status == "failed":
            logger.error(
                "Payment failed",
                extra={"reason": result.error, "trace_id": trace_id}
            )

        return jsonify(result)

Sampling Strategies (Critical for Cost Control)

Collecting 100% of traces is expensive and unnecessary. Smart sampling keeps costs low while maintaining visibility.

Head-Based Sampling (Simple)

The sampling decision is made when the trace starts, e.g. randomly keep 10% of all traces (see the sketch below).

Pro: Low overhead, easy to implement
Con: Might miss interesting (error) traces
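
If you want to start with head-based sampling, here is a minimal sketch using the Python SDK's built-in samplers (the 10% ratio is just an example):

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Keep ~10% of new traces; ParentBased makes child spans follow the
# parent's decision so traces are never partially sampled.
sampler = ParentBased(root=TraceIdRatioBased(0.1))
trace.set_tracer_provider(TracerProvider(sampler=sampler))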

Tail-Based Sampling (Recommended)

The sampling decision is made after the trace completes: keep 100% of errors, 100% of slow requests, and 10% of fast, successful requests.

Pro: Optimal cost/value ratio
Con: Requires buffering traces in collector

# Advanced tail sampling config
processors:
  tail_sampling:
    decision_wait: 10s                 # Wait for trace to complete
    num_traces: 100000                 # Buffer size
    expected_new_traces_per_sec: 1000
    policies:
      # ALWAYS keep errors
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      # ALWAYS keep slow requests
      - name: slow-requests
        type: latency
        latency:
          threshold_ms: 1000
      # Keep traces for specific critical endpoints
      - name: critical-endpoints
        type: string_attribute
        string_attribute:
          key: http.route
          values: ["/api/payment", "/api/checkout"]
      # Sample 10% of everything else
      - name: probabilistic
        type: probabilistic
        probabilistic:
          sampling_percentage: 10

Cost Comparison: OTEL vs Commercial APM

┌─────────────────────────┬────────────────┬──────────────────┬─────────────────┐
│ Volume                  │ OTEL (DIY)     │ Datadog APM      │ New Relic APM   │
├─────────────────────────┼────────────────┼──────────────────┼─────────────────┤
│ 10M spans/month         │ $5-$20         │ $300-$500        │ $400-$600       │
│ (10 services, 1K RPS)   │ (S3 + compute) │                  │                 │
│                         │                │                  │                 │
│ 100M spans/month        │ $50-$150       │ $2,000-$3,000    │ $2,500-$4,000   │
│ (50 services, 10K RPS)  │                │                  │                 │
│                         │                │                  │                 │
│ 1B spans/month          │ $300-$800      │ $15,000-$25,000  │ $20,000-$30,000 │
│ (200 services, 100K RPS)│                │                  │                 │
└─────────────────────────┴────────────────┴──────────────────┴─────────────────┘

Cost breakdown (OTEL, 100M spans/month):
- Tempo storage (S3): $0.023/GB = ~$50/month for ~2TB of traces
- Compute (EKS nodes): 2x c5.xlarge = ~$100/month
- Total: ~$150/month vs $2,000+ for commercial APM
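
To sanity-check these numbers against your own traffic, here is a back-of-the-envelope calculator. The per-span size is an explicit assumption derived from the table above; measure your own, since span sizes vary widely:

# Rough S3 storage cost estimate for traces (all inputs are assumptions)
spans_per_month = 100_000_000
avg_span_size_bytes = 20_000   # ~2 TB / 100M spans, per the table above
s3_price_per_gb = 0.023        # S3 Standard, us-east-1

storage_gb = spans_per_month * avg_span_size_bytes / 1024**3
monthly_cost = storage_gb * s3_price_per_gb
print(f"~{storage_gb:,.0f} GB stored -> ~${monthly_cost:,.0f}/month in S3")
# Add collector/Tempo compute (two small nodes here) for the total.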

Best Practices for Production

1. Start Small, Scale Gradually

Week 1: Instrument 1-2 critical services
Week 2-3: Add 5-10 more services
Week 4+: Roll out to entire fleet

2. Monitor OTEL Collector Health

The Collector exposes its own internal metrics; watch at least these:
  • otelcol_receiver_accepted_spans: spans successfully received
  • otelcol_receiver_refused_spans: spans rejected at the receiver (a rising count means data loss)
  • otelcol_exporter_sent_spans: spans successfully delivered to the backend
  • otelcol_processor_batch_batch_send_size: distribution of batch sizes being sent

3. Use Semantic Conventions

Follow OTEL semantic conventions for span attributes:
http.method, http.status_code, db.system, messaging.destination
This ensures compatibility across tools.
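
In the Python SDK, the opentelemetry-semantic-conventions package provides constants for these names, so a typo doesn't silently break dashboards. A small sketch (the span name and values are illustrative):

from opentelemetry import trace
from opentelemetry.semconv.trace import SpanAttributes

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("checkout") as span:
    # The constants resolve to the standard names ("http.method", "db.system", ...)
    span.set_attribute(SpanAttributes.HTTP_METHOD, "POST")
    span.set_attribute(SpanAttributes.HTTP_STATUS_CODE, 200)
    span.set_attribute(SpanAttributes.DB_SYSTEM, "postgresql")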

Troubleshooting Common Issues

Issue: Traces not appearing in Tempo

Check OTEL Collector logs for export errors. Verify network connectivity to Tempo.

Solution: kubectl logs -n observability otel-collector-xxx | grep ERROR

Issue: High cardinality causing memory issues

Too many unique span names or attribute values (e.g., user IDs embedded in span names).

Solution: Use span attributes, not span names, for high-cardinality data.
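
A quick before/after, with a hypothetical user_id just for illustration:

from opentelemetry import trace

tracer = trace.get_tracer(__name__)
user_id = "12345"  # hypothetical example value

# Bad: every user produces a unique span name (unbounded cardinality)
with tracer.start_as_current_span(f"get_order_user_{user_id}"):
    pass

# Good: one span name, user ID stored as an attribute
with tracer.start_as_current_span("get_order") as span:
    span.set_attribute("user.id", user_id)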

Conclusion: Observability Without Vendor Lock-In

OpenTelemetry gives you the observability of Datadog at 5-10% of the cost, with zero vendor lock-in. Start with traces (highest ROI), add metrics, integrate logs. Use tail-based sampling to control costs.

Next steps: Instrument your most critical service this week. Export to Tempo. See how distributed tracing changes debugging forever.

Need Help With Observability?

We implement production-grade observability: OpenTelemetry instrumentation, Tempo/Prometheus setup, cost optimization, and custom dashboards.
