
Skill Guide

Distributed tracing with OpenTelemetry adapted for GenAI semantic conventions

This skill covers implementing OpenTelemetry's distributed tracing framework, extended with the GenAI semantic conventions, to capture, correlate, and visualize the execution flow and performance of Large Language Model (LLM) inference calls, retrieval-augmented generation (RAG) pipelines, and multi-agent systems.

This skill gives engineering teams observability into complex GenAI stacks. Tracing token usage and latency hotspots supports cost management, and trace data makes it practical to tune and debug non-deterministic AI systems, which is essential for production-scale deployment.
1 Careers
1 Categories
9.1 Avg Demand
15% Avg AI Risk

How to Learn Distributed tracing with OpenTelemetry adapted for GenAI semantic conventions

1. Master OpenTelemetry core concepts: traces, spans, context propagation, and the Collector pipeline. 2. Study the GenAI Semantic Conventions specification (e.g., `gen_ai.system`, `gen_ai.request.model`, and `gen_ai.usage.input_tokens`, which older releases of the conventions call `gen_ai.usage.prompt_tokens`). 3. Practice instrumenting a simple Python Flask or FastAPI application that makes a single call to an LLM API, using `opentelemetry-sdk` and the `opentelemetry-exporter-otlp` exporter.
1. Instrument a multi-service application (e.g., a RAG pipeline with a vector DB retrieval step and an LLM call) to trace the entire request lifecycle. 2. Reproduce common failure scenarios (model timeouts, high token cost alerts, and context window errors) and practice diagnosing them from the traces. 3. Use a Jaeger or Grafana Tempo backend to analyze traces, focusing on identifying latency bottlenecks and correlating the `gen_ai.usage.*` span attributes with span durations. Avoid over-instrumenting low-value internal functions, which adds noise without adding insight.
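
As an illustration of the high token cost scenario, here is a minimal sketch that estimates cost from the token attributes on exported spans and returns the spans that exceed a budget. The price table, the model name `example-model`, and the plain-dict span shape are assumptions made for the example, not part of OpenTelemetry. The token attribute names are the current `gen_ai.usage.input_tokens` and `gen_ai.usage.output_tokens`; older releases of the conventions call them `gen_ai.usage.prompt_tokens` and `gen_ai.usage.completion_tokens`.

```python
# Hypothetical USD prices per 1,000 tokens; real prices depend on provider and model.
PRICE_PER_1K = {"example-model": {"input": 0.5, "output": 1.5}}


def span_cost_usd(attributes):
    """Estimate the cost of one LLM span from its GenAI attributes."""
    price = PRICE_PER_1K[attributes["gen_ai.request.model"]]
    input_tokens = attributes.get("gen_ai.usage.input_tokens", 0)
    output_tokens = attributes.get("gen_ai.usage.output_tokens", 0)
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1000


def over_budget(spans, budget_usd):
    """Return the spans (attribute dicts) whose estimated cost exceeds the budget."""
    return [attrs for attrs in spans if span_cost_usd(attrs) > budget_usd]
```

In a real deployment the same check would more likely run as a query or alert rule in the tracing backend; the sketch only shows the arithmetic.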
1. Architect a cross-team observability strategy, defining standards for custom span attributes for business-specific GenAI events (e.g., `app.genai.workflow_type`). 2. Integrate trace data with metrics (Prometheus) and logs (Loki) in a full-stack observability platform (Grafana) for holistic root cause analysis. 3. Develop automated alerting rules based on trace-derived GenAI metrics (e.g., p99 latency for a specific model, cost-per-query anomalies). Mentor teams on semantic convention compliance and trace sampling strategies.
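
A trace-derived alert such as "p99 latency for a specific model" can be sketched as follows. This is a simplified illustration, assuming span durations have already been exported as dicts with hypothetical `model` and `duration_ms` keys; it uses the nearest-rank percentile definition, which may differ slightly from the interpolation a given backend applies.

```python
from collections import defaultdict


def p99(values):
    """Nearest-rank 99th percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    # Integer form of ceil(0.99 * n), avoiding floating-point rounding.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def models_breaching(spans, threshold_ms):
    """Map each model whose p99 span duration exceeds threshold_ms to that p99."""
    durations = defaultdict(list)
    for span in spans:
        durations[span["model"]].append(span["duration_ms"])
    result = {}
    for model, values in durations.items():
        value = p99(values)
        if value > threshold_ms:
            result[model] = value
    return result
```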

Practice Projects

Beginner
Project

Single-LLM-Call Tracing

Scenario

You have a Python service that calls the OpenAI API. You need to trace the request, see the latency, and log the token usage.

How to Execute
1. Install `opentelemetry-sdk`, `opentelemetry-exporter-otlp`, and `opentelemetry-instrumentation-openai`. 2. Initialize the OTel SDK with a `BatchSpanProcessor` and an OTLP exporter pointing to a local Jaeger instance. 3. Enable the instrumentor or manually instrument the OpenAI API call, ensuring that `gen_ai.system`, `gen_ai.request.model`, and the token usage attributes are set on the span. Current releases of the conventions name the token attributes `gen_ai.usage.input_tokens` and `gen_ai.usage.output_tokens`; older releases use `gen_ai.usage.prompt_tokens` and `gen_ai.usage.completion_tokens`. 4. Run the service, make a request, and inspect the generated trace in the Jaeger UI.
Intermediate
Project

RAG Pipeline End-to-End Tracing

Scenario

Your system retrieves documents from a vector database (Pinecone/Milvus) and passes them as context to an LLM for generation. You need to trace the full pipeline to debug latency or irrelevant context.

How to Execute
1. Instrument the retrieval service to create a child span for the vector DB query. Tag it with `db.system` and `db.operation` (named `db.system.name` and `db.operation.name` in the stable database conventions). 2. Pass the trace context to the LLM service using W3C Trace Context headers. 3. In the LLM service, create a new span for the inference call and set the relevant GenAI semantic convention attributes. 4. In your tracing backend, filter by trace ID to visualize the full tree: Parent (API Request) -> Child (Vector DB Query) -> Child (LLM Inference). Analyze how time is distributed across the spans and compare retrieval scores with final generation quality.
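
The W3C Trace Context header used in step 2 is called `traceparent` and has the form `version-traceid-parentid-flags`. In practice the OpenTelemetry propagators read and write it for you; the standard-library sketch below only illustrates the wire format. The helper names are hypothetical, and it handles version `00` only.

```python
import re
import secrets

# version "00", 32-hex trace-id, 16-hex parent-id, 2-hex trace-flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(trace_id=None, sampled=True):
    """Build a traceparent value with a fresh span ID, reusing trace_id if given."""
    trace_id = trace_id or secrets.token_hex(16)
    span_id = secrets.token_hex(8)
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(header):
    """Return the header's fields as a dict, or None if it is malformed."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are invalid per the specification.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 1),
    }


def child_traceparent(incoming):
    """Continue the caller's trace downstream, or start a new one if absent/invalid."""
    parent = parse_traceparent(incoming)
    if parent is None:
        return new_traceparent()
    return new_traceparent(trace_id=parent["trace_id"], sampled=parent["sampled"])
```

The retrieval service would send the result of `child_traceparent` as the `traceparent` header on its request to the LLM service, which keeps both services' spans under the same trace ID.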
Advanced
Project

Multi-Agent System Observability

Scenario

You oversee a system where multiple specialized AI agents (e.g., a planner, a coder, a reviewer) collaborate via message passing to complete a complex task. Failures are difficult to diagnose.

How to Execute
1. Define a custom semantic convention for the agent framework (e.g., `app.agent.id`, `app.agent.role`, `app.agent.message.type`). 2. Implement context propagation across agent message queues (e.g., using message headers for Kafka/RabbitMQ). 3. Create a centralized dashboard that aggregates traces by task ID, showing the sequence of agent invocations, their individual LLM calls, and any tool-use actions. 4. Implement trace-based anomaly detection to flag when an agent's reasoning span (time between receiving a message and sending a response) deviates significantly from baseline, indicating a potential loop or failure.
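
One simple way to approach step 4 is a z-score check of the latest reasoning-span duration against each agent's historical baseline. The z-score method, the threshold of 3, and the function names are assumptions of this sketch; the guide itself only asks that significant deviation from baseline be flagged, and production systems may prefer more robust statistics.

```python
from statistics import mean, stdev


def is_anomalous(baseline_ms, observed_ms, z_threshold=3.0):
    """Flag observed_ms if it lies more than z_threshold std devs from the baseline mean."""
    if len(baseline_ms) < 2:
        return False  # not enough history to estimate spread
    mu = mean(baseline_ms)
    sigma = stdev(baseline_ms)
    if sigma == 0:
        return observed_ms != mu
    return abs(observed_ms - mu) / sigma > z_threshold


def flag_agents(baselines, latest):
    """Return the sorted IDs of agents whose latest reasoning span is anomalous.

    baselines: agent ID -> list of past reasoning-span durations (ms)
    latest:    agent ID -> most recent reasoning-span duration (ms)
    """
    return sorted(
        agent
        for agent, observed in latest.items()
        if agent in baselines and is_anomalous(baselines[agent], observed)
    )
```

For example, with baselines of roughly 100 ms for a planner and 500 ms for a coder, a 400 ms planner span would be flagged while a 505 ms coder span would not.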

Tools & Frameworks

Software & Platforms

OpenTelemetry SDK (Python, JS, Go); Jaeger / Grafana Tempo (Trace Backends); Grafana (Visualization); OpenTelemetry Collector (Pipeline); OpenAI/Anthropic SDK Instrumentors

The OTel SDKs are used for instrumentation. The trace backends store and query traces, and Grafana visualizes them. The Collector processes, filters, and routes telemetry data. Provider-specific instrumentors automatically set the semantic convention attributes on spans for popular LLM providers.

Standards & Specifications

OpenTelemetry Semantic Conventions (GenAI); W3C Trace Context; OpenTelemetry Protocol (OTLP)

The GenAI semantic conventions are the schema you must follow for consistent, queryable data. W3C Trace Context ensures trace propagation works across HTTP/gRPC boundaries. OTLP is the standard wire format for sending telemetry data to the collector or backend.

Interview Questions

Answer Strategy

Structure your answer around the OODA loop (Observe, Orient, Decide, Act) applied to observability. Demonstrate knowledge of both the standard tracing workflow and GenAI-specific signals.

Answer Strategy

This tests leadership, communication, and change management skills. Focus on the 'how' more than the 'what'.
