AI Observability Engineer
An AI Observability Engineer designs, builds, and maintains monitoring, tracing, and alerting systems purpose-built for AI and ML …
Skill Guide
This skill covers implementing OpenTelemetry's distributed tracing framework, extended with the GenAI semantic conventions, to capture, correlate, and visualize the execution flow and performance of large language model (LLM) inference calls, retrieval-augmented generation (RAG) pipelines, and multi-agent systems.
Scenario
You have a Python service that calls the OpenAI API. You need to trace the request, see the latency, and log the token usage.
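As a rough illustration of what such a trace might record, the sketch below wraps one LLM call in a span that captures latency and token usage. It uses only the standard library as a stand-in for the OpenTelemetry SDK, and `fake_chat_completion` is a hypothetical placeholder for the real provider call. The attribute names follow the GenAI semantic conventions as commonly documented; those conventions are still evolving, so check them against the current specification.

```python
import time
from contextlib import contextmanager

FINISHED_SPANS = []  # stand-in for a span exporter (assumption, not an OTel API)


@contextmanager
def span(name, attributes=None):
    """Record a named span with attributes and a wall-clock duration."""
    record = {"name": name, "attributes": dict(attributes or {})}
    start = time.perf_counter()
    try:
        yield record
    finally:
        record["duration_ms"] = (time.perf_counter() - start) * 1000.0
        FINISHED_SPANS.append(record)


def fake_chat_completion(model, prompt):
    """Hypothetical provider call; a real client would do network I/O here."""
    return {
        "text": "ok",
        "usage": {"prompt_tokens": len(prompt.split()), "completion_tokens": 1},
    }


def traced_chat(model, prompt):
    """Call the model inside a span and copy token usage onto the span."""
    attrs = {"gen_ai.operation.name": "chat", "gen_ai.request.model": model}
    with span("chat " + model, attrs) as current:
        response = fake_chat_completion(model, prompt)
        usage = response["usage"]
        current["attributes"]["gen_ai.usage.input_tokens"] = usage["prompt_tokens"]
        current["attributes"]["gen_ai.usage.output_tokens"] = usage["completion_tokens"]
        return response["text"]
```

In a real service the `span` helper would be replaced by a tracer from the OpenTelemetry SDK, or by a provider-specific instrumentor that sets these attributes automatically.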
Scenario
Your system retrieves documents from a vector database (Pinecone/Milvus) and passes them as context to an LLM for generation. You need to trace the full pipeline to debug latency or irrelevant context.
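One way to picture the trace for this scenario is a root span for the whole pipeline with child spans for retrieval and generation, so that latency and context quality can be attributed to a stage. The sketch below models that shape with a small stdlib-only tracer rather than the OpenTelemetry SDK. The `retrieve` and `generate` functions, the in-memory corpus, and the `retrieval.*` attribute names are all hypothetical and chosen only for illustration.

```python
import itertools
import time
from contextlib import contextmanager


class MiniTracer:
    """Stdlib stand-in for a tracer: records parent/child links and durations."""

    def __init__(self):
        self.finished = []
        self._stack = []
        self._ids = itertools.count(1)

    @contextmanager
    def span(self, name, attributes=None):
        record = {
            "id": next(self._ids),
            "parent_id": self._stack[-1]["id"] if self._stack else None,
            "name": name,
            "attributes": dict(attributes or {}),
        }
        self._stack.append(record)
        start = time.perf_counter()
        try:
            yield record
        finally:
            record["duration_ms"] = (time.perf_counter() - start) * 1000.0
            self._stack.pop()
            self.finished.append(record)


def retrieve(query, k):
    """Hypothetical vector search returning (doc_id, similarity) pairs."""
    corpus = [("doc-a", 0.91), ("doc-b", 0.42), ("doc-c", 0.17)]
    return corpus[:k]


def generate(query, docs):
    """Hypothetical LLM call that only reports how much context it received."""
    return f"answer based on {len(docs)} documents"


def answer(tracer, query, k=2):
    with tracer.span("rag.pipeline"):
        with tracer.span("rag.retrieve", {"retrieval.top_k": k}) as s:
            hits = retrieve(query, k)
            # A low minimum score is a hint that irrelevant context was passed on.
            s["attributes"]["retrieval.returned"] = len(hits)
            s["attributes"]["retrieval.min_score"] = min(score for _, score in hits)
        with tracer.span("rag.generate", {"gen_ai.operation.name": "chat"}):
            return generate(query, [doc for doc, _ in hits])
```

Recording the similarity scores on the retrieval span is a design choice: it lets a slow or low-quality answer be traced back to the retrieval stage without re-running the query.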
Scenario
You oversee a system where multiple specialized AI agents (e.g., a planner, a coder, a reviewer) collaborate via message passing to complete a complex task. Failures are difficult to diagnose.
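A common way to make such failures diagnosable is to carry the trace context inside each message, so that every agent's span joins the same trace and points at the agent that invoked it. The following is a minimal stdlib-only sketch of that idea, not the OpenTelemetry API. The message envelope, the `run_agent` helper, the span names, and the three lambda "agents" are hypothetical.

```python
import secrets

SPANS = []  # stand-in for a span exporter (assumption, not an OTel API)


def start_trace():
    """Create a root context; IDs use the 16-byte / 8-byte hex sizes OTel uses."""
    return {"trace_id": secrets.token_hex(16), "span_id": secrets.token_hex(8)}


def run_agent(name, message, handler):
    """Run one agent inside a span whose parent comes from the incoming message."""
    parent = message["trace_context"]
    span = {
        "trace_id": parent["trace_id"],
        "span_id": secrets.token_hex(8),
        "parent_span_id": parent["span_id"],
        "name": "agent " + name,
        "status": "ok",
    }
    try:
        result = handler(message["payload"])
    except Exception as exc:
        # Mark the failing agent's span so the error is visible in the trace.
        span["status"] = "error"
        span["error"] = repr(exc)
        raise
    finally:
        SPANS.append(span)
    # Forward the context so the next agent becomes a child of this span.
    context = {"trace_id": span["trace_id"], "span_id": span["span_id"]}
    return {"payload": result, "trace_context": context}


def run_pipeline(task):
    message = {"payload": task, "trace_context": start_trace()}
    agents = [
        ("planner", lambda t: "plan for " + t),
        ("coder", lambda p: "code for " + p),
        ("reviewer", lambda c: "approved: " + c),
    ]
    for name, handler in agents:
        message = run_agent(name, message, handler)
    return message["payload"]
```

With the context travelling in the envelope, a failed run can be filtered by `trace_id` and the span with `status` set to `error` identifies which agent broke the chain.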
The OTel SDKs instrument the application code. The Collector processes, filters, and routes telemetry data. Backends store and query traces. Provider-specific instrumentors automatically attach GenAI semantic-convention attributes to calls made to popular LLM providers.
The GenAI semantic conventions are the schema you must follow for consistent, queryable data. W3C Trace Context ensures trace propagation works across HTTP/gRPC boundaries. OTLP is the standard wire format for sending telemetry data to the collector or backend.
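To make the propagation part concrete, the sketch below formats and parses a W3C Trace Context `traceparent` header, the value that carries the trace ID and parent span ID across an HTTP boundary. It is a simplified illustration that accepts only version `00` and ignores `tracestate`; the function names are hypothetical, and in practice the OpenTelemetry propagators handle this for you.

```python
import re

# version "00" - 32 hex trace-id - 16 hex parent-id - 2 hex trace-flags
_TRACEPARENT = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def format_traceparent(trace_id, span_id, sampled=True):
    """Build a version-00 traceparent value from lowercase hex IDs."""
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(header):
    """Return the parsed context, or None if the header is not valid."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero IDs are defined as invalid by the specification.
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_span_id": span_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }
```

A receiving service would parse the incoming header, start its own span with the parsed `trace_id` and `parent_span_id`, and send a new header containing its own span ID on any outgoing call.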
Answer Strategy
Structure your answer around the OODA loop (Observe, Orient, Decide, Act) applied to observability. Demonstrate knowledge of both the standard tracing workflow and GenAI-specific signals.
Answer Strategy
This tests leadership, communication, and change management skills. Focus on the 'how' more than the 'what'.