AI Tool Use Systems Engineer
An AI Tool Use Systems Engineer architects, builds, and maintains the complex systems that allow organizations to reliably leverag…
Skill Guide
The practice of instrumenting large language model (LLM) applications to collect, analyze, and act upon data related to their performance, behavior, cost, and output quality across the entire inference lifecycle.
Scenario
You have a Node.js script that calls the OpenAI API for text completion. You need visibility into its performance and cost.
Scenario
Your Retrieval-Augmented Generation (RAG) system answers questions from a knowledge base. You need to detect when answer quality degrades after a change to the chunking strategy or embedding model.
Scenario
Your company provides an LLM-powered SaaS product to multiple clients. Each client's usage must be isolated, cost-tracked, and subject to different content and safety policies.
OpenTelemetry is the vendor-neutral standard for collecting telemetry data. LangSmith and LangFuse are LLM-specific platforms for tracing, evaluation, and debugging chains/agents. Helicone/Portkey are API gateways providing instant cost/latency dashboards. Prometheus+Grafana is the classic stack for metrics and alerting. W&B Weave focuses on ML experiment tracking and evaluation for generative models.
The Three Pillars form the core data model for any observability system. Semantic similarity measures output consistency without ground truth. HITL uses user thumbs-up/down signals to create labeled datasets. Cost attribution models allocate expenses to specific users, features, or prompt templates using logged token counts and model pricing.
Answer Strategy
Structure the answer using the observability pillars. Start with Metrics to confirm the drop and check correlated system changes. Then use Traces to inspect individual bad queries, comparing the old vs. new retrieved context chunks (embeddings). Finally, use Logs to evaluate the final answer quality. Sample answer: 'I'd start by segmenting the engagement drop in our dashboards to confirm it correlates with the model deployment. I'd then trace a sample of low-engagement queries through the new pipeline, comparing the retrieved document chunks against what the old model would have retrieved. A shift in the embedding space could surface irrelevant context. I'd run a batch evaluation on our golden test set to quantify the retrieval quality drop, which would give us a measurable signal to decide if we need to rollback.'
Answer Strategy
Tests pragmatic system design and business acumen. The answer should show prioritization based on risk and cost. Sample answer: 'In a high-volume, real-time translation service, logging every full input/output for monitoring was inflating storage costs by 40%. I implemented a tiered approach: 100% of requests get lightweight metrics and error logs. We then sample 5% of requests for full input/output logging and quality evaluation. For the sampled set, we used a smaller, distilled model to score semantic similarity against a human-evaluated baseline, keeping costs low. This gave us statistically significant quality signals without breaking the bank.'
1 career found
Try a different search term.