AI Long-Context Systems Engineer
An AI Long-Context Systems Engineer designs and builds production systems that exploit large context windows (128K-10M+ tokens) in…
Skill Guide
The systematic evaluation of how accurately a large language model retrieves and cites information from long contexts, and whether it stays internally consistent, without hallucinating or contradicting itself.
Scenario
A 50-page product specification PDF is loaded. The model is asked a specific technical question whose answer is a single sentence buried on page 42.
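One way to turn this scenario into a repeatable test is to plant a known sentence at a controlled depth in the document and check whether the model's answer contains the expected fact. The sketch below is only illustrative: `ask_model` is a hypothetical callable standing in for whatever model client is in use, and the containment check is a deliberately simple grader.

```python
def build_haystack(filler_pages, needle, depth):
    """Insert `needle` into the page list at a relative depth in [0.0, 1.0]."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    position = round(depth * len(filler_pages))
    pages = filler_pages[:position] + [needle] + filler_pages[position:]
    return "\n\n".join(pages)


def retrieval_hit(model_answer, expected_fact):
    """Case-insensitive containment check (a stricter grader may be needed)."""
    return expected_fact.lower() in model_answer.lower()


def run_depth_sweep(ask_model, filler_pages, needle, question, expected_fact, depths):
    """Return {depth: bool} showing where in the document retrieval succeeds.

    `ask_model(context, question) -> str` is an assumed interface, not a real API.
    """
    results = {}
    for depth in depths:
        context = build_haystack(filler_pages, needle, depth)
        answer = ask_model(context, question)
        results[depth] = retrieval_hit(answer, expected_fact)
    return results
```

Sweeping several depths rather than testing page 42 alone helps show whether failures cluster in a particular region of the context.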
Scenario
An LLM-powered research assistant synthesizes information from 10 conflicting internal reports to create a market analysis summary. Each claim must be traceable.
Scenario
A financial services firm deploys an LLM to generate earnings call summaries. A single hallucinated figure can cause regulatory and reputational damage.
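A cheap first line of defense for this scenario is to check that every number in the generated summary also appears in the source transcript. The sketch below is a minimal version under strong assumptions: it compares numeric strings exactly, so it ignores units and rounding (for example, "4.1" and "4.10" would not match), and the function names are illustrative.

```python
import re

# Matches integers and decimals, allowing thousands separators (e.g. 4,200.5).
NUMBER_PATTERN = re.compile(r"\d[\d,]*(?:\.\d+)?")


def extract_figures(text):
    """Return the set of numeric strings in `text`, with commas removed."""
    return {match.replace(",", "") for match in NUMBER_PATTERN.findall(text)}


def unsupported_figures(summary, source):
    """Figures that appear in the summary but nowhere in the source."""
    return extract_figures(summary) - extract_figures(source)


source = "Revenue was $4,200 million, up 12% year over year. EPS was 1.35."
summary = "Revenue reached $4,200 million (+12%), with EPS of 1.53."
flagged = unsupported_figures(summary, source)  # any non-empty result needs review
```

A non-empty result does not prove a hallucination (the summary may have rounded or converted a value), but it does identify the figures a human reviewer should look at first.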
Ragas provides end-to-end metrics for RAG faithfulness. LangSmith offers observability and custom evaluator pipelines. BERTScore is used for semantic similarity in citation checking.
Needle in a Haystack (NIAH) tests pure retrieval fidelity. TruthfulQA evaluates resistance to common misconceptions. MuSiQue stresses consistency across multi-step reasoning.
Chain-of-Verification (CoVe) is a prompting technique in which the model drafts an answer, then generates and answers verification questions to check its own claims. FActScore breaks complex answers into atomic facts for granular verification against sources.
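The atomic-fact approach described above can be sketched as a simple precision score: the fraction of atomic facts that the source supports. This is not the reference implementation of any library. It assumes the answer has already been split into atomic facts upstream (typically by an LLM), and `naive_is_supported` is a word-overlap placeholder for a real verifier.

```python
import re


def naive_is_supported(fact, source):
    """Placeholder verifier: every word of the fact must occur in the source."""
    fact_words = set(re.findall(r"\w+", fact.lower()))
    source_words = set(re.findall(r"\w+", source.lower()))
    return fact_words <= source_words


def factual_precision(atomic_facts, source, is_supported=naive_is_supported):
    """Return (score, unsupported_facts), where score is supported / total."""
    if not atomic_facts:
        raise ValueError("at least one atomic fact is required")
    unsupported = [fact for fact in atomic_facts if not is_supported(fact, source)]
    score = (len(atomic_facts) - len(unsupported)) / len(atomic_facts)
    return score, unsupported


source = "The warranty lasts 24 months from purchase."
facts = ["The warranty lasts 24 months", "The device weighs 3 kg"]
score, unsupported = factual_precision(facts, source)
```

Returning the unsupported facts alongside the score keeps the result actionable, since a reviewer can see exactly which claims failed.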
Answer Strategy
Use a structured debugging framework: 1) Isolate the failure (is it retrieval, citation, or generation?), 2) Design a targeted test (e.g., a 'needle' code in a haystack of legal text), 3) Implement a verification layer. Sample Answer: 'I'd first run a failure analysis using a sample of hallucinated codes against the source corpus to determine if the error was in retrieval or generation. I'd then implement a two-stage check: first, a retrieval audit ensuring the correct passage is pulled, and second, a post-generation verifier using an NLI model to confirm the generated claim is entailed by the cited source. This pipeline would be integrated as a deployment gate.'
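The two-stage check in the sample answer could be wired together roughly as follows. This is a sketch under stated assumptions: `entails(premise, claim)` is a hypothetical callable standing in for an NLI model, the `EvalCase` fields are illustrative, and the 0.99 pass-rate threshold is an arbitrary placeholder rather than a recommended value.

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    expected_passage_id: str     # passage known to contain the correct answer
    retrieved_passage_ids: list  # what the retriever actually returned
    claim: str                   # claim the model generated
    cited_passage_text: str      # text of the passage the model cited


def classify_failure(case, entails):
    """Return 'retrieval', 'generation', or 'ok' for a single case."""
    # Stage 1: retrieval audit. Was the correct passage pulled at all?
    if case.expected_passage_id not in case.retrieved_passage_ids:
        return "retrieval"
    # Stage 2: post-generation check. Does the cited text entail the claim?
    if not entails(case.cited_passage_text, case.claim):
        return "generation"
    return "ok"


def deployment_gate(cases, entails, min_pass_rate=0.99):
    """Return (passed, counts) over a test suite; threshold is an assumption."""
    if not cases:
        raise ValueError("the test suite must contain at least one case")
    counts = {"ok": 0, "retrieval": 0, "generation": 0}
    for case in cases:
        counts[classify_failure(case, entails)] += 1
    pass_rate = counts["ok"] / len(cases)
    return pass_rate >= min_pass_rate, counts
```

Keeping retrieval and generation failures in separate buckets mirrors the first step of the strategy: the counts show which stage to fix before any verification layer is tuned.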
Answer Strategy
Tests stakeholder management and the ability to define nuanced success metrics. Sample Answer: 'For a legal contract analysis tool, we defined faithfulness in tiers: 100% citation accuracy for monetary values and dates (non-negotiable), and semantic faithfulness for clause interpretation (measured by human expert agreement scores). I facilitated a workshop with legal and product teams to align on these thresholds, embedding them into our evaluation dashboard as clear KPIs.'
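Tiered thresholds like these are easier to enforce when they live in one explicit configuration that the evaluation dashboard reads. The sketch below assumes hypothetical field-type and metric names. The 1.0 thresholds reflect the 100% requirement in the sample answer, while the 0.85 expert-agreement threshold is an invented placeholder that stakeholders would need to set.

```python
# Hypothetical tier configuration; names and the 0.85 value are assumptions.
TIERS = {
    "monetary_value": {"metric": "citation_accuracy", "threshold": 1.0},
    "date": {"metric": "citation_accuracy", "threshold": 1.0},
    "clause_interpretation": {"metric": "expert_agreement", "threshold": 0.85},
}


def evaluate_tiers(measured, tiers=TIERS):
    """Return {field_type: bool}; a missing measurement counts as a failure."""
    report = {}
    for field_type, rule in tiers.items():
        value = measured.get(field_type)
        report[field_type] = value is not None and value >= rule["threshold"]
    return report


measured = {"monetary_value": 1.0, "date": 0.98, "clause_interpretation": 0.9}
report = evaluate_tiers(measured)
```

Treating a missing measurement as a failure is a design choice: it prevents a tier from passing silently because nobody evaluated it.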