AI Evaluation Engineer
AI Evaluation Engineers design, build, and operate the measurement infrastructure that determines whether AI systems actually work…
Skill Guide
The systematic design of protocols to assess model outputs or products via human judgment, involving the creation of detailed scoring guides (rubrics), training annotators to apply them consistently, and implementing safeguards to detect and reduce systematic judgment errors.
Scenario
You inherit a 3-point rubric (Good/Okay/Bad) for image captioning with low inter-annotator agreement. The 'Okay' category is vague.
Scenario
A team of 10 new annotators is onboarded for a long-term content moderation project. Initial agreement on a complex policy is only 65%.
Scenario
A human evaluation of an LLM's responses shows potential bias: annotators from Region A consistently rate responses as more 'helpful' than annotators from Region B for the same queries.
Inter-annotator agreement (IAA) metrics, such as Cohen's kappa or Krippendorff's alpha, quantify how consistently annotators apply a rubric. The Rubric Design Framework provides structure for creating clear evaluation guides. The Bias Audit Framework is used to systematically test for and identify skew in ratings. The Calibration Loop is the iterative process for aligning annotators' understanding of the rubric.
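As an illustration of what an IAA metric measures, here is a minimal sketch of Cohen's kappa for two annotators, written in plain Python. The function name `cohens_kappa` and the sample labels are assumptions made for this example, not part of any particular annotation tool. Kappa corrects raw agreement for the agreement expected by chance, which is why it is usually reported alongside a plain percentage.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two annotators who labelled the same items in the same order."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two equal-length, non-empty rating lists")
    n = len(ratings_a)

    # Observed agreement: fraction of items where both labels match.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Chance agreement, from each annotator's marginal label frequencies.
    freq_a = Counter(ratings_a)
    freq_b = Counter(ratings_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if p_e == 1.0:
        # Both annotators used one identical label for every item.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical ratings on the 3-point captioning rubric.
annotator_1 = ["Good", "Okay", "Bad", "Okay", "Good", "Okay"]
annotator_2 = ["Good", "Bad", "Bad", "Okay", "Okay", "Okay"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(f"raw agreement = {4 / 6:.2f}, kappa = {kappa:.2f}")
```

In this toy data the annotators agree on 4 of 6 items (67%), but kappa is about 0.48 once chance agreement is removed. For more than two annotators or missing ratings, a metric such as Krippendorff's alpha is the more usual choice.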
Use Label Studio or Prodigy to build and host custom annotation interfaces with integrated agreement calculation. Use SageMaker Ground Truth for scalable, managed annotation workflows. Use Qualtrics to run structured annotator feedback surveys and to collect demographic data for bias analysis.
Answer Strategy
Use a structured problem-solving framework (Diagnose, Isolate, Remedy, Verify). Sample Answer: 'First, I'd diagnose and isolate the cause by analyzing disagreements, checking whether they cluster on specific rubric criteria, data types, or annotator cohorts. Next, I'd review the rubric and recent data samples for ambiguity or drift. Based on the findings, I'd either conduct a targeted calibration session with a new gold set or revise the rubric with clearer anchors and examples. Finally, I'd roll out the fix in a controlled pilot, measure the change in IAA, and update the standard operating procedure.'
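The "isolate" step can be sketched in a few lines of Python. This is a minimal example under stated assumptions: the record layout `(item_id, group, label)` and the function name `disagreement_rate_by_group` are hypothetical, and `group` can stand for a rubric criterion, a data type, or an annotator cohort. An item counts as disputed when its annotators gave more than one distinct label.

```python
from collections import defaultdict


def disagreement_rate_by_group(records):
    """Return {group: fraction of items in that group with more than one distinct label}.

    records: iterable of (item_id, group, label) tuples, one per annotation.
    """
    labels_seen = defaultdict(set)
    for item_id, group, label in records:
        labels_seen[(group, item_id)].add(label)

    totals = defaultdict(int)
    disputed = defaultdict(int)
    for (group, _item_id), labels in labels_seen.items():
        totals[group] += 1
        if len(labels) > 1:
            disputed[group] += 1

    return {group: disputed[group] / totals[group] for group in totals}


# Hypothetical annotations: two annotators per item, grouped by rubric criterion.
records = [
    (1, "accuracy", "Good"), (1, "accuracy", "Good"),
    (2, "accuracy", "Bad"),  (2, "accuracy", "Bad"),
    (1, "fluency", "Good"),  (1, "fluency", "Okay"),
    (2, "fluency", "Okay"),  (2, "fluency", "Okay"),
]

rates = disagreement_rate_by_group(records)
for group, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{group}: {rate:.0%} of items disputed")
```

Sorting groups by disagreement rate shows where to focus the calibration session or rubric revision. Here the disputes sit entirely in the "fluency" criterion.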
Answer Strategy
This question tests bias detection methodology and corrective action. Sample Answer: 'In a sentiment analysis project, I suspected regional bias. I ran a stratified analysis, comparing rating distributions by annotator locale, and confirmed a significant skew (p<0.05) for certain dialects. My mitigation was threefold: 1) I made the rubric more behaviorally anchored, tying ratings to specific lexical cues rather than an overall "feeling". 2) I implemented a qualification quiz focused on those anchors. 3) I set up an automated dashboard to monitor agreement across demographic slices weekly, allowing for rapid intervention.'
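One way to run the stratified comparison described above is a chi-square test of independence on a locale-by-rating contingency table. The sketch below computes the statistic by hand in plain Python. The counts are invented for illustration, the function name `chi_square_independence` is hypothetical, and the test treats every rating as independent, which is a simplification when the same annotators rate many items. It also assumes no expected cell count is zero.

```python
def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table.

    table: list of rows of observed counts (rows = annotator groups, columns = ratings).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rating were independent of group.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Hypothetical counts: rows are Region A and Region B annotators,
# columns are "helpful" and "not helpful" ratings on the same queries.
observed = [
    [70, 30],
    [50, 50],
]

stat, dof = chi_square_independence(observed)
CRITICAL_0_05_DOF_1 = 3.841  # chi-square critical value for alpha = 0.05, 1 degree of freedom
print(f"chi2 = {stat:.2f}, dof = {dof}, significant = {stat > CRITICAL_0_05_DOF_1}")
```

With these counts the statistic is about 8.33 on 1 degree of freedom, above the 3.841 critical value, so the skew would be flagged at the 0.05 level. In practice a library routine such as `scipy.stats.chi2_contingency` returns the p-value directly, and a model that accounts for repeated ratings per annotator would be more defensible.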