Interview Prep
AI Evaluation Engineer Interview Questions
50 expert questions covering beginner fundamentals to advanced AI workflow scenarios. Each answer includes a hint for structured responses.
Beginner
5 questions
A strong answer distinguishes standardized public benchmarks (MMLU, HumanEval) from organization-specific test suites built for custom use cases, and explains when each is appropriate.
BLEU measures precision of n-gram overlap (good for translation), ROUGE measures recall of n-gram overlap (good for summarization); mention their limitations for semantic evaluation.
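The precision/recall distinction can be shown with a minimal sketch. This is not full BLEU or ROUGE (it omits BLEU's brevity penalty and geometric mean over n-gram orders, and ROUGE's variants); the function names and whitespace tokenization are assumptions made for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_overlap(candidate, reference, n=1):
    """Return (precision, recall) of clipped n-gram overlap.

    Precision divides by candidate n-grams (the BLEU-style view);
    recall divides by reference n-grams (the ROUGE-style view).
    """
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    overlap = sum((cand & ref).values())  # Counter & keeps the minimum count
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    return precision, recall


precision, recall = ngram_overlap("the cat sat", "the cat sat on the mat")
print(precision, recall)
```

A short candidate that copies part of the reference scores perfect precision but low recall, which is why neither number alone says much about semantic quality.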
Automated metrics often fail to capture nuance, creativity, factual accuracy, and user preference; human evaluation provides ground truth for calibration of automated systems.
It measures agreement between human annotators (e.g., Cohen's kappa, Krippendorff's alpha); low reliability means the evaluation rubric is ambiguous or annotators need more training.
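For two annotators, Cohen's kappa can be computed by hand as a rough sketch. The function name and example labels below are illustrative assumptions; in practice a library implementation would normally be used.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    # Fraction of items where the annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


annotator_1 = ["good", "good", "bad", "bad"]
annotator_2 = ["good", "bad", "bad", "bad"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Here raw agreement is 75%, but kappa is lower because half of that agreement would be expected by chance.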
Contamination occurs when evaluation data leaks into training data, inflating benchmark scores; a great answer mentions deduplication strategies and held-out test sets.
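One common way to screen for leakage is word-level n-gram overlap between evaluation items and training documents. The sketch below assumes whitespace tokenization and an arbitrary window size; the function names are hypothetical, and a production check would need normalization and a scalable index.

```python
def word_ngrams(text, n):
    """Set of lowercase word n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contaminated_items(eval_items, training_docs, n=8):
    """Return eval items that share at least one n-gram with any training doc."""
    train_grams = set()
    for doc in training_docs:
        train_grams |= word_ngrams(doc, n)
    return [item for item in eval_items if word_ngrams(item, n) & train_grams]


training_docs = ["the quick brown fox jumps over the lazy dog"]
eval_items = ["A quick brown fox appeared", "completely unrelated question text"]
flagged = contaminated_items(eval_items, training_docs, n=3)
print(flagged)
```

Flagged items can then be removed from the benchmark or reported separately, so that scores are also available on the clean subset.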
Intermediate
10 questions
Cover groundedness checks against retrieved context, factual consistency verification against knowledge bases, and a scoring rubric that separates 'unfaithful to context' from 'factually incorrect'.
Discuss prompt design for evaluation, positional bias, verbosity bias, self-preference bias; mitigations include calibration against human labels, pairwise comparison with position randomization, and ensemble judges.
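Position randomization can be sketched as a thin wrapper around the judge call. The `judge(prompt, first, second)` interface returning `"first"` or `"second"` is a hypothetical stand-in for an LLM call, and `prefers_longer` is a toy judge used only to make the example runnable.

```python
import random


def randomized_pairwise(judge, prompt, answer_a, answer_b, rng=random):
    """Ask the judge with A/B order chosen at random, then map the verdict back.

    `judge` is assumed to return "first" or "second" (hypothetical interface).
    """
    swapped = rng.random() < 0.5
    first, second = (answer_b, answer_a) if swapped else (answer_a, answer_b)
    picked_first = judge(prompt, first, second) == "first"
    if picked_first:
        return "B" if swapped else "A"
    return "A" if swapped else "B"


def prefers_longer(prompt, first, second):
    """Toy judge with a verbosity preference but no position preference."""
    return "first" if len(first) >= len(second) else "second"


rng = random.Random(1)
verdicts = [
    randomized_pairwise(prefers_longer, "Explain caching.", "a long detailed answer", "short", rng)
    for _ in range(20)
]
print(set(verdicts))
```

Randomizing the order spreads any positional bias evenly across both candidates; it does not remove other biases, such as the verbosity preference this toy judge displays.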
Mention paired t-tests or Wilcoxon signed-rank tests for per-sample comparisons, bootstrap confidence intervals for aggregate metrics, and the importance of sufficient sample size.
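A paired bootstrap confidence interval for the mean score difference between two models might look like the following. The function name, resample count, and example scores are assumptions for illustration; the percentile method shown is the simplest bootstrap variant.

```python
import random
import statistics


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-sample difference (B minus A)."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("scores must be paired and non-empty")
    rng = random.Random(seed)
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    resampled_means = []
    for _ in range(n_resamples):
        # Resample per-example differences with replacement, keeping pairs intact.
        sample = [rng.choice(diffs) for _ in diffs]
        resampled_means.append(statistics.fmean(sample))
    resampled_means.sort()
    lower = resampled_means[int((alpha / 2) * n_resamples)]
    upper = resampled_means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(diffs), lower, upper


model_a = [0.1, 0.2, 0.3, 0.4]
model_b = [0.3, 0.5, 0.4, 0.9]
mean_diff, ci_lower, ci_upper = paired_bootstrap_ci(model_a, model_b)
print(mean_diff, ci_lower, ci_upper)
```

If the interval excludes zero, the improvement is unlikely to be resampling noise, though four examples is far too few for a real comparison.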
Discuss medical accuracy benchmarks, refusal testing for dangerous medical advice, evaluation of disclaimers, demographic bias testing, and alignment with clinical guidelines.
Intrinsic measures model quality on standalone tasks (perplexity, benchmark accuracy); extrinsic measures how the model performs in a downstream application (task completion rate, user satisfaction).
Cover golden test cases with expected outputs, scoring thresholds for pass/fail, automated CI integration, and strategies for handling legitimately changed behavior vs. true regressions.
Discuss rubric-based human evaluation, pairwise preference comparisons, multi-dimensional scoring (coherence, relevance, fluency), and using LLM judges calibrated to human preferences.
Mention task completion rate across turns, conversation coherence, memory retention, recovery from errors, user satisfaction surveys, and turn-level vs. conversation-level metrics.
Discuss controlled experiments varying demographic attributes, toxicity and sentiment scoring disaggregated by group, disparate impact analysis, and intersectional evaluation.
Cover retrieval precision@k, recall@k, MRR, nDCG for retrieval; context relevance and faithfulness for downstream generation; end-to-end answer correctness as the ultimate metric.
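The retrieval-side metrics are simple enough to write out. This sketch assumes binary relevance and hypothetical document IDs; MRR is the mean of `reciprocal_rank` over all queries.

```python
import math


def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(doc in relevant for doc in retrieved[:k]) / k


def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents found in the top-k results."""
    if not relevant:
        return 0.0
    return sum(doc in relevant for doc in retrieved[:k]) / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0 if none is retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved, relevant, k):
    """nDCG with binary gains: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(retrieved[:k], start=1)
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


retrieved = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
p_at_2 = precision_at_k(retrieved, relevant, 2)
r_at_2 = recall_at_k(retrieved, relevant, 2)
rr = reciprocal_rank(retrieved, relevant)
ndcg = ndcg_at_k(retrieved, relevant, 4)
print(p_at_2, r_at_2, rr, round(ndcg, 3))
```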
Advanced
10 questions
Discuss task-level success metrics, safety constraint violations, efficiency metrics (steps to completion, cost), partial credit scoring, and the challenge of open-ended state spaces.
Cover human blind pairwise comparisons, Bradley-Terry model, dynamic leaderboard; strengths include real user preferences and contamination resistance; limitations include sampling bias and lack of granular capability breakdown.
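The Bradley-Terry model assigns each model a strength s, with P(i beats j) = s_i / (s_i + s_j). A minimal fit using the standard iterative (minorization-maximization) update is sketched below; the win counts and model names are made up for illustration, and this is not a description of how any particular leaderboard computes its ratings.

```python
def fit_bradley_terry(wins, n_iters=200):
    """Fit Bradley-Terry strengths from pairwise win counts.

    wins[(i, j)] is the number of comparisons in which i beat j.
    """
    players = sorted({p for pair in wins for p in pair})
    if not players:
        return {}
    strength = {p: 1.0 for p in players}
    for _ in range(n_iters):
        updated = {}
        for i in players:
            total_wins = sum(count for (winner, _), count in wins.items() if winner == i)
            denom = 0.0
            for j in players:
                if j == i:
                    continue
                games = wins.get((i, j), 0) + wins.get((j, i), 0)
                if games:
                    denom += games / (strength[i] + strength[j])
            updated[i] = total_wins / denom if denom else strength[i]
        # Rescale so strengths average to 1; only ratios are identifiable.
        norm = sum(updated.values())
        strength = {p: value * len(players) / norm for p, value in updated.items()}
    return strength


wins = {
    ("A", "B"): 7, ("B", "A"): 3,
    ("B", "C"): 6, ("C", "B"): 4,
    ("A", "C"): 8, ("C", "A"): 2,
}
strengths = fit_bradley_terry(wins)
p_a_beats_c = strengths["A"] / (strengths["A"] + strengths["C"])
print(strengths, round(p_a_beats_c, 3))
```

A single scalar per model is what makes the leaderboard easy to read, and also why it gives no granular capability breakdown.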
Analyze error categories in the domain benchmark, check for contamination in MMLU, compare difficulty distributions, and present the gap as actionable insight with specific failure mode taxonomy.
Design automated constraint checkers for each dimension, weight them by importance, create a composite score, and validate against human judgment on constraint satisfaction.
Discuss models optimizing for evaluation metrics without genuine capability improvement; mitigations include held-out test sets, adversarial evaluation, metric rotation, and human evaluation as anchor.
Discuss real-world code repositories (SWE-bench), code quality metrics (readability, security vulnerabilities, edge case handling), human developer evaluation, and task diversity.
Cover vision-language alignment evaluation, cross-modal grounding accuracy, multimodal hallucination detection, and the challenge of generating reliable automated scores for visual outputs.
Discuss IFEval-style constraint-based evaluation, taxonomy of instruction types (format, content, style, length), automated compliance checking, and stratified analysis by instruction complexity.
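Automated compliance checking in this style amounts to one deterministic checker per verifiable instruction. The checkers and constraint names below are hypothetical examples, not the actual IFEval implementation.

```python
import json


def within_word_limit(response, limit):
    """Length constraint: at most `limit` whitespace-separated words."""
    return len(response.split()) <= limit


def is_valid_json(response):
    """Format constraint: the whole response parses as JSON."""
    try:
        json.loads(response)
    except json.JSONDecodeError:
        return False
    return True


def mentions(response, keyword):
    """Content constraint: the keyword appears (case-insensitive)."""
    return keyword.lower() in response.lower()


def compliance_report(response, constraints):
    """Run every checker and return per-constraint results plus the pass rate."""
    results = {name: check(response) for name, check in constraints.items()}
    pass_rate = sum(results.values()) / len(results) if results else 0.0
    return results, pass_rate


response = '{"summary": "Latency dropped after the cache fix."}'
constraints = {
    "format:json": is_valid_json,
    "length:max_20_words": lambda r: within_word_limit(r, 20),
    "content:mentions_throughput": lambda r: mentions(r, "throughput"),
}
results, pass_rate = compliance_report(response, constraints)
print(results, round(pass_rate, 2))
```

Prefixing constraint names with their type (format, length, content) makes it straightforward to stratify pass rates by instruction category afterwards.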
Discuss the evaluation-production gap, user preference distribution shifts, missing evaluation dimensions, and the need for continuous calibration between offline evaluation and online metrics.
Discuss causal interventions (changing reasoning steps to see if output changes), comparison with mechanistic interpretability findings, and the philosophical and practical limits of evaluating reasoning faithfulness.
Scenario-Based
10 questions
Prioritize critical dimensions: legal accuracy (hallucination of legal citations), completeness of key clauses, compliance with jurisdiction-specific requirements, and set up both automated and human evaluation tracks.
Check eval suite for metric saturation, prompt template drift, user population shift, evaluation data contamination, and whether the eval captures the dimensions users actually care about.
Design black-box evaluation using standardized prompts, diverse test cases, adversarial probes, and blinded human preference studies; ensure fair comparison conditions (same context, same temperature settings).
Document the vulnerability, classify severity, create a regression test case, coordinate with the ML team for mitigation, verify the fix doesn't break benign behavior, and add to the safety eval suite permanently.
Stratified sampling across intents, LLM-as-judge with calibrated rubrics, automated task completion detection, spot-check with human evaluation for calibration, and dashboard monitoring of per-intent quality metrics.
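Stratified sampling across intents can be sketched as follows, so that rare intents are not drowned out by common ones. The record schema (`intent` field) and sample sizes are assumptions for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, per_group, seed=0):
    """Sample up to `per_group` records from each distinct value of `key`."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)
    sample = []
    for name in sorted(groups):  # sorted for reproducible ordering
        items = groups[name]
        sample.extend(rng.sample(items, min(per_group, len(items))))
    return sample


conversations = (
    [{"id": i, "intent": "billing"} for i in range(100)]
    + [{"id": 100 + i, "intent": "cancel"} for i in range(3)]
)
review_batch = stratified_sample(conversations, key="intent", per_group=5)
print(len(review_batch))
```

The resulting batch is what would be sent to the judge or to human spot-checkers; per-intent metrics should be reported separately, because the sample no longer reflects the production traffic mix.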
Statistically significant improvements across relevant benchmarks, fair comparison methodology (same prompts, same system settings), domain-relevant evaluation (not just MMLU), and transparent methodology documentation.
Critical findings must be flagged, medical expert review is mandatory for clinical accuracy, and evaluation must include both error severity weighting (missing a diagnosis is worse than a style issue) and regulatory compliance checks.
Weight dimensions by business and safety criticality, quantify the magnitude of regression, assess whether the regressed dimensions can be addressed with targeted fixes, and present a risk-quantified recommendation with options.
Audit test cases for relevance to current product, check for contamination with training data, assess metric coverage against current failure modes, retire outdated cases, add coverage for new capabilities, and establish a maintenance cadence.
Discuss self-preference bias of the judge model, verbosity and style bias, cost and latency of using a large judge model, need for human calibration data, and recommend using multiple judge models or a smaller calibrated judge.
AI Workflow & Tools
10 questions
Describe the EvalRegistry pattern, defining a custom eval class, creating test cases with expected grounded answers, implementing a custom scorer that checks claim-by-claim factual support, and running at scale.
Discuss enabling tracing on the chain, annotating runs with evaluation metadata, using LangSmith's evaluation datasets, running batch evaluations, and analyzing results with the LangSmith dashboard.
Describe running evaluation scripts as a CI step, setting pass/fail thresholds on key metrics, generating evaluation artifacts (reports, visualizations), and blocking deployment on regression detection.
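The pass/fail gate itself can be a very small script. The metric names and threshold values here are hypothetical; in a real pipeline the metrics would be read from the evaluation run's output, and the returned code would be passed to `sys.exit` so the CI step fails on regression.

```python
# Hypothetical minimum acceptable values for key metrics.
THRESHOLDS = {"faithfulness": 0.90, "answer_relevancy": 0.85}


def find_regressions(metrics, thresholds):
    """Return (name, value, minimum) for every metric that is missing or too low."""
    failures = []
    for name, minimum in thresholds.items():
        value = metrics.get(name)
        if value is None or value < minimum:
            failures.append((name, value, minimum))
    return failures


def gate(metrics, thresholds=THRESHOLDS):
    """Print failures and return a process exit code (0 = pass, 1 = block deploy)."""
    failures = find_regressions(metrics, thresholds)
    for name, value, minimum in failures:
        print(f"FAIL {name}: got {value}, required >= {minimum}")
    return 1 if failures else 0


exit_code = gate({"faithfulness": 0.93, "answer_relevancy": 0.81})
print("exit code:", exit_code)
```

Treating a missing metric as a failure is a deliberate choice in this sketch: a silently skipped evaluation should block deployment just as a low score does.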
Cover W&B Tables for evaluation result logging, comparison views across runs, artifact versioning for evaluation datasets, and sweeps for systematic prompt optimization with evaluation as the objective.
Describe the Evaluate module structure, implementing _compute() method, registering the metric, combining with existing metrics, and integrating into a larger evaluation pipeline.
Cover generating question-context-answer triples from production logs, computing faithfulness, answer relevance, context precision, and context recall, and setting up continuous evaluation with sampling.
Describe configuring the labeling interface with the rubric, uploading evaluation samples, managing annotators, computing inter-annotator agreement, and exporting results for analysis.
Describe setting up target LLM connectors, configuring attack strategies (prompt injection, jailbreak), running multi-turn adversarial conversations, and analyzing vulnerability reports.
Cover defining metrics (answer relevancy, faithfulness, hallucination), creating test cases, running evaluations via CI or scheduled jobs, and integrating results into alerting and dashboards.
Describe defining a promptfoo config with providers, prompts, test cases, and assertions; running evaluations in parallel across providers; comparing results in the web UI; and using it as a regression testing tool.
Behavioral
5 questions
Look for: systematic evaluation approach that led to the discovery, clear communication of the issue to stakeholders, process for quantifying the impact, and how they ensured the fix was verified.
Look for: evidence-based response, willingness to improve methodology, ability to separate evaluation findings from personal criticism, and collaborative approach to resolving disagreements.
Look for: structured decision-making, risk assessment, communication of uncertainty, and ability to make a defensible call while flagging assumptions.
Look for: proactive learning habits, engagement with research community, practical application of new findings, and ability to distinguish signal from noise in the fast-moving AI space.
Look for: ability to translate technical metrics into business impact, use of visualization and storytelling, clear recommendation tied to business goals, and awareness of the audience's priorities.