Interview Prep
AI Testing Engineer Interview Questions
50 expert questions covering beginner fundamentals to advanced AI workflow scenarios. Each answer includes a hint for structured responses.
Beginner
5 questions
Discuss non-determinism, probabilistic outputs, and the need for evaluation metrics vs. exact string matching.
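As a minimal sketch of why exact string matching breaks down on probabilistic outputs, the snippet below compares it with a token-overlap F1 score. The function names and the example sentences are illustrative assumptions, and token F1 is only one of many possible evaluation metrics.

```python
import re
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def exact_match(prediction: str, reference: str) -> bool:
    """Strict comparison: any rewording counts as a failure."""
    return prediction == reference


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall, ignoring word order."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the same fact, worded differently.
reference = "Paris is the capital of France."
prediction = "The capital of France is Paris."
```

Here `exact_match` rejects the reworded answer while `token_f1` scores it as a full match. Token overlap ignores word order and meaning, so in practice it is usually combined with semantic or model-based metrics.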
Define the vulnerability (manipulating a model via input) and suggest a test with adversarial prompts.
Define it as generating plausible but factually incorrect or unsupported information.
Because inputs/outputs are non-deterministic, you need to trace exact prompts, parameters, and outputs for debugging.
Mention 'Faithfulness' (is answer supported by context) and 'Relevance' (does it answer the question).
Intermediate
10 questions
Outline stages: define user journeys, design prompt templates for test cases, choose evaluation metrics (both automated and human), plan for adversarial testing.
Explain creating a structured test set with variations, using counterfactual testing, and analyzing results with fairness metrics.
Using a stronger LLM (like GPT-4) to grade outputs of a weaker model. Limitations include cost, bias in the judge model, and circular dependency.
Discuss sourcing data from multiple perspectives, using synthetic data augmentation, and regularly reviewing test sets for gaps.
A regression is a degradation in performance. Causes: model update, prompt template change, data drift, or a change in the downstream vector database.
Mention tools like Locust or k6, focusing on measuring latency (Time to First Token, Total Time), concurrency limits, and cost implications.
Test tool selection logic, error handling when a tool fails, the final synthesis step, and overall goal completion rate.
A curated, high-quality dataset with known correct answers used as a benchmark to consistently measure model performance over time.
Strategies: set temperature=0 to reduce output variance in testing (some variation can remain), run each test multiple times and apply statistical significance tests to the results, and score against acceptable ranges rather than binary pass/fail.
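One way to turn "run tests multiple times" into a decision rule is to put a confidence interval around the observed pass rate. The sketch below uses the Wilson score interval; the function names, the 95% z-value default, and the thresholds are assumptions chosen for illustration.

```python
import math


def wilson_interval(passes: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 is roughly 95% confidence)."""
    if runs <= 0 or not 0 <= passes <= runs:
        raise ValueError("need runs > 0 and 0 <= passes <= runs")
    p = passes / runs
    denom = 1 + z**2 / runs
    centre = (p + z**2 / (2 * runs)) / denom
    half_width = z * math.sqrt(p * (1 - p) / runs + z**2 / (4 * runs**2)) / denom
    return centre - half_width, centre + half_width


def passes_gate(passes: int, runs: int, min_rate: float) -> bool:
    """Pass only if even the lower bound of the interval clears the minimum rate."""
    low, _ = wilson_interval(passes, runs)
    return low >= min_rate


# Hypothetical result: the same test case passed 18 times out of 20 repeated runs.
low, high = wilson_interval(18, 20)
```

With only 20 runs the interval is wide, so a 90% observed pass rate does not by itself show that the true pass rate is above 80%. More runs narrow the interval.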
Testing the semantic understanding of a user query where exact output matching is meaningless; you need to test intent classification and response quality.
Advanced
10 questions
Discuss a pipeline: log samples, run automated evaluations (model-as-judge, heuristic checks), compute rolling metrics, and set up alerts for drift or quality drops.
Custom: full control, no licensing cost (though you carry the build and maintenance effort), tailored to needs. Commercial: faster setup, built-in tracing, collaboration features, vendor lock-in risk.
When benchmark test data leaks into model training data. Test for it by checking if the model's performance is suspiciously high on specific benchmarks, or using canary strings.
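A canary-string check can be sketched as follows, assuming the benchmark files were published with a unique marker embedded in them. The prefix, the secret value, and the `generate` callable (any function that maps a prompt to a completion) are hypothetical placeholders, not a specific benchmark's or vendor's API.

```python
from typing import Callable

# Hypothetical canary: a unique marker assumed to be embedded in the benchmark files.
CANARY_PREFIX = "BENCHMARK DATA SHOULD NOT APPEAR IN TRAINING CORPORA. canary GUID "
CANARY_SECRET = "00000000-aaaa-bbbb-cccc-example"


def leaks_canary(generate: Callable[[str], str], prefix: str, secret: str) -> bool:
    """Prompt the model with the canary prefix and check whether it completes the secret."""
    completion = generate(prefix)
    return secret in completion


# Stub models standing in for real inference calls.
def clean_model(prompt: str) -> str:
    return "I am not sure what should follow this text."


def contaminated_model(prompt: str) -> str:
    return CANARY_SECRET + " and further memorized benchmark text"
```

A positive result is strong evidence that the benchmark leaked into training data. A negative result proves little, since a model can have seen the data without memorizing the marker.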
It requires domain expert collaboration to categorize error severity, extensive human-in-the-loop evaluation, and setting threshold metrics per error category.
Use techniques like templating with variations, leveraging another LLM to generate edge cases, and incorporating real-world user logs (anonymized) to ensure coverage.
Parse and evaluate the reasoning steps separately for coherence, factual accuracy, and relevance to the final answer. This requires specialized evaluators.
Routing a small percentage of live traffic to a new model version. Test by comparing key metrics (quality, latency, cost) between canary and stable models in real time.
Compare the fine-tuned model's performance on a broad benchmark suite (not just the fine-tuning task) against the base model. Performance drops indicate forgetting.
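The comparison can be sketched as a per-benchmark diff between base and fine-tuned scores. The benchmark names, the scores, and the 0.02 tolerance below are made-up values for illustration only.

```python
def forgetting_report(
    base: dict[str, float], tuned: dict[str, float], tolerance: float = 0.02
) -> dict[str, float]:
    """Return benchmarks where the fine-tuned score dropped by more than `tolerance`."""
    regressions = {}
    for name in sorted(base.keys() & tuned.keys()):
        drop = base[name] - tuned[name]
        if drop > tolerance:
            regressions[name] = round(drop, 4)
    return regressions


# Hypothetical scores: the fine-tuning task ("code") improved, "math" degraded.
base_scores = {"qa": 0.80, "math": 0.60, "code": 0.50}
tuned_scores = {"qa": 0.79, "math": 0.45, "code": 0.70}
report = forgetting_report(base_scores, tuned_scores)
```

The tolerance absorbs run-to-run noise; in a real suite it would be set from the measured variance of each benchmark.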
Use a set of carefully crafted prompts with near-deterministic outputs (low temperature) that act as a signature. A change in these outputs signals a model swap.
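A minimal sketch of the signature idea: hash the outputs of a fixed prompt set and compare the digest against a stored baseline. The `generate` callable and the probe prompts are assumptions. Because low-temperature outputs are not always perfectly stable, a production check might compare outputs with a similarity threshold instead of an exact hash.

```python
import hashlib
from typing import Callable


def fingerprint(generate: Callable[[str], str], prompts: list[str]) -> str:
    """Hash every (prompt, output) pair into a single hex digest."""
    digest = hashlib.sha256()
    for prompt in prompts:
        output = generate(prompt)
        digest.update(prompt.encode("utf-8"))
        digest.update(b"\x00")  # separator so adjacent fields cannot blur together
        digest.update(output.encode("utf-8"))
        digest.update(b"\x00")
    return digest.hexdigest()


# Hypothetical probe prompts and stub models standing in for real endpoints.
PROBES = ["Spell 'cat' backwards.", "What is 17 * 3?"]


def model_v1(prompt: str) -> str:
    return {"Spell 'cat' backwards.": "tac", "What is 17 * 3?": "51"}[prompt]


def model_v2(prompt: str) -> str:
    return {"Spell 'cat' backwards.": "t-a-c", "What is 17 * 3?": "51"}[prompt]


baseline = fingerprint(model_v1, PROBES)
```

Comparing `fingerprint(model_v2, PROBES)` against `baseline` flags the change even though only one probe answer differs.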
Accuracy is correctness. Calibration is whether the model's confidence scores (e.g., '90% sure') align with actual correctness rates. Poor calibration harms user trust.
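Calibration is often summarized with Expected Calibration Error (ECE), which bins predictions by stated confidence and compares each bin's average confidence with its actual accuracy. The sketch below is one simple equal-width-bin variant; the bin count and the sample data are illustrative assumptions.

```python
def expected_calibration_error(
    confidences: list[float], correct: list[bool], n_bins: int = 10
) -> float:
    """Weighted average gap between stated confidence and observed accuracy per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0
    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[index].append((conf, ok))
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(conf for conf, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += len(items) / total * abs(avg_conf - accuracy)
    return ece


# Hypothetical data: the model says "90% sure" on ten answers.
well_calibrated = expected_calibration_error([0.9] * 10, [True] * 9 + [False])
overconfident = expected_calibration_error([0.9] * 10, [True] * 5 + [False] * 5)
```

The first case yields an ECE near zero (9 of 10 correct at 90% confidence), while the second yields about 0.4, even though a plain accuracy number would not reveal the mismatch.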
Scenario-Based
10 questions
Outline steps: reproduce, trace logs, check input/output, assess guardrails failure, then implement stronger safety classifiers, expand medical test suite, and improve disclaimer logic.
Steps: 1) Inspect pipeline logs for data corruption, 2) Rebuild the vector store, 3) Run an isolated evaluation on old vs. new data, 4) Compare retrieval metrics (recall, precision) before and after.
Prioritize: 1) Use a small, curated set of existing documents, 2) Manually create 20-30 golden examples, 3) Run automated evaluations for conciseness, faithfulness, 4) Conduct quick human evaluation with team, 5) Define metrics for launch and post-launch monitoring.
Craft prompts that ask for or contain PII (names, emails). Test if the model echoes it, memorizes it, or can be tricked into revealing training data containing PII.
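A rough sketch of an automated check for this scenario, limited to email addresses: it separates addresses the model echoed from the prompt from addresses that appear only in the output, which are the more worrying case. The regular expression is a deliberately simple assumption and will miss many PII formats; real suites typically use a dedicated PII detector.

```python
import re

# Simplified email pattern, for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def classify_emails(prompt: str, output: str) -> dict[str, list[str]]:
    """Split emails found in the output into echoed (from the prompt) and novel ones."""
    in_prompt = set(EMAIL_RE.findall(prompt))
    in_output = set(EMAIL_RE.findall(output))
    return {
        "echoed": sorted(in_output & in_prompt),
        "novel": sorted(in_output - in_prompt),
    }


# Hypothetical test case with made-up addresses.
result = classify_emails(
    prompt="Please summarize the ticket from alice@example.com.",
    output="The ticket from alice@example.com was escalated to bob@example.org.",
)
```

A test might allow echoed addresses in some products and forbid them in others, but a "novel" address that the user never supplied usually deserves investigation.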
Test the data pipeline (sanitization, formatting), validate the training script with a tiny subset, monitor training loss curves for anomalies, and run a suite of capability and safety evaluations before and after.
1) Create a test suite with prompts targeting known vulnerable patterns (SQLi, XSS). 2) Integrate static analysis (like Bandit for Python) into the output evaluation. 3) Implement a post-generation security linter as a guardrail.
Use identical prompts, a large and diverse test set, control for temperature/top-p, measure not just quality but also latency, cost, and rate limits. Use statistical tests on the results.
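For the statistical-test step, one option is a paired permutation (sign-flip) test on per-prompt scores, since both models answer the same prompts. This is a sketch under that assumption; the score lists, resample count, and seed are illustrative.

```python
import random


def paired_permutation_test(
    scores_a: list[float],
    scores_b: list[float],
    n_resamples: int = 10_000,
    seed: int = 0,
) -> float:
    """Two-sided p-value for the mean per-prompt score difference between two models."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and the same length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs) / len(diffs))
    rng = random.Random(seed)  # fixed seed keeps the test itself reproducible
    extreme = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the sign of each paired difference is arbitrary.
        total = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(total / len(diffs)) >= observed:
            extreme += 1
    return (extreme + 1) / (n_resamples + 1)


# Hypothetical per-prompt quality scores for two models on the same 20 prompts.
model_a = [0.9] * 20
model_b = [0.8] * 20
p_value = paired_permutation_test(model_a, model_b)
```

A small p-value suggests the quality gap is unlikely to be noise. Latency, cost, and rate limits would be compared separately.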
Build a multilingual test set with idioms, cultural references, and sensitive topics. Use native speaker evaluators. Check for translation quality, semantic preservation, and culturally appropriate responses.
Design conversation scripts that span 10+ turns, referencing past facts. Test if the model recalls correctly, doesn't confuse memories, and handles contradictions gracefully.
It indicates a gap in your test data distribution. Solution: 1) Collect or generate data from that demographic, 2) Expand your benchmark, 3) Investigate if the base model has a bias or if it's a prompt/application layer issue.
AI Workflow & Tools
10 questions
Describe using the tracing feature to inspect each step: the input query, retrieved documents, prompt sent to LLM, and final output. Identify where faithfulness or relevance breaks down.
Load a dataset from HF Hub, define a custom evaluation function or use a standard metric from the `evaluate` library, run inference on model predictions, and compute metrics in a reproducible way.
You can test tool selection by providing a prompt and a list of functions, then verify the model calls the correct function with the correct, structured arguments.
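The verification step can be sketched as a comparison between the model's emitted call and the expected one. The JSON shape used here (`name` plus `arguments`) is an assumption for illustration; the actual response format depends on the provider's function-calling API.

```python
import json


def check_tool_call(raw_call: str, expected_name: str, expected_args: dict) -> list[str]:
    """Return a list of problems; an empty list means the call matches expectations."""
    try:
        call = json.loads(raw_call)
    except json.JSONDecodeError:
        return ["tool call is not valid JSON"]
    if not isinstance(call, dict):
        return ["tool call is not a JSON object"]

    problems = []
    if call.get("name") != expected_name:
        problems.append(f"expected tool {expected_name!r}, got {call.get('name')!r}")

    args = call.get("arguments", {})
    if isinstance(args, str):
        # Some APIs return the arguments as a JSON-encoded string.
        try:
            args = json.loads(args)
        except json.JSONDecodeError:
            return problems + ["arguments are not valid JSON"]
    if not isinstance(args, dict):
        return problems + ["arguments are not a JSON object"]

    for key, expected in expected_args.items():
        if key not in args:
            problems.append(f"missing argument {key!r}")
        elif args[key] != expected:
            problems.append(f"argument {key!r}: expected {expected!r}, got {args[key]!r}")
    for key in sorted(args.keys() - expected_args.keys()):
        problems.append(f"unexpected argument {key!r}")
    return problems


# Hypothetical model outputs for the prompt "What's the weather in Paris in Celsius?"
good_call = '{"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}'
incomplete_call = '{"name": "get_weather", "arguments": {"city": "Paris"}}'
expected = {"city": "Paris", "unit": "celsius"}
```

Returning a list of problems rather than a boolean makes failed test cases easier to triage, since one run reports every mismatch at once.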
Steps: Code Commit -> Build Docker Image -> Run Unit Tests -> Deploy to Staging -> Run Integration & AI Quality Tests (eval suite on a small golden dataset) -> If pass, deploy to Canary -> Monitor -> Full rollout.
Log the prompt template as a parameter, the input query, the output, and computed evaluation metrics (e.g., custom score, latency). Use W&B Tables to compare results side-by-side.
Purpose: to test retrieval accuracy. Use: Populate a test vector store with known documents, run retrieval with test queries, and measure if the correct documents are returned in the top-k results.
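The top-k measurement can be expressed without any particular vector database: given the ranked document IDs a retriever returned and the IDs known to be relevant, compute precision and recall at k. The IDs below are made up, and the retriever itself is assumed to exist elsewhere.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc_id in top_k if doc_id in relevant) / len(top_k)


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    found = set(retrieved[:k]) & relevant
    return len(found) / len(relevant)


# Hypothetical ranked result for one test query, with two known-relevant documents.
retrieved_ids = ["d3", "d1", "d7", "d2"]
relevant_ids = {"d1", "d2"}
```

At k=3 this query finds one of the two relevant documents, so recall is 0.5 and precision is 1/3. Averaging these values over all test queries gives the suite-level retrieval score.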
Containerize the AI application and its dependencies (model, DB, etc.). Use Kubernetes to spin up ephemeral namespaces for each test run, ensuring no state leakage between tests.
Create a Python script that loads test data, runs the model, computes the metric, and exits with a non-zero code if below threshold. Call this script from a GitHub Actions workflow step.
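A minimal sketch of such a gate script is shown below. The metric (case-insensitive exact match), the stub model, the test data, and the threshold are all placeholders; a real script would load its own data and call the actual model.

```python
import sys
from typing import Callable


def exact_match_rate(predictions: list[str], references: list[str]) -> float:
    """Share of predictions equal to the reference, ignoring case and outer whitespace."""
    if len(predictions) != len(references) or not references:
        raise ValueError("need equally sized, non-empty prediction and reference lists")
    hits = sum(p.strip().lower() == r.strip().lower() for p, r in zip(predictions, references))
    return hits / len(references)


def run_gate(
    model: Callable[[str], str], test_data: list[tuple[str, str]], threshold: float
) -> int:
    """Return a process exit code: 0 if the score meets the threshold, 1 otherwise."""
    predictions = [model(question) for question, _ in test_data]
    references = [answer for _, answer in test_data]
    score = exact_match_rate(predictions, references)
    print(f"score={score:.3f} threshold={threshold:.3f}", file=sys.stderr)
    return 0 if score >= threshold else 1


# Hypothetical golden examples and a stub model that gets one of them wrong.
GOLDEN = [("2+2?", "4"), ("Capital of France?", "Paris"), ("Opposite of hot?", "cold")]


def stub_model(question: str) -> str:
    return {"2+2?", "4"} and {"2+2?": "4", "Capital of France?": "paris"}.get(question, "warm")


# A real script would end with: sys.exit(run_gate(stub_model, GOLDEN, threshold=0.9))
```

GitHub Actions marks a step as failed when its command exits with a non-zero code, so calling the script from a `run:` step is enough to block the workflow when quality drops below the threshold.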
Set up a sampling pipeline that sends a % of production logs to DeepEval for automated scoring on metrics like toxicity, hallucination, and relevance. Alert if scores degrade.
Configure Model Monitor to track data drift (input distribution changes) and model quality drift (output metric degradation) over time, triggering alerts or retraining pipelines.
Behavioral
5 questions
Look for a structured STAR (Situation, Task, Action, Result) response focusing on the investigation process, collaboration, and the impact of the finding.
Assess ability to translate technical concepts into business impact (e.g., 'The model might sometimes confidently make up facts, which could lead to customer distrust or incorrect decisions').
Look for a prioritization framework based on risk: safety, core functionality, and high-usage paths are essential; edge cases and minor polish can be deferred with a plan.
Evaluate ability to use data (test results, user feedback), articulate impact from a user's perspective, and seek a compromise or escalation path.
Look for specific habits: following key researchers (e.g., via Twitter/X), reading arXiv papers, participating in communities (like EleutherAI), taking short courses, and experimenting with new tools.