Skill Guide

Requirements engineering for probabilistic systems - writing acceptance criteria that handle non-deterministic outputs

The process of specifying measurable, testable conditions for system behavior where outputs are inherently probabilistic, often using statistical or distributional criteria rather than absolute pass/fail.

This skill prevents costly project failures by aligning stakeholder expectations with the inherent uncertainty of ML/AI and complex algorithmic systems. It directly impacts business outcomes by enabling faster deployment of high-value, non-deterministic features while managing risk.


How to Learn Requirements engineering for probabilistic systems - writing acceptance criteria that handle non-deterministic outputs

Focus on: 1) Understanding the difference between deterministic and probabilistic system outputs. 2) Learning foundational statistical concepts (mean, variance, confidence intervals). 3) Practicing writing acceptance criteria in the format 'The system shall, over N test runs, achieve X with a confidence of Y%'.
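A criterion of the form 'over N test runs, achieve X with a confidence of Y%' can be checked by comparing a lower confidence bound on the observed success rate with the required rate. The sketch below is one way to do that, assuming independent test runs and a one-sided Wilson score bound; the function names are illustrative, not taken from any library.

```python
import math
from statistics import NormalDist


def wilson_lower_bound(successes: int, trials: int, confidence: float = 0.95) -> float:
    """One-sided Wilson score lower bound on the true success rate."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    z = NormalDist().inv_cdf(confidence)  # one-sided critical value
    p_hat = successes / trials
    denom = 1 + z * z / trials
    centre = p_hat + z * z / (2 * trials)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / trials + z * z / (4 * trials * trials))
    return (centre - margin) / denom


def meets_criterion(successes: int, trials: int, required_rate: float,
                    confidence: float = 0.95) -> bool:
    """True when we are `confidence` sure the true rate is at least `required_rate`."""
    return wilson_lower_bound(successes, trials, confidence) >= required_rate


# The same observed 96% rate passes with 1000 runs but not with 100 runs.
print(meets_criterion(960, 1000, required_rate=0.92))
print(meets_criterion(96, 100, required_rate=0.92))
```

The two calls show why N belongs in the criterion: the observed rate is identical, but only the larger sample supports the claim at the stated confidence.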

Move to practice by defining criteria for real scenarios like a fraud detection model or a recommendation engine. Key methods: using statistical process control (SPC) charts for monitoring, defining tolerance bands for outputs. Common mistake: specifying criteria that are impossible to test with a realistic sample size.
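As a rough illustration of SPC-style monitoring with tolerance bands, the sketch below computes p-chart control limits for a rate metric (for example, the daily fraud flag rate) and reports which batches fall outside them. It assumes equal batch sizes and the conventional 3-sigma limits; the baseline rate and batch values are made up for the example.

```python
import math


def p_chart_limits(baseline_rate: float, batch_size: int, sigmas: float = 3.0):
    """Lower and upper control limits for a proportion observed in fixed-size batches."""
    sd = math.sqrt(baseline_rate * (1 - baseline_rate) / batch_size)
    lower = max(0.0, baseline_rate - sigmas * sd)
    upper = min(1.0, baseline_rate + sigmas * sd)
    return lower, upper


def out_of_control(batch_rates, baseline_rate: float, batch_size: int):
    """Indices of batches whose observed rate falls outside the control limits."""
    lower, upper = p_chart_limits(baseline_rate, batch_size)
    return [i for i, rate in enumerate(batch_rates) if rate < lower or rate > upper]


# Hypothetical daily flag rates, 1000 transactions per day, 2% baseline.
daily_rates = [0.021, 0.018, 0.040, 0.025]
print(p_chart_limits(0.02, 1000))
print(out_of_control(daily_rates, baseline_rate=0.02, batch_size=1000))
```

Because the limits widen as the batch size shrinks, the same band also shows when a criterion cannot be tested with the sample size actually available.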

Mastery involves designing criteria for complex, multi-component systems where uncertainty propagates. This includes specifying criteria for fairness and bias in probabilistic outputs, aligning acceptance thresholds with business risk tolerance (e.g., defining 'acceptable loss' in a predictive maintenance system), and creating feedback loops for continuous model monitoring against these criteria.
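To make the idea of propagating uncertainty concrete, here is a minimal sketch for a pipeline in which every component must succeed for the overall result to be correct. It assumes component failures are independent, which is often optimistic in practice; the rates and function names are hypothetical.

```python
import math


def end_to_end_success_rate(component_rates):
    """Success rate of a serial pipeline, assuming independent component failures."""
    return math.prod(component_rates)


def required_component_rate(target_rate: float, n_components: int) -> float:
    """Per-component rate needed if all n components share the budget equally."""
    return target_rate ** (1 / n_components)


# Three components that each look acceptable on their own.
print(end_to_end_success_rate([0.99, 0.98, 0.97]))
# Rate each of three components needs for a 95% end-to-end target.
print(required_component_rate(0.95, 3))
```

The second function works backwards from a system-level criterion to component-level criteria, which is usually the direction a requirements author needs.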

Practice Projects

Beginner

Case Study/Exercise

Defining Criteria for a Spam Classifier

Scenario

Your team has built a spam filter with a stated accuracy of 95%. The product owner wants to know the acceptance criteria for deployment. The output is non-deterministic; some emails will be misclassified.

How to Execute

1. Identify the key performance metric (e.g., Precision for spam class). 2. Define the statistical threshold (e.g., 'Precision must be >= 92%'). 3. Define the confidence requirement (e.g., 'with 95% confidence'). 4. Specify the test dataset size and composition required to validate this.
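Step 4 can be estimated before any testing starts. The sketch below uses the normal-approximation sample size formula for a proportion to estimate how many spam-flagged emails are needed to measure precision within a chosen margin, and then scales that up by the share of emails the filter is expected to flag. The expected precision, margin, and flagged fraction are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist


def required_flagged_emails(expected_precision: float, margin: float,
                            confidence: float = 0.95) -> int:
    """Flagged emails needed to estimate precision within +/- margin (two-sided)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return math.ceil(z * z * expected_precision * (1 - expected_precision) / margin ** 2)


def required_test_set_size(expected_precision: float, margin: float,
                           flagged_fraction: float, confidence: float = 0.95) -> int:
    """Total emails needed, given the fraction the filter is expected to flag as spam."""
    flagged = required_flagged_emails(expected_precision, margin, confidence)
    return math.ceil(flagged / flagged_fraction)


# Assumed: precision near 95%, +/- 2 point margin, 20% of emails flagged.
print(required_flagged_emails(0.95, 0.02))
print(required_test_set_size(0.95, 0.02, flagged_fraction=0.2))
```

Precision is computed only over emails the filter flags, so the test set must be large enough to contain the required number of flagged emails, not just the required number of emails overall.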

Intermediate

Project

Specifying SLAs for a Probabilistic API

Scenario

You are engineering lead for a content recommendation API. The service returns a ranked list of items, and the 'ideal' order is subjective. You need to write acceptance criteria for a new vendor's algorithm.

How to Execute

1. Move from point-estimate criteria (e.g., 'NDCG@10 must be 0.8') to distributional criteria (e.g., 'The median NDCG@10 across user cohorts must be >= 0.75, with the 5th percentile >= 0.6'). 2. Define performance across user segments (new vs. power users). 3. Incorporate business metrics as guardrails (e.g., 'click-through rate must not drop by more than 5% from the baseline'). 4. Specify A/B test duration and sample size for statistical significance.
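The distributional criterion in step 1 can be sketched as follows. The NDCG helper computes the ideal ordering from the returned items only, which is a common simplification, and the cohort scores are invented for the example; the thresholds are the ones quoted above.

```python
import math
from statistics import median, quantiles


def dcg_at_k(relevances, k: int = 10) -> float:
    """Discounted cumulative gain over the top-k graded relevances."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k: int = 10) -> float:
    """NDCG@k, normalised by the best ordering of the same items."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def passes_distributional_criteria(cohort_scores, median_floor: float = 0.75,
                                   p5_floor: float = 0.60) -> bool:
    """Check the median and the 5th percentile of per-cohort NDCG@10."""
    p5 = quantiles(cohort_scores, n=20, method="inclusive")[0]
    return median(cohort_scores) >= median_floor and p5 >= p5_floor


# Hypothetical per-cohort NDCG@10 values.
healthy = [0.82, 0.78, 0.74, 0.80, 0.65, 0.77, 0.79, 0.81]
one_bad_cohort = [0.82, 0.78, 0.74, 0.80, 0.40, 0.77, 0.79, 0.81]
print(passes_distributional_criteria(healthy))
print(passes_distributional_criteria(one_bad_cohort))
```

Both lists have the same median, so a median-only criterion would accept the second one even though one cohort is served poorly; the percentile floor is what catches it.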

Advanced

Case Study/Exercise

Acceptance Criteria for an Autonomous Driving Subsystem

Scenario

You are responsible for signing off on a new object detection model for a self-driving car's perception stack. The system's failure modes (false negatives) have catastrophic potential. The output is inherently probabilistic.

How to Execute

1. Define safety-critical acceptance criteria at multiple layers: component (model-level) and integrated system-level. 2. Use extreme value theory and specify criteria for tail-risk performance (e.g., 'The 99.99th percentile false negative rate for pedestrians in rainy conditions at night must be < 1e-7'). 3. Define 'deterministic' fallback behaviors the system must exhibit when probabilistic confidence drops below a threshold. 4. Create a verification and validation (V&V) plan using simulation, closed-course testing, and shadow-mode deployment to gather the massive, high-quality data needed to test these criteria with statistical rigor.
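A quick calculation shows why step 4 calls for massive data. Assuming independent trials and zero observed failures, the number of trials needed to demonstrate a failure rate below p with confidence C is ln(1 - C) / ln(1 - p), roughly 3 / p at 95% confidence (the "rule of three"). The sketch below also includes a toy version of the step 3 fallback gate; the threshold and action names are hypothetical.

```python
import math


def zero_failure_trials(max_failure_rate: float, confidence: float = 0.95) -> int:
    """Failure-free independent trials needed to show the rate is below the limit."""
    return math.ceil(math.log(1 - confidence) / math.log1p(-max_failure_rate))


def choose_action(detection_confidence: float, threshold: float = 0.9) -> str:
    """Deterministic fallback: hand over to a fixed safe behaviour below the threshold."""
    return "proceed" if detection_confidence >= threshold else "minimal_risk_maneuver"


# Roughly 30 million failure-free trials for a 1e-7 limit at 95% confidence.
print(zero_failure_trials(1e-7))
print(choose_action(0.97), choose_action(0.42))
```

A single observed failure pushes the required sample higher still, which is one reason simulation and shadow-mode data are combined rather than relying on road testing alone.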

Tools & Frameworks

Statistical & ML Tools

Scikit-learn (metrics module), TensorFlow Extended (TFX) - Model Validation, MLflow

Used for calculating the metrics that underpin acceptance criteria. TFX Model Validation and MLflow are critical for automating the tracking and validation of statistical properties against defined thresholds during CI/CD for ML.

Methodologies & Frameworks

Requirements Traceability Matrix (RTM), Quality Attribute Scenarios (QAS), ISO/IEC/IEEE 29148:2018 (Requirements Engineering Standard)

RTM and QAS help structure and trace probabilistic criteria. The ISO standard provides the rigorous foundation for writing any requirement, including probabilistic ones, ensuring they are verifiable and unambiguous.

Testing & Monitoring Platforms

Evidently AI (for data & model drift), WhyLabs, Great Expectations

Essential for operationalizing acceptance criteria. These tools monitor production systems for statistical drift and can trigger alerts or rollbacks when key probabilistic metrics breach acceptance thresholds.
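Independently of any particular platform, a drift check of this kind can be sketched with the population stability index (PSI) over binned values. This is a generic illustration, not the API of the tools listed above; the 0.2 alert threshold is a commonly quoted rule of thumb rather than a standard, and the bin counts are invented.

```python
import math


def population_stability_index(expected_counts, actual_counts, eps: float = 1e-6) -> float:
    """PSI between a reference and a production histogram over the same bins."""
    if len(expected_counts) != len(actual_counts):
        raise ValueError("histograms must use the same bins")
    e_total, a_total = sum(expected_counts), sum(actual_counts)
    psi = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_frac = max(e / e_total, eps)  # eps avoids log(0) for empty bins
        a_frac = max(a / a_total, eps)
        psi += (a_frac - e_frac) * math.log(a_frac / e_frac)
    return psi


def drift_alert(psi: float, threshold: float = 0.2) -> bool:
    """True when drift exceeds the (assumed) acceptance threshold."""
    return psi > threshold


reference = [250, 250, 250, 250]   # score histogram at validation time
production = [100, 200, 300, 400]  # hypothetical histogram from live traffic
psi = population_stability_index(reference, production)
print(round(psi, 3), drift_alert(psi))
```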

Interview Questions

Answer Strategy

The candidate must demonstrate they can translate a business goal into verifiable, probabilistic technical specs. Strategy: 1) Clarify the business risk tolerance for false positives (approved applicants who go on to default). 2) Define a primary metric (e.g., AUC-ROC, KS statistic). 3) Define secondary criteria around fairness and bias (e.g., equal opportunity difference). 4) Specify criteria for performance stability across time and segments. Sample Answer: 'First, I'd partner with risk to quantify the acceptable increase in default rate, say from 2% to 2.5%. Then, the primary acceptance criterion becomes: The model's Gini coefficient must be >= 0.45 on a holdout set, with the lower bound of its 95% confidence interval at or above that value. Additionally, I'd require the equal opportunity difference across protected groups to be below 0.05. Finally, I'd specify that these metrics must remain within +/- 5% of their validation values over a 3-month monitoring window.'
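The two metrics named in the sample answer can be computed as in the sketch below. The Gini coefficient is derived from AUC-ROC as 2 * AUC - 1, and the equal opportunity difference is taken as the absolute gap in true positive rate between two groups. The pairwise AUC is quadratic in the number of examples and is meant only for illustration; the data is invented.

```python
def auc_roc(labels, scores) -> float:
    """AUC as the probability that a positive outscores a negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def gini(labels, scores) -> float:
    """Gini coefficient of a scoring model, derived from AUC-ROC."""
    return 2 * auc_roc(labels, scores) - 1


def equal_opportunity_difference(labels, preds, groups, group_a, group_b) -> float:
    """Absolute difference in true positive rate between two groups."""
    def tpr(group):
        hits = [p for y, p, g in zip(labels, preds, groups) if y == 1 and g == group]
        return sum(hits) / len(hits)
    return abs(tpr(group_a) - tpr(group_b))


print(gini([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
print(equal_opportunity_difference(
    labels=[1, 1, 1, 1, 0, 0], preds=[1, 1, 1, 0, 0, 1],
    groups=["a", "a", "b", "b", "a", "b"], group_a="a", group_b="b"))
```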

Answer Strategy

Tests the ability to create testable criteria for subjective outputs. Core competency: Moving from output evaluation to process and constraint evaluation. Sample Answer: 'For a creative generative system, I avoid specifying the exact output. Instead, I define acceptance in layers. 1) Technical Constraints: The model must always produce a syntactically valid JSON object if that's the expected format. 2) Safety & Guardrails: The output must be classified as non-toxic by a toxicity classifier with at least 99% confidence. 3) Quality Attributes: Using a rubric, human evaluators must rate the output's relevance and coherence as 'good' or better in at least 7 out of 10 sampled outputs. These criteria are then tested via automated format/safety checks and a statistically significant human evaluation loop.'
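The layered approach in the sample answer might be wired together roughly as follows. This is a sketch under stated assumptions: the toxicity layer is read as "the classifier's non-toxic probability is at least 0.99", the classifier scores and human ratings are supplied as plain lists, and all names are hypothetical.

```python
import json


def is_valid_json_object(text: str) -> bool:
    """Layer 1: the output parses as JSON and the top-level value is an object."""
    try:
        return isinstance(json.loads(text), dict)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False


def evaluate_layers(outputs, non_toxic_probs, human_ratings,
                    min_non_toxic: float = 0.99, min_good_fraction: float = 0.7) -> dict:
    """Return a pass/fail verdict per layer: format, safety, and rated quality."""
    good = sum(1 for rating in human_ratings if rating in {"good", "excellent"})
    return {
        "format": all(is_valid_json_object(o) for o in outputs),
        "safety": all(p >= min_non_toxic for p in non_toxic_probs),
        "quality": good / len(human_ratings) >= min_good_fraction,
    }


verdict = evaluate_layers(
    outputs=['{"title": "A"}', '{"title": "B"}'],
    non_toxic_probs=[0.995, 0.999],
    human_ratings=["good"] * 7 + ["poor"] * 3,
)
print(verdict, all(verdict.values()))
```

Reporting each layer separately keeps a format regression from being masked by good human ratings, and the reverse.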