AI Benchmark Dataset Designer
An AI Benchmark Dataset Designer architects curated evaluation datasets that objectively measure AI model capabilities, safety, fa…
Skill Guide
Data contamination detection and train-test leakage prevention are the practices of identifying and eliminating unintended overlaps, dependencies, or shared information between training, validation, and test datasets, so that evaluation metrics reflect true generalization performance.
Scenario
You are given a Kaggle-style dataset for customer churn prediction. The provided train/test split might contain overlapping customer IDs or future data leaking into training.
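A minimal sketch of how both checks in this scenario could be run with pandas. The column names `customer_id` and `snapshot_date` and the toy rows are assumptions made for illustration; they do not come from any real churn dataset.

```python
import pandas as pd


def find_id_overlap(train: pd.DataFrame, test: pd.DataFrame, id_col: str) -> set:
    """Return the IDs that appear in both splits."""
    return set(train[id_col]) & set(test[id_col])


def has_temporal_leak(train: pd.DataFrame, test: pd.DataFrame, time_col: str) -> bool:
    """True if any training row is dated at or after the earliest test row."""
    return bool(train[time_col].max() >= test[time_col].min())


# Toy data; column names are hypothetical.
train = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "snapshot_date": pd.to_datetime(["2023-01-05", "2023-02-10", "2023-03-15"]),
})
test = pd.DataFrame({
    "customer_id": [3, 4],
    "snapshot_date": pd.to_datetime(["2023-03-01", "2023-04-20"]),
})

overlap = find_id_overlap(train, test, "customer_id")
leak = has_temporal_leak(train, test, "snapshot_date")
print("Overlapping IDs:", overlap)
print("Training data reaches into the test period:", leak)
```

Here customer 3 appears in both splits, and the latest training snapshot falls after the earliest test snapshot, so both checks flag a problem.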
Scenario
You inherit a scikit-learn pipeline where the `StandardScaler` is fitted on the entire dataset before splitting, causing information leakage.
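One way the fix might look, assuming a simple classification task: split first, then let a `Pipeline` fit the scaler on the training portion only. The synthetic data and the choice of `LogisticRegression` are placeholders for illustration, not part of the inherited pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption for this sketch).
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y = (X[:, 0] + rng.normal(size=200) > 5.0).astype(int)

# Split BEFORE any fitting, so test rows never influence the scaler.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# The pipeline fits StandardScaler on X_train only, then reuses those
# training statistics when transforming X_test inside score().
model = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression()),
])
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

# The learned mean comes from the training rows, not the full dataset.
scaler_mean = model.named_steps["scaler"].mean_
print("Scaler mean:", scaler_mean)
print("Test accuracy:", test_accuracy)
```

Because the scaler lives inside the pipeline, the same object can be passed to `cross_val_score` and the scaler will be refitted on each training fold.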
Scenario
Your company is evaluating a large language model on public benchmarks (e.g., MMLU, HumanEval). You need to verify that the model's training data did not contain the test questions, which would inflate scores.
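A rough sketch of one common heuristic for this check: measuring word n-gram overlap between a benchmark item and candidate training documents. It assumes you have access to (a sample of) the training corpus, which is often not the case for third-party models. The function names, the window size `n`, and the example strings are all illustrative assumptions, not an established standard.

```python
import re


def ngrams(text: str, n: int) -> set:
    """Lowercase, tokenize on word characters, and return the set of n-grams."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_ratio(benchmark_item: str, corpus_docs: list, n: int = 8) -> float:
    """Fraction of the item's n-grams that also occur somewhere in the corpus."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0  # Item is shorter than n tokens; nothing to compare.
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)


# Hypothetical corpus sample and benchmark questions.
corpus = [
    "Quiz dump: What is the capital of France, and when was it founded? Answer follows.",
    "A short note about gardening in cold climates.",
]
leaked_question = "What is the capital of France and when was it founded"
clean_question = "Which sorting algorithm has the best worst-case running time"

leaked_score = contamination_ratio(leaked_question, corpus, n=5)
clean_score = contamination_ratio(clean_question, corpus, n=5)
print(leaked_score, clean_score)
```

Exact n-gram matching misses paraphrased or translated copies, so a low ratio is weak evidence of a clean benchmark rather than proof.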
Use scikit-learn's `Pipeline` to encapsulate every preprocessing step so that each one is fitted on training data only. Use pandas for quick overlap analysis. Great Expectations can be integrated into CI/CD to assert data expectations such as 'test set must not contain IDs from training set'.
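The Split-Then-Transform Rule: never fit a data-dependent transformation (scaling, imputation, encoding, feature selection) before splitting; fit it on the training split and apply it unchanged to the validation and test splits. Temporal Integrity: in time-series data, every test record must be chronologically later than every training record. Group-Aware Splitting: for data with hierarchical structure (e.g., multiple samples per user), split by group (user ID) so that no group appears on both sides and user-level leakage is prevented.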
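The group-aware rule can be sketched with scikit-learn's `GroupShuffleSplit`, which assigns whole groups to one side of the split. The user IDs, features, and labels below are made-up toy values.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy data: several samples per user (all values are illustrative).
user_ids = np.array([1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 5, 5])
X = np.arange(len(user_ids)).reshape(-1, 1)
y = np.array([0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0])

# test_size is a fraction of GROUPS, not of rows.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.4, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=user_ids))

train_users = set(user_ids[train_idx].tolist())
test_users = set(user_ids[test_idx].tolist())
print("Train users:", train_users)
print("Test users:", test_users)
```

For cross-validation, `GroupKFold` applies the same idea across folds.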
Use these platforms to version control your datasets and the specific splits used for each experiment, ensuring reproducibility and auditability of leakage prevention measures.
Answer Strategy
Structure the answer using a root-cause analysis framework. First, examine the data pipeline for preprocessing leakage (scaling, imputation before split). Second, check for entity leakage (same user/device in train and test). Third, verify temporal leakage for time-dependent data. Provide a concrete example of finding a feature derived from the target (target leakage) and how you'd fix it using a Pipeline.
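As one possible way to surface target leakage during such an analysis, the sketch below flags numeric features that separate the classes almost perfectly on their own. The 0.95 threshold, the helper name, and the column `days_since_cancellation_request` are assumptions chosen for illustration; a flagged feature still needs a human to confirm that it is derived from the target.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score


def suspicious_features(df: pd.DataFrame, target_col: str, threshold: float = 0.95) -> dict:
    """Flag numeric features whose single-feature AUC is at or above the threshold."""
    flagged = {}
    y = df[target_col]
    for col in df.columns.drop(target_col):
        auc = roc_auc_score(y, df[col])
        auc = max(auc, 1.0 - auc)  # Direction of the relationship does not matter.
        if auc >= threshold:
            flagged[col] = float(auc)
    return flagged


# Synthetic example with one honest feature and one leaky feature.
rng = np.random.default_rng(0)
churned = rng.integers(0, 2, size=500)
df = pd.DataFrame({
    "monthly_spend": rng.normal(50, 10, size=500),
    # Hypothetical column that only takes a non-zero value AFTER churn is known.
    "days_since_cancellation_request": np.where(
        churned == 1, rng.integers(1, 30, size=500), 0
    ),
    "churned": churned,
})

flagged = suspicious_features(df, "churned")
print(flagged)
```

Once a leaky column is confirmed, the usual fix is to drop it (or rebuild it from information available at prediction time) and keep the remaining preprocessing inside a `Pipeline`.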
Answer Strategy
The core competency tested is understanding group-based and temporal splitting. The answer must explicitly state: 1) Never split randomly. 2) Use a time-based split where training data ends on day T, and evaluation uses data from day T+1 onward. 3) Additionally, perform group-based splitting where you hold out a percentage of users completely from training (the 'cold start' test set) to evaluate on unseen users. Explain that random splitting would allow future interactions to leak into training, causing massive overestimation of performance.
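The combined temporal and cold-start split described above might be sketched as follows. The function name, the column names, and the decision to discard held-out users' pre-cutoff rows are assumptions of this example; in practice the held-out users would typically be sampled at random rather than listed by hand.

```python
import pandas as pd


def temporal_cold_start_split(df, user_col, time_col, cutoff, holdout_users):
    """Split into train (<= cutoff, seen users), warm test, and cold-start test."""
    cutoff = pd.Timestamp(cutoff)
    is_holdout = df[user_col].isin(holdout_users)
    is_future = df[time_col] > cutoff

    train = df[~is_future & ~is_holdout]      # Past interactions of non-held-out users.
    test_warm = df[is_future & ~is_holdout]   # Future interactions of users seen in training.
    test_cold = df[is_future & is_holdout]    # Future interactions of never-seen users.
    return train, test_warm, test_cold


# Toy interaction log (values are illustrative).
log = pd.DataFrame({
    "user_id": ["a", "a", "b", "b", "c", "c"],
    "timestamp": pd.to_datetime([
        "2024-01-05", "2024-01-12", "2024-01-08",
        "2024-01-15", "2024-01-03", "2024-01-20",
    ]),
})

train, test_warm, test_cold = temporal_cold_start_split(
    log, "user_id", "timestamp", cutoff="2024-01-10", holdout_users={"c"}
)
print(len(train), len(test_warm), len(test_cold))
```

Reporting metrics separately on the warm and cold-start sets shows how much of the model's performance depends on having seen a user before.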