AI Synthetic Data Engineer
An AI Synthetic Data Engineer designs, generates, and validates artificial datasets that replicate the statistical properties of r…
Skill Guide
The ability to write efficient, production-grade Python code that leverages pandas for data manipulation, NumPy for numerical computation, scikit-learn for classical machine learning, and PyTorch for deep learning research and deployment.
Scenario
You are given a messy CSV file of historical sales data with missing values, inconsistent date formats, and duplicate entries.
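One way to approach this scenario is sketched below. The column names, the two date formats, and the median imputation are all assumptions made for illustration; real sales data would need its own rules, especially for ambiguous dates such as 05/01/2023.

```python
import io
import pandas as pd

# Hypothetical toy data standing in for the messy sales file.
RAW_CSV = """order_id,order_date,region,amount
1,2023-01-05,north,100.0
2,05/01/2023,North,
2,05/01/2023,North,
3,2023-02-10,south,250.5
4,not a date,south,80.0
"""

# Assumed formats; the second treats 05/01/2023 as month/day/year.
DATE_FORMATS = ["%Y-%m-%d", "%m/%d/%Y"]


def parse_dates(col: pd.Series, formats: list) -> pd.Series:
    """Try each format in turn; values matching none become NaT."""
    parsed = pd.to_datetime(col, format=formats[0], errors="coerce")
    for fmt in formats[1:]:
        parsed = parsed.fillna(pd.to_datetime(col, format=fmt, errors="coerce"))
    return parsed


def clean_sales(csv_source) -> pd.DataFrame:
    df = pd.read_csv(csv_source)
    # Normalize text before deduplicating so "North" and "north" compare equal.
    df["region"] = df["region"].str.strip().str.lower().astype("category")
    df["order_date"] = parse_dates(df["order_date"], DATE_FORMATS)
    df = df.drop_duplicates().reset_index(drop=True)
    # Median imputation is an illustrative choice, not a recommendation.
    df["amount"] = df["amount"].fillna(df["amount"].median())
    return df


cleaned = clean_sales(io.StringIO(RAW_CSV))
print(cleaned)
```

Unparseable dates are kept as NaT rather than dropped, so the decision about those rows stays visible to whoever reviews the output.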
Scenario
Build a deployable model to predict customer churn using a dataset with numerical, categorical, and text features.
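A minimal sketch of how mixed feature types might be combined in a single scikit-learn `Pipeline`, assuming a tiny made-up dataset. The column names (`monthly_spend`, `plan`, `last_ticket`, `churned`) and the choice of logistic regression are hypothetical, chosen only to show the wiring.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data for illustration only; not a real churn dataset.
toy = pd.DataFrame({
    "monthly_spend": [20.0, 75.5, 30.0, 90.0, 15.0, 60.0],
    "plan": ["basic", "pro", "basic", "pro", "basic", "pro"],
    "last_ticket": [
        "app keeps crashing", "love the service", "billing error again",
        "works great", "cancel my account", "no issues",
    ],
    "churned": [1, 0, 1, 0, 1, 0],
})

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["monthly_spend"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["plan"]),
    # A text vectorizer expects a 1-D column, so pass the name, not a list.
    ("txt", TfidfVectorizer(), "last_ticket"),
])

churn_model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

X = toy.drop(columns="churned")
y = toy["churned"]
churn_model.fit(X, y)
preds = churn_model.predict(X)
print(preds)
```

Because preprocessing and the classifier live in one object, the fitted pipeline can be serialized as a unit (for example with `joblib.dump`) and applied to raw rows at serving time.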
Scenario
Develop a U-Net model for medical image segmentation where pre-trained models are insufficient, requiring custom data loading and augmentation.
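The custom data loading part of this scenario could look roughly like the sketch below. It assumes images and masks are already in memory as tensors, and uses a random horizontal flip as the only augmentation; the class name `SegmentationDataset` and the `flip_prob` parameter are hypothetical. The key point is that any geometric augmentation must be applied identically to the image and its mask.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class SegmentationDataset(Dataset):
    """Pairs each image with its mask and flips both together."""

    def __init__(self, images: torch.Tensor, masks: torch.Tensor, flip_prob: float = 0.5):
        assert len(images) == len(masks)
        self.images = images
        self.masks = masks
        self.flip_prob = flip_prob

    def __len__(self) -> int:
        return len(self.images)

    def __getitem__(self, idx: int):
        image, mask = self.images[idx], self.masks[idx]
        # One random draw decides the flip for both tensors, keeping them aligned.
        if torch.rand(1).item() < self.flip_prob:
            image = torch.flip(image, dims=[-1])
            mask = torch.flip(mask, dims=[-1])
        return image, mask


# Random tensors stand in for real scans and annotations.
images = torch.rand(8, 1, 32, 32)
masks = (torch.rand(8, 1, 32, 32) > 0.5).float()

loader = DataLoader(SegmentationDataset(images, masks), batch_size=4, shuffle=True)
batch_images, batch_masks = next(iter(loader))
print(batch_images.shape, batch_masks.shape)
```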
pandas/NumPy for data manipulation and numerical operations; scikit-learn for traditional ML modeling and preprocessing; PyTorch for dynamic deep learning model development. Use PyTorch's `DataLoader` for batching and scikit-learn's `Pipeline` for encapsulating preprocessing and modeling steps.
Use notebooks for exploration; IDEs for maintaining code quality in larger projects; Docker for creating reproducible environments; experiment tracking tools (W&B, MLflow) for logging metrics, hyperparameters, and model versions.
Answer Strategy
Demonstrate knowledge of computational efficiency. First, optimize pandas data loading by specifying dtypes (e.g., the category dtype for low-cardinality strings) and reading only the necessary columns. Second, use scikit-learn's `SGDClassifier` or `PassiveAggressiveClassifier` with `partial_fit` for out-of-core learning. Third, apply feature selection (e.g., `SelectFromModel` with an L1-regularized estimator) before training. Fourth, consider using `HistGradientBoostingClassifier`, which handles NaNs natively and is highly optimized.
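The first two points can be sketched together as follows. The data here is synthetic and generated in memory, and the column names (`x1`, `x2`, `unused`, `label`) and chunk size are assumptions; in practice the source would be a file too large to load at once.

```python
import io
import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier

# Build a synthetic CSV in memory to stand in for a large file on disk.
rng = np.random.default_rng(0)
n = 2000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
label = (x1 + x2 > 0).astype(int)
frame = pd.DataFrame({"x1": x1, "x2": x2, "unused": "pad", "label": label})
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
buffer.seek(0)

clf = SGDClassifier(random_state=0)

# Read only the needed columns, with compact dtypes, one chunk at a time.
chunks = pd.read_csv(
    buffer,
    usecols=["x1", "x2", "label"],
    dtype={"x1": "float32", "x2": "float32", "label": "int8"},
    chunksize=500,
)
for chunk in chunks:
    # partial_fit needs the full set of classes, since a chunk may miss some.
    clf.partial_fit(
        chunk[["x1", "x2"]].to_numpy(),
        chunk["label"].to_numpy(),
        classes=np.array([0, 1]),
    )

accuracy = clf.score(frame[["x1", "x2"]].to_numpy(), label)
print(f"accuracy after one pass: {accuracy:.3f}")
```

Only one chunk is held in memory at a time, so peak memory depends on `chunksize` rather than on the size of the file.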
Answer Strategy
Tests debugging methodology and deep learning intuition. Strategy: 1) Check the data pipeline: verify that the `DataLoader` returns correct images and labels by visualizing a batch. 2) Simplify: overfit a single batch to confirm the model can learn at all. 3) Check hyperparameters: ensure the learning rate is neither too high nor too low and that the loss function suits the task. 4) Inspect gradients: log per-layer gradient norms to detect vanishing or exploding gradients; `torch.autograd.gradcheck` is only needed to numerically verify a custom autograd function. 5) Verify the model architecture: ensure layers are connected correctly (e.g., print the model and check tensor shapes).
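Steps 2 and 4 can be combined in a short sanity check like the one below. The tiny network, the random batch, and the helper name `grad_norms` are placeholders; the same loop would be run with the real model and one real batch.

```python
import math

import torch
from torch import nn

torch.manual_seed(0)

# Placeholder model and a single fixed batch of random data.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
inputs = torch.randn(16, 10)
targets = torch.randint(0, 2, (16,))

optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()


def grad_norms(module: nn.Module) -> dict:
    """Return the L2 norm of the gradient for each parameter that has one."""
    return {
        name: p.grad.norm().item()
        for name, p in module.named_parameters()
        if p.grad is not None
    }


losses = []
for step in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    if step % 50 == 0:
        # Norms near zero suggest vanishing gradients; very large ones, exploding.
        print(step, round(loss.item(), 4), grad_norms(model))
    optimizer.step()
    losses.append(loss.item())

final_norms = grad_norms(model)
```

If the loss on a single batch does not fall steadily, the problem is more likely in the model, loss, or optimizer setup than in the amount of data.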