AI Data Catalog Specialist
An AI Data Catalog Specialist designs, curates, and governs metadata-rich data catalogs that power AI and ML initiatives across th…
Skill Guide
The operational expertise to design, build, and manage the end-to-end data flow that transforms raw data into reliable, version-controlled features for machine learning model training and serving.
Scenario
Using the Kaggle 'House Prices' dataset, create a pipeline that computes aggregate features (e.g., neighborhood average price, price per sqft) and stores them for model training.
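The aggregate features in this scenario can be sketched with the standard library alone. This is a minimal illustration, not a full pipeline: the rows are made-up stand-ins, the column names `Neighborhood`, `GrLivArea`, and `SalePrice` are assumed to match the dataset's CSV header, and `build_features` is a hypothetical helper name.

```python
from collections import defaultdict
from statistics import mean

# Made-up stand-in rows; column names are assumed to match the dataset's CSV header.
rows = [
    {"Id": 1, "Neighborhood": "CollgCr", "GrLivArea": 1500, "SalePrice": 210000},
    {"Id": 2, "Neighborhood": "CollgCr", "GrLivArea": 2000, "SalePrice": 250000},
    {"Id": 3, "Neighborhood": "OldTown", "GrLivArea": 1000, "SalePrice": 120000},
]

def build_features(rows):
    """Return one feature record per house (hypothetical helper)."""
    # Group sale prices by neighborhood, then average each group.
    prices_by_hood = defaultdict(list)
    for r in rows:
        prices_by_hood[r["Neighborhood"]].append(r["SalePrice"])
    hood_avg = {hood: mean(prices) for hood, prices in prices_by_hood.items()}

    return [
        {
            "Id": r["Id"],
            "neighborhood_avg_price": hood_avg[r["Neighborhood"]],
            "price_per_sqft": r["SalePrice"] / r["GrLivArea"],
        }
        for r in rows
    ]

features = build_features(rows)
```

Both features are derived from the sale price, so in a real training setup the averages would likely need to be computed on training rows only to avoid leaking the target into the features.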
Scenario
Simulate an e-commerce platform where user click-stream data must be aggregated into features (e.g., 'user_last_10_items_viewed') and served with low latency (<50ms) to a model API.
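A feature such as 'user_last_10_items_viewed' is essentially a bounded, per-user list of recent events. The sketch below keeps it in process memory using a `deque`. `RecentItemsStore` is a hypothetical class name, and a real deployment would more likely keep this state in an external low-latency store. Nothing here demonstrates the <50ms target.

```python
from collections import defaultdict, deque

class RecentItemsStore:
    """In-memory stand-in for an online store of recently viewed items."""

    def __init__(self, max_items=10):
        # Each user gets a deque that silently drops the oldest item when full.
        self._views = defaultdict(lambda: deque(maxlen=max_items))

    def record_view(self, user_id, item_id):
        """Ingest one click-stream event."""
        self._views[user_id].append(item_id)

    def get_feature(self, user_id):
        """Return the user's recent items, most recent first."""
        # .get() avoids creating an empty entry for unknown users.
        return list(reversed(self._views.get(user_id, ())))

store = RecentItemsStore()
for i in range(12):
    store.record_view("u1", f"item{i}")
recent = store.get_feature("u1")
```

After twelve views only the last ten remain, so the two oldest items have been evicted.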
Scenario
A large organization has three data science teams building ad-hoc pipelines, leading to duplicated effort, inconsistent feature definitions, and high cloud costs. Leadership mandates a unified, self-serve feature platform.
Workflow orchestrators are used to define, schedule, and monitor complex, multi-step data workflows as directed acyclic graphs (DAGs). Airflow is the most widely adopted; Kubeflow runs on Kubernetes and targets ML-centric workflows; Prefect offers a more Pythonic API.
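The core idea behind a DAG-based workflow is that each task runs only after the tasks it depends on. The sketch below shows that ordering with Python's standard `graphlib` module (Python 3.9+). It does not use the API of any of the tools named above. The task names and the `run_pipeline` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on (hypothetical task names).
dag = {
    "extract": set(),
    "validate": {"extract"},
    "compute_features": {"validate"},
    "write_offline": {"compute_features"},
    "write_online": {"compute_features"},
}

def run_pipeline(dag, tasks):
    """Run every task after all of its dependencies; return the execution order."""
    order = []
    # static_order() raises graphlib.CycleError if the graph is not acyclic.
    for name in TopologicalSorter(dag).static_order():
        tasks[name]()
        order.append(name)
    return order

# Placeholder callables standing in for real pipeline steps.
tasks = {name: (lambda: None) for name in dag}
executed = run_pipeline(dag, tasks)
```

A real orchestrator adds scheduling, retries, and monitoring on top of this ordering guarantee.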
Feature stores are specialized systems for managing the full lifecycle of features: defining transformations, storing historical (offline) and low-latency (online) feature values, and ensuring consistency between training and serving. Feast is a widely used open-source choice; Tecton and SageMaker Feature Store are managed platforms for production scale.
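Consistency between training and serving largely comes down to two lookups over the same feature history: the value as of a past timestamp (offline, for building training sets) and the latest value (online, for serving). The sketch below illustrates that distinction with plain Python. It is not the API of Feast, Tecton, or SageMaker, and `feature_log`, `value_as_of`, and `online_value` are hypothetical names.

```python
from bisect import bisect_right

# Per-entity history of (timestamp, value) pairs, sorted by timestamp (made-up data).
feature_log = {
    "user_1": [(10, 0.2), (20, 0.5), (30, 0.9)],
}

def value_as_of(feature_log, entity_id, ts):
    """Offline lookup: the latest value recorded at or before ts, else None."""
    history = feature_log.get(entity_id, [])
    timestamps = [t for t, _ in history]
    # bisect_right counts the entries with timestamp <= ts.
    idx = bisect_right(timestamps, ts)
    return history[idx - 1][1] if idx else None

def online_value(feature_log, entity_id):
    """Online lookup: the most recent value, else None."""
    history = feature_log.get(entity_id, [])
    return history[-1][1] if history else None

training_value = value_as_of(feature_log, "user_1", 25)
serving_value = online_value(feature_log, "user_1")
```

Using the as-of lookup for training rows prevents values recorded after the label's timestamp from leaking into the training set.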
Data validation tools are used to assert expectations about data (e.g., 'column X must not be null', 'values in column Y must be between 0 and 100'). They integrate directly into pipelines and halt execution when data drifts or is corrupted, preventing garbage-in, garbage-out model training.
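The two example expectations above can be sketched as small check functions plus a gate that stops the pipeline when any check fails. This is a hand-rolled illustration rather than the API of any particular validation library. All names (`expect_not_null`, `expect_between`, `validate`, `DataValidationError`) are hypothetical.

```python
class DataValidationError(Exception):
    """Raised to halt the pipeline when an expectation fails."""

def expect_not_null(column):
    def check(rows):
        bad = [i for i, r in enumerate(rows) if r.get(column) is None]
        return f"{column} must not be null", bad
    return check

def expect_between(column, low, high):
    def check(rows):
        # Nulls are left to expect_not_null; only present values are range-checked.
        bad = [
            i for i, r in enumerate(rows)
            if r.get(column) is not None and not (low <= r[column] <= high)
        ]
        return f"{column} must be between {low} and {high}", bad
    return check

def validate(rows, checks):
    """Return rows unchanged if every check passes, otherwise raise."""
    results = (check(rows) for check in checks)
    failures = [(message, bad) for message, bad in results if bad]
    if failures:
        raise DataValidationError(failures)
    return rows

checks = [expect_not_null("x"), expect_between("y", 0, 100)]
clean_rows = validate([{"x": 1, "y": 50}, {"x": 2, "y": 100}], checks)
```

Placing `validate` before the feature computation step means a failed expectation stops the run before any bad rows reach training.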
Parquet is a widely used columnar format for efficient feature storage. Delta Lake and Iceberg add ACID transactions and time travel on top of cloud data lakes. Protocol Buffers (Protobuf) define strict schemas for real-time feature exchange in streaming pipelines.