AI Data Lake Engineer
An AI Data Lake Engineer designs, builds, and optimizes large-scale data lake and lakehouse architectures purpose-built for AI and…
Skill Guide
The design, implementation, and management of scalable, secure, and cost-optimized data storage, processing, and governance layers on a primary public cloud platform (AWS, GCP, or Azure).
Scenario
You receive raw JSON and CSV log files from multiple application teams. You need to store them centrally, catalog them for discovery, and allow analysts to run SQL queries without managing servers.
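One way to make this scenario concrete is to decide the object-key layout before anything is cataloged. The sketch below assumes a Hive-style `key=value` layout, which catalog crawlers and serverless SQL engines can typically read as partitions. The `raw` prefix, the `format`/`team`/`dt` partition names, and the function name are hypothetical choices, not a prescribed convention.

```python
from datetime import date
from pathlib import PurePosixPath

RAW_PREFIX = "raw"  # hypothetical top-level prefix for the landing zone


def raw_object_key(team: str, source_file: str, event_date: date) -> str:
    """Build a Hive-style partitioned object key for one incoming log file."""
    path = PurePosixPath(source_file)
    file_format = path.suffix.lower().lstrip(".")
    if file_format not in {"json", "csv"}:
        raise ValueError(f"unsupported log format: {source_file!r}")
    # Splitting by format first keeps each cataloged table to a single file format.
    return (
        f"{RAW_PREFIX}/format={file_format}/team={team}/"
        f"dt={event_date.isoformat()}/{path.name}"
    )


key = raw_object_key("payments", "app-0001.json", date(2024, 1, 15))
print(key)  # raw/format=json/team=payments/dt=2024-01-15/app-0001.json
```

With keys shaped like this, analysts can filter on `team` and `dt` in SQL, and the engine only needs to read the matching prefixes.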
Scenario
Finance and Marketing teams require curated datasets from the raw data lake. You must build an automated pipeline that transforms data and manages cross-team access without direct bucket/container sharing.
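The access part of this scenario can be pictured as grants on curated datasets and columns rather than on storage paths. The following is a minimal in-memory sketch of that idea only. In practice a managed permission layer (for example, Lake Formation on AWS) would do the enforcing; the `GRANTS` mapping, dataset name, and `read_curated` function are hypothetical.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical grants: team -> curated dataset -> columns that team may read.
GRANTS: Dict[str, Dict[str, Set[str]]] = {
    "finance": {"orders_curated": {"order_id", "amount", "currency"}},
    "marketing": {"orders_curated": {"order_id", "campaign"}},
}


def read_curated(team: str, dataset: str, rows: Iterable[dict]) -> List[dict]:
    """Return only the columns the team has been granted on the dataset."""
    allowed = GRANTS.get(team, {}).get(dataset)
    if not allowed:
        raise PermissionError(f"{team} has no grant on {dataset}")
    return [{k: v for k, v in row.items() if k in allowed} for row in rows]


curated_rows = [
    {"order_id": 1, "amount": 9.5, "currency": "EUR", "campaign": "spring"},
]
finance_view = read_curated("finance", "orders_curated", curated_rows)
marketing_view = read_curated("marketing", "orders_curated", curated_rows)
```

Neither team touches the underlying bucket or container: each one sees a projection of the same curated dataset, and revoking access is a change to the grant, not to storage permissions.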
Scenario
The organization is scaling rapidly. Different business units (Supply Chain, R&D, Customer Analytics) need self-service, domain-owned data products with SLA guarantees, while central governance must enforce security and cost controls.
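A domain-owned data product with an SLA can be described as a small contract that central governance checks automatically. The sketch below is one possible shape for such a check, under stated assumptions: the `DataProduct` fields, the required tag names, and the freshness-based SLA are all hypothetical, not a standard.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Dict, List

# Hypothetical central policy: every data product must carry these tags.
REQUIRED_TAGS = ("owner_domain", "cost_center", "classification")


@dataclass
class DataProduct:
    name: str
    freshness_sla: timedelta      # maximum allowed age of the latest refresh
    last_updated: datetime
    tags: Dict[str, str] = field(default_factory=dict)


def governance_violations(product: DataProduct, now: datetime) -> List[str]:
    """List policy problems: missing tags first, then a breached freshness SLA."""
    problems = [f"missing tag: {t}" for t in REQUIRED_TAGS if not product.tags.get(t)]
    if now - product.last_updated > product.freshness_sla:
        problems.append("freshness SLA breached")
    return problems


inventory = DataProduct(
    name="supply_chain.inventory_daily",
    freshness_sla=timedelta(hours=24),
    last_updated=datetime(2024, 1, 1, tzinfo=timezone.utc),
    tags={"owner_domain": "supply_chain", "classification": "internal"},
)
report = governance_violations(inventory, now=datetime(2024, 1, 2, 12, tzinfo=timezone.utc))
```

The domain team owns the product and its SLA value; the central team owns only `REQUIRED_TAGS` and the check, which keeps governance enforceable without taking over the pipelines.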
These services are the fundamental building blocks. Know the use case, pricing model, and integration patterns of each service in your primary cloud. For example: S3, GCS, or ADLS for raw storage; Glue, Dataplex, or Microsoft Purview for metadata and governance; and Athena, BigQuery, or Synapse for serverless SQL.
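Pricing models matter because serverless SQL engines commonly bill by the amount of data a query reads. The arithmetic below is a rough sketch of how partition pruning changes that bill. The price of 5.0 per TB, the binary definition of a terabyte, and the partition sizes are illustrative assumptions; check your provider's current pricing and rounding rules.

```python
from typing import Dict, Set

TB = 1024 ** 4   # bytes per terabyte (binary definition, an assumption)
GIB = 1024 ** 3  # bytes per gibibyte


def scan_cost(bytes_scanned: int, price_per_tb: float) -> float:
    """Cost of one query under a pay-per-data-scanned model."""
    return bytes_scanned / TB * price_per_tb


def bytes_after_pruning(partition_sizes: Dict[str, int], wanted: Set[str]) -> int:
    """Bytes read when the engine only opens the partitions the query filters on."""
    return sum(size for part, size in partition_sizes.items() if part in wanted)


# Hypothetical table: 30 daily partitions of 100 GiB each.
partitions = {f"dt=2024-01-{day:02d}": 100 * GIB for day in range(1, 31)}
PRICE = 5.0  # hypothetical price per TB scanned

full_scan = scan_cost(sum(partitions.values()), PRICE)
one_day = scan_cost(bytes_after_pruning(partitions, {"dt=2024-01-15"}), PRICE)
print(f"full scan: {full_scan:.2f}, one day: {one_day:.2f}")
```

Under these assumptions a query filtered to one day reads one thirtieth of the data, which is why partition layout and columnar formats are cost decisions as much as performance decisions.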
Infrastructure as code and workflow orchestration are critical for repeatability and auditability. Use Terraform or a cloud-native IaC tool to define all storage buckets, IAM roles, and catalogs as code. Use a managed Airflow service to orchestrate complex, dependency-driven pipelines that have outgrown simple cron schedules.
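What separates dependency-driven orchestration from cron is that each task runs only after the tasks it depends on have finished. The sketch below shows that ordering with Python's standard-library `graphlib`. It is not Airflow code, and the task names are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Set

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
PIPELINE: Dict[str, Set[str]] = {
    "ingest_raw": set(),
    "update_catalog": {"ingest_raw"},
    "transform_curated": {"update_catalog"},
    "quality_checks": {"transform_curated"},
    "publish_finance": {"quality_checks"},
    "publish_marketing": {"quality_checks"},
}


def execution_order(pipeline: Dict[str, Set[str]]) -> List[str]:
    """Return one valid run order in which every task follows its dependencies."""
    try:
        return list(TopologicalSorter(pipeline).static_order())
    except CycleError as exc:
        # The detected cycle is the second element of the exception's args.
        raise ValueError(f"pipeline has a circular dependency: {exc.args[1]}") from exc


order = execution_order(PIPELINE)
print(order)
```

An orchestrator adds scheduling, retries, and backfills on top of this ordering, but the dependency graph is the part a cron table cannot express.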
Spark is the workhorse for large-scale data transformation. dbt is widely used for version-controlled, documented SQL transformations in the curated layer. Data quality tools are non-negotiable in production pipelines: without them, bad input silently becomes bad output ("garbage in, garbage out").
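The data quality idea can be illustrated with two of the most common checks, "not null" and "unique", applied before a dataset is published. This is a plain-Python sketch that mirrors the concept behind such tests, not the API of dbt or any specific quality tool; the function names and sample rows are hypothetical.

```python
from typing import Dict, List


def not_null(rows: List[dict], column: str) -> List[int]:
    """Indices of rows where the column is missing or None."""
    return [i for i, row in enumerate(rows) if row.get(column) is None]


def unique(rows: List[dict], column: str) -> List[int]:
    """Indices of rows whose column value repeats an earlier row's value."""
    seen = set()
    failures = []
    for i, row in enumerate(rows):
        value = row.get(column)
        if value in seen:
            failures.append(i)
        seen.add(value)
    return failures


def run_checks(rows: List[dict], key_column: str) -> Dict[str, List[int]]:
    """Run both checks on the key column and return only the ones that failed."""
    results = {
        "not_null": not_null(rows, key_column),
        "unique": unique(rows, key_column),
    }
    return {name: bad for name, bad in results.items() if bad}


orders = [
    {"order_id": 1, "amount": 10.0},
    {"order_id": 1, "amount": 12.5},   # duplicate key
    {"order_id": None, "amount": 3.0}, # missing key
]
failures = run_checks(orders, "order_id")
if failures:
    print(f"blocking publish, failed checks: {failures}")
```

Wiring a gate like this between the transform step and the publish step is what stops a bad batch from reaching the curated layer.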