AI Data Lake Engineer
An AI Data Lake Engineer designs, builds, and optimizes large-scale data lake and lakehouse architectures purpose-built for AI and…
Skill Guide
The process of designing a unified data architecture that combines the cost-effectiveness and scalability of data lakes with the ACID transaction support and data management features of data warehouses, using table format frameworks like Delta Lake, Apache Iceberg, or Apache Hudi.
Scenario
You are tasked with replacing a nightly batch-loaded data warehouse table for e-commerce orders with a live, transactional table on S3/ADLS that supports real-time updates and historical queries.
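The two requirements in this scenario, atomic upserts and historical reads, can be illustrated with a toy in-memory model. This is only a sketch of the semantics, not any table format's API: `VersionedTable`, `merge`, and `read` are hypothetical names, and a real Lakehouse table would typically express the same operation as a `MERGE INTO` statement plus a time-travel read.

```python
from dataclasses import dataclass, field


@dataclass
class VersionedTable:
    """Toy stand-in for a transactional table (hypothetical, in-memory only)."""
    key: str
    snapshots: list = field(default_factory=lambda: [{}])  # version 0 is empty

    def merge(self, changes):
        """Apply a batch of upserts and commit it as one new snapshot."""
        current = dict(self.snapshots[-1])   # copy-on-write: old versions stay intact
        for row in changes:
            current[row[self.key]] = row     # update if the key exists, else insert
        self.snapshots.append(current)       # the whole batch becomes visible at once
        return len(self.snapshots) - 1       # new version number

    def read(self, version=None):
        """Read the latest snapshot, or an older one for historical queries."""
        snap = self.snapshots[-1 if version is None else version]
        return sorted(snap.values(), key=lambda r: r[self.key])


orders = VersionedTable(key="order_id")
v1 = orders.merge([
    {"order_id": 1, "status": "placed"},
    {"order_id": 2, "status": "placed"},
])
v2 = orders.merge([
    {"order_id": 2, "status": "shipped"},  # update to an existing order
    {"order_id": 3, "status": "placed"},   # new order
])
```

Reading `orders.read(version=v1)` still returns the two original orders, while `orders.read()` reflects the shipped update and the new order.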
Scenario
A streaming pipeline is writing millions of small JSON sensor readings per hour to a Lakehouse table, causing slow query performance and high metadata overhead.
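A common remedy for this scenario is compaction: rewriting many small files into fewer files near a target size. The sketch below shows only the planning step, using a simple first-fit-decreasing heuristic. The function name `plan_compaction` and the 128 MB target are assumptions for illustration; the table formats ship their own maintenance operations for this (for example Delta Lake's `OPTIMIZE` or Iceberg's `rewrite_data_files` procedure), which may group files differently.

```python
def plan_compaction(file_sizes_mb, target_mb=128):
    """Group small files into rewrite batches of at most target_mb each."""
    # Files already at or above the target are left alone.
    small = sorted((s for s in file_sizes_mb if s < target_mb), reverse=True)
    bins = []  # each bin lists the file sizes to be rewritten as one output file
    for size in small:
        for b in bins:
            if sum(b) + size <= target_mb:   # first bin with enough room
                b.append(size)
                break
        else:
            bins.append([size])              # no bin fits: open a new one
    # A bin holding a single file gains nothing from a rewrite.
    return [b for b in bins if len(b) > 1]


plan = plan_compaction([200, 60, 50, 40, 30, 10, 5])
```

In this example the six small files collapse into two output files, so the table goes from seven data files to three.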
Scenario
Design a Lakehouse architecture for a financial services firm that must serve raw data to data scientists (Spark), curated data to BI analysts (Trino), and masked data to external auditors, all with column-level security and full audit lineage.
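The per-audience access rules in this scenario can be sketched as a default-deny column policy with an audit trail. Everything here is hypothetical: the role names, column names, and the `apply_policy` helper are illustrative, and the truncated SHA-256 stands in for whatever masking function a real deployment would choose. In practice the policy would be enforced by the catalog or authorization layer rather than by application code.

```python
import hashlib

# Hypothetical policy: per role, each column is shown in clear, hashed, or denied.
POLICIES = {
    "data_scientist": {"account_id": "clear", "ssn": "hash", "balance": "clear"},
    "bi_analyst": {"account_id": "clear", "ssn": "deny", "balance": "clear"},
    "external_auditor": {"account_id": "hash", "ssn": "deny", "balance": "clear"},
}


def apply_policy(row, role, audit_log):
    """Return the view of `row` that `role` may see, and record the access."""
    policy = POLICIES[role]
    out = {}
    for col, value in row.items():
        action = policy.get(col, "deny")     # columns without a rule are denied
        if action == "clear":
            out[col] = value
        elif action == "hash":
            digest = hashlib.sha256(str(value).encode()).hexdigest()
            out[col] = digest[:12]           # illustrative mask, not a vetted scheme
    audit_log.append({"role": role, "columns": sorted(out)})
    return out


row = {"account_id": "AC-1001", "ssn": "123-45-6789", "balance": 2500.0}
audit_log = []
auditor_view = apply_policy(row, "external_auditor", audit_log)
```

The auditor's view keeps `balance`, replaces `account_id` with a masked token, and omits `ssn` entirely, while `audit_log` records which columns were served to which role.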
The core open table formats that provide ACID transactions on object storage. Iceberg is known for partition evolution and hidden partitioning. Delta Lake is strongest in the Databricks/Spark ecosystem, with features such as Z-ORDER clustering. Hudi focuses on incremental data processing and has built-in change data capture (CDC).
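Hidden partitioning, mentioned above for Iceberg, means the partition value is derived from a column by a transform, so readers filter on the original column and never reference a separate partition column. The sketch below imitates that idea with a day transform in plain Python. The helpers `day_transform`, `write`, and `scan` are hypothetical and do not reflect Iceberg's actual implementation or API.

```python
from datetime import datetime, timezone


def day_transform(ts):
    """Derive the partition value (UTC calendar date) from an event timestamp."""
    return ts.astimezone(timezone.utc).date()


def write(files, row):
    """Route a row to its partition; the writer never sets a partition column."""
    files.setdefault(day_transform(row["event_ts"]), []).append(row)


def scan(files, start, end):
    """Return rows with start <= event_ts < end, reading only matching partitions."""
    lo, hi = day_transform(start), day_transform(end)
    touched = [d for d in files if lo <= d <= hi]   # partition pruning
    rows = [r for d in touched for r in files[d] if start <= r["event_ts"] < end]
    return rows, touched


files = {}
for day, hour in [(1, 9), (2, 14), (5, 8)]:
    ts = datetime(2024, 1, day, hour, tzinfo=timezone.utc)
    write(files, {"event_ts": ts, "reading": day * 1.5})

rows, touched = scan(
    files,
    datetime(2024, 1, 2, tzinfo=timezone.utc),
    datetime(2024, 1, 3, tzinfo=timezone.utc),
)
```

The query filters only on `event_ts`, yet just one of the three partitions is read.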
Query and processing engines used for ETL, batch processing, and interactive queries. Spark is the primary engine for data transformation. Trino serves federated, low-latency SQL analytics. Flink handles stream processing and streaming ingestion into the Lakehouse.
The underlying storage layer for the Lakehouse, providing scalable, durable, and cost-effective data file storage. The table format metadata is also stored here.
Glue and Unity Catalog provide metastore services for table metadata and access control. Atlas provides metadata management and lineage. Ranger provides fine-grained authorization policies, which are typically enforced through the query engines that read the tables.