Skill Guide

Cloud data infrastructure on AWS (S3, Glue, Lake Formation, Athena), GCP (BigQuery, GCS, Dataplex), or Azure (Synapse, ADLS)

The design, implementation, and management of scalable, secure, and cost-optimized data storage, processing, and governance layers on a primary public cloud platform (AWS, GCP, or Azure).

This skill enables organizations to centralize disparate data assets into a single source of truth, accelerating analytics and ML model development while enforcing governance. Done well, it can shorten time-to-insight from weeks to hours and reduce the cost of data silos and the exposure to compliance risk.


How to Learn Cloud data infrastructure on AWS (S3, Glue, Lake Formation, Athena), GCP (BigQuery, GCS, Dataplex), or Azure (Synapse, ADLS)

1. **Core Cloud Storage Primitives**: Master object storage (S3/GCS/ADLS): bucket/container creation, lifecycle policies, storage classes (Standard/Infrequent Access), and access control (IAM/ACLs).
2. **Foundational Query & Catalog Services**: Understand serverless query engines (Athena/BigQuery) and basic metadata cataloging (AWS Glue Data Catalog/GCP Data Catalog).
3. **Security Fundamentals**: Learn IAM policies, service roles, and network security (VPCs/Private Service Connect/Private Endpoints) for data services.
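As a rough illustration of the lifecycle policies and storage classes in item 1, the sketch below builds an S3-style lifecycle configuration as a plain Python dictionary. The function names and the `raw/logs/` prefix are hypothetical, and the 30-day minimum reflects the S3 rule for Standard-IA transitions; treat it as a starting point under those assumptions rather than a complete policy.

```python
from typing import Any, Dict, Iterable


def build_lifecycle_rule(prefix: str, ia_after_days: int, expire_after_days: int) -> Dict[str, Any]:
    """Build one lifecycle rule: transition to Standard-IA, then expire."""
    # Assumption: S3 requires objects to age at least 30 days before
    # they can transition to the Standard-IA storage class.
    if not 30 <= ia_after_days < expire_after_days:
        raise ValueError("expected 30 <= ia_after_days < expire_after_days")
    rule_id = "tier-and-expire-" + prefix.strip("/").replace("/", "-")
    return {
        "ID": rule_id,
        "Status": "Enabled",
        "Filter": {"Prefix": prefix},
        "Transitions": [{"Days": ia_after_days, "StorageClass": "STANDARD_IA"}],
        "Expiration": {"Days": expire_after_days},
    }


def build_lifecycle_configuration(rules: Iterable[Dict[str, Any]]) -> Dict[str, Any]:
    """Wrap rules in the top-level structure the S3 API expects."""
    return {"Rules": list(rules)}
```

With boto3 installed and credentials configured, the returned dictionary could be passed as the `LifecycleConfiguration` argument of the S3 client's `put_bucket_lifecycle_configuration` call. GCS and ADLS expose comparable lifecycle rules with their own schemas.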

1. **ETL/ELT Pipeline Construction**: Build pipelines using managed orchestration (AWS Step Functions/Cloud Composer/Azure Data Factory) and transformation services (Glue ETL/Dataproc/Synapse Spark). Understand the trade-off between schema-on-read and schema-on-write.
2. **Data Lake Governance**: Implement centralized governance with services like Lake Formation, Dataplex, or Microsoft Purview, managing permissions at the database/table/column level.
3. **Cost & Performance Optimization**: Practice partitioning, columnar formats with compression (Parquet/ORC), and cost controls (BigQuery slot reservations/S3 Intelligent-Tiering/ADLS tiering). Avoid common anti-patterns like small file proliferation.
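To make the partitioning and small-file points in item 3 concrete, here is a minimal sketch. It shows a Hive-style `key=value` partition prefix and a greedy planner that groups small files into compaction batches. The function names, the `curated/` layout, and the 128 MiB target are illustrative assumptions, not recommendations from any vendor.

```python
from datetime import date
from typing import List, Sequence, Tuple

MIB = 1024 * 1024


def partition_prefix(table: str, day: date) -> str:
    """Hive-style key=value layout that engines such as Athena and Spark can prune on."""
    return f"curated/{table}/year={day.year}/month={day.month:02d}/day={day.day:02d}/"


def plan_compaction(files: Sequence[Tuple[str, int]], target_bytes: int = 128 * MIB) -> List[List[str]]:
    """Greedily group small files into batches of at most target_bytes each."""
    batches: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    # Largest first, so big files do not get stranded at the end.
    for name, size in sorted(files, key=lambda f: f[1], reverse=True):
        if size >= target_bytes:
            continue  # already large enough; leave untouched
        if current and current_size + size > target_bytes:
            if len(current) > 1:  # a batch of one file would be a no-op rewrite
                batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if len(current) > 1:
        batches.append(current)
    return batches
```

Each returned batch lists the files a compaction job would rewrite into one larger file. The actual rewrite (for example a Spark job that reads the batch and writes a single Parquet file) is left out.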

1. **Multi-Cloud & Hybrid Architecture Design**: Architect solutions that leverage best-of-breed services across clouds or integrate with on-premises systems using data mesh principles and federated governance.
2. **FinOps & Continuous Optimization**: Implement automated cost monitoring, anomaly detection, and rightsizing recommendations for storage and compute.
3. **Strategic Data Product Development**: Design scalable, self-service data products and platforms for business domains, mentoring teams on infrastructure-as-code (Terraform/CloudFormation) and CI/CD for data pipelines.
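The cost anomaly detection in item 2 can start very simply. The sketch below flags days whose spend exceeds the trailing-window mean by a chosen number of standard deviations. The window, threshold, and `min_spread` floor are assumed values picked for illustration, and a production system would more likely build on the cloud provider's billing export and anomaly services.

```python
from statistics import mean, stdev
from typing import List, Sequence


def find_cost_anomalies(
    daily_costs: Sequence[float],
    window: int = 7,
    threshold: float = 3.0,
    min_spread: float = 1.0,
) -> List[int]:
    """Return indexes of days whose cost exceeds mean + threshold * spread
    of the preceding `window` days."""
    anomalies: List[int] = []
    for i in range(window, len(daily_costs)):
        history = daily_costs[i - window:i]
        # Floor the spread so a perfectly flat history does not flag tiny changes.
        spread = max(stdev(history), min_spread)
        if daily_costs[i] > mean(history) + threshold * spread:
            anomalies.append(i)
    return anomalies
```

Only upward spikes are flagged, because a sudden cost drop is rarely a FinOps emergency. A check for missing or stalled workloads would need a separate rule.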

Practice Projects

Beginner

Project

Build a Secure, Queryable Data Lake Foundation

Scenario

You receive raw JSON and CSV log files from multiple application teams. You need to store them centrally, catalog them for discovery, and allow analysts to run SQL queries without managing servers.

How to Execute

1. Create a structured bucket/container hierarchy (e.g., `raw/`, `curated/`, `processed/`) in your chosen cloud's object storage.
2. Set up metadata discovery (an AWS Glue Crawler, Dataplex automatic discovery, or a Microsoft Purview scan) to automatically detect and catalog the schema of new files.
3. Configure serverless query engine access (Athena, BigQuery, Synapse Serverless) to the cataloged tables.
4. Apply a basic IAM policy that grants read-only access to analysts and full access to the ETL service role.
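For step 4, a read-only analyst policy on AWS might look roughly like the sketch below, which builds the IAM policy document in Python. The bucket name and prefix are placeholders and the function names are hypothetical. Note that querying through Athena needs further permissions (for example on the Glue Data Catalog and the query results location) that are not shown here.

```python
import json
from typing import Any, Dict


def analyst_read_only_policy(bucket: str, prefix: str) -> Dict[str, Any]:
    """IAM policy document allowing list and read under one prefix, and nothing else."""
    prefix = prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is a bucket-level action, so it is scoped with a prefix condition.
                "Sid": "ListPrefix",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                # Reading is an object-level action, so it is scoped by resource ARN.
                "Sid": "ReadObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}/*"],
            },
        ],
    }


def to_json(policy: Dict[str, Any]) -> str:
    """Serialize the policy for the console, CLI, or an IaC template."""
    return json.dumps(policy, indent=2)
```

The ETL service role would get a separate, broader policy. Keeping the two documents apart makes it easier to audit that analysts never receive write or delete actions.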

Intermediate

Project

Implement a Governed ETL Pipeline with a Centralized Permission Model

Scenario

Finance and Marketing teams require curated datasets from the raw data lake. You must build an automated pipeline that transforms data and manages cross-team access without direct bucket/container sharing.

How to Execute

1. Define transformation logic (Spark/PySpark in Glue ETL, Dataproc, or Synapse Spark) to clean, join, and aggregate raw data into business-specific curated tables.
2. Use a centralized governance service (Lake Formation/Dataplex) to register the curated data location and define granular permissions (e.g., `finance_analysts` can SELECT on `curated_finance.*` tables).
3. Orchestrate the pipeline using a managed service (Step Functions/Cloud Composer/Data Factory) to run on schedule or event trigger.
4. Implement data quality checks (Great Expectations, Deequ) as a pipeline step to validate transformations before publishing.
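The quality gate in step 4 can be pictured with a small plain-Python stand-in. This is not the Great Expectations or Deequ API; the function name, the column names in the usage note, and the three checks (non-empty batch, required columns present, unique key) are assumptions chosen to show the shape of a validation step that blocks publishing on failure.

```python
from typing import Any, Dict, Iterable, List


def run_quality_checks(rows: Iterable[Dict[str, Any]], required: List[str], key: str) -> List[str]:
    """Return human-readable failures; an empty list means the batch may be published."""
    failures: List[str] = []
    seen = set()
    count = 0
    for i, row in enumerate(rows):
        count += 1
        # Completeness: every required column must be present and non-null.
        for col in required:
            if row.get(col) is None:
                failures.append(f"row {i}: missing {col}")
        # Uniqueness: the business key must not repeat within the batch.
        value = row.get(key)
        if value is not None:
            if value in seen:
                failures.append(f"row {i}: duplicate {key}={value!r}")
            seen.add(value)
    if count == 0:
        failures.append("batch is empty")
    return failures
```

A pipeline step would call `run_quality_checks(batch, ["order_id", "amount"], "order_id")` and fail the run when the list is non-empty, so the orchestrator never reaches the publish task with bad data.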

Advanced

Project

Architect a Multi-Domain Data Product Platform

Scenario

The organization is scaling rapidly. Different business units (Supply Chain, R&D, Customer Analytics) need self-service, domain-owned data products with SLA guarantees, while central governance must enforce security and cost controls.

How to Execute

1. Design a domain-driven architecture where each team manages its own 'data product' infrastructure (storage, compute, catalog) using Infrastructure-as-Code templates you provide.
2. Implement a federated governance layer using Lake Formation tags, Dataplex tags, or Microsoft Purview to apply global policies (PII tagging, cost allocation tags) across all domains.
3. Build a self-service portal where domains provision compliant data pipelines and monitor their own cost and performance, with metrics pulled from the cloud's monitoring APIs (e.g., CloudWatch).
4. Establish a central platform team model to provide support, maintain core services, and curate a library of reusable pipeline components.
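One way to picture the global tagging policy in step 2 is a compliance check that every domain's resources carry the mandatory tags. The tag names (`domain`, `cost_center`, `contains_pii`) and the function below are hypothetical; in practice the tag inventory would come from the governance service or the cloud's resource tagging API.

```python
from typing import Dict, List, Mapping

# Hypothetical global policy: every data product resource must carry these tags.
REQUIRED_TAGS = {"domain", "cost_center", "contains_pii"}


def tag_violations(resources: Mapping[str, Mapping[str, str]]) -> Dict[str, List[str]]:
    """Map each non-compliant resource name to a list of its problems."""
    violations: Dict[str, List[str]] = {}
    for name, tags in resources.items():
        problems = [f"missing tag: {t}" for t in sorted(REQUIRED_TAGS - set(tags))]
        pii = tags.get("contains_pii")
        # The PII flag drives access policy, so its value is validated as well.
        if pii is not None and pii not in ("true", "false"):
            problems.append(f"contains_pii must be 'true' or 'false', got {pii!r}")
        if problems:
            violations[name] = problems
    return violations
```

Run as a CI check against the IaC templates, a report like this lets the central team enforce policy without owning each domain's infrastructure.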

Tools & Frameworks

Core Infrastructure Services

- AWS S3, AWS Glue, AWS Lake Formation, Amazon Athena
- Google Cloud Storage (GCS), BigQuery, Dataplex
- Azure Data Lake Storage (ADLS) Gen2, Azure Synapse Analytics, Microsoft Purview (formerly Azure Purview)

The fundamental building blocks. You must know the specific use case, pricing model, and integration patterns for each service in your primary cloud. For example, use S3/GCS/ADLS for raw storage, Glue/Dataplex for metadata and governance, and Athena/BigQuery/Synapse for serverless SQL.

Infrastructure as Code & Orchestration

- Terraform (with AWS/GCP/Azure providers)
- AWS CloudFormation / AWS CDK
- Google Cloud Deployment Manager / Pulumi
- Apache Airflow (Managed via MWAA/Cloud Composer/Azure Data Factory)

Critical for repeatability and auditability. Use Terraform or native IaC to define all storage buckets, IAM roles, and catalogs as code. Use managed Airflow services for complex, dependency-driven pipeline orchestration beyond simple cron.
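What "dependency-driven" orchestration adds over cron is an execution order derived from task dependencies. The sketch below illustrates that idea with Python's standard-library `graphlib`. It is a conceptual stand-in, not Airflow's API, and the task names in the usage note are made up.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def execution_order(dependencies: Dict[str, Set[str]]) -> List[str]:
    """Return a run order in which every task follows all of its upstream tasks.

    `dependencies` maps each task to the set of tasks it depends on.
    Raises graphlib.CycleError (a ValueError subclass) if the graph has a cycle.
    """
    return list(TopologicalSorter(dependencies).static_order())
```

For example, `execution_order({"transform": {"extract_orders", "extract_customers"}, "publish": {"transform"}})` yields both extract tasks before `transform` and `transform` before `publish`. A managed Airflow service adds scheduling, retries, and parallel execution on top of the same graph idea.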

Data Processing & Quality

- Apache Spark (via AWS Glue ETL, Dataproc, Synapse Spark)
- dbt (data build tool) for SQL-based transformations
- Great Expectations / AWS Deequ for data quality validation

Spark is the workhorse for large-scale data transformation. dbt is a widely adopted choice for version-controlled, documented SQL transformations in the curated layer. Data quality tools are non-negotiable for production pipelines to prevent 'garbage in, garbage out'.