Skill Guide

Cloud data architecture design across AWS (S3, Glue, Athena, Bedrock), GCP (BigQuery, Vertex AI), or Azure (Synapse, OpenAI Service)

Cloud data architecture design is the practice of building scalable, cost-effective, and secure data systems by composing and integrating the native data processing, analytics, and AI/ML services of one major cloud provider (AWS, GCP, or Azure).

This skill is critical because it directly determines an organization's ability to unlock value from data, enabling faster insights, operational efficiency, and the development of AI-powered products. Poor architecture leads to spiraling costs, security vulnerabilities, and project failure, while expert design becomes a strategic competitive advantage.

1 Careers

1 Categories

9.1 Avg Demand

20% Avg AI Risk

How to Learn Cloud data architecture design across AWS (S3, Glue, Athena, Bedrock), GCP (BigQuery, Vertex AI), or Azure (Synapse, OpenAI Service)

Focus on: 1) Understanding core cloud concepts (regions, Availability Zones, IAM, VPC). 2) Mastering the data lifecycle (ingestion, storage, processing, serving) on ONE primary cloud (e.g., AWS). 3) Learning the specific purpose, pricing model, and integration points of key managed services (e.g., S3 as a data lake, Glue for serverless ETL, Athena for ad hoc SQL queries).

Move to practice by: 1) Designing and implementing a complete, end-to-end data pipeline for a specific use case (e.g., clickstream analysis). 2) Focusing on cost optimization techniques (storage tiering, spot instances for Spark jobs, and partitioning data so that queries scan only the partitions they need). 3) Addressing common pitfalls like vendor lock-in, over-engineering, and inadequate data governance from the start.
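The effect of partitioning on query cost can be estimated with simple arithmetic when the engine bills per byte scanned. The sketch below is illustrative only: the price of 5 USD per terabyte, the decimal terabyte, and the table of 2 GB daily partitions are all assumptions, so check the provider's current pricing page before using the numbers.

```python
# Assumed on-demand price per terabyte scanned; confirm against the
# provider's current pricing page before relying on it.
ASSUMED_PRICE_PER_TB_USD = 5.0
BYTES_PER_TB = 10**12  # simplifying assumption: decimal terabytes


def scan_cost_usd(bytes_scanned: int,
                  price_per_tb: float = ASSUMED_PRICE_PER_TB_USD) -> float:
    """Cost of one query that scans `bytes_scanned` bytes."""
    return bytes_scanned / BYTES_PER_TB * price_per_tb


def partitioned_scan_bytes(bytes_per_partition: int, partitions_read: int) -> int:
    """Bytes scanned when the engine prunes down to `partitions_read` partitions."""
    return bytes_per_partition * partitions_read


# Hypothetical table: one partition per day, 2 GB each, one year of data.
DAILY_PARTITION_BYTES = 2 * 10**9

# A query with no partition filter reads all 365 partitions.
full_year = scan_cost_usd(partitioned_scan_bytes(DAILY_PARTITION_BYTES, 365))
# A query filtered to one week reads only 7 partitions.
one_week = scan_cost_usd(partitioned_scan_bytes(DAILY_PARTITION_BYTES, 7))

print(f"full scan: ${full_year:.2f}, one week: ${one_week:.2f}")
```

Under these assumptions the cost scales linearly with the number of partitions read, which is why a date filter on a date-partitioned table is usually the first optimization to try.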

Master the skill by: 1) Designing multi-cloud or hybrid-cloud data platforms for large-scale, mission-critical systems. 2) Architecting for real-time, low-latency use cases (e.g., fraud detection) alongside batch processing. 3) Making strategic build-vs-buy decisions for components, leading vendor evaluations, and establishing enterprise-wide data architecture standards and patterns.

Practice Projects

Beginner

Project

Serverless ETL & Analytics Pipeline on AWS

Scenario

A retail company needs to analyze daily sales data from CSV files uploaded to an S3 bucket. They want to query aggregated results (total sales by product category) without managing servers.

How to Execute

1. Create an S3 bucket with a raw/ and processed/ folder structure.
2. Configure an AWS Glue Crawler to infer the schema of the raw CSV files and create a table in the Glue Data Catalog.
3. Write a simple Glue ETL job (Python Spark script) to read the raw data, perform a grouping operation, and write the result in Parquet format to processed/.
4. Register the processed/ output as a table in the Data Catalog (for example, by running a second crawler over processed/), then run Athena queries against that table to generate reports and validate the entire pipeline.
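Before writing the Glue job, it can help to prototype the grouping step locally on a small sample. The sketch below uses only the Python standard library rather than Spark or the Glue API, and the column names `order_id`, `category`, and `amount` are hypothetical stand-ins for whatever the real CSV files contain.

```python
import csv
import io
from collections import defaultdict
from decimal import Decimal


def total_sales_by_category(csv_text: str) -> dict[str, Decimal]:
    """Sum the `amount` column per `category` (both column names are hypothetical)."""
    totals: dict[str, Decimal] = defaultdict(Decimal)  # missing keys start at 0
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Decimal avoids the rounding drift of binary floats on currency values.
        totals[row["category"]] += Decimal(row["amount"])
    return dict(totals)


# Tiny hand-made sample standing in for one day's upload to raw/.
SAMPLE = """order_id,category,amount
1,toys,19.99
2,books,5.00
3,toys,0.01
"""

if __name__ == "__main__":
    for category, total in sorted(total_sales_by_category(SAMPLE).items()):
        print(category, total)
```

The same logic maps onto a group-by and sum in the Spark job; having expected totals from a local sample gives something concrete to compare the Athena results against.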

Intermediate

Project

Multi-Source Data Lake with Schema Evolution on GCP

Scenario

A fintech startup ingests real-time transaction data via Pub/Sub and batch customer profile updates via Cloud Storage. The unified data must be queryable for a marketing segmentation ML model, and the schema of transaction fields may change over time.

How to Execute

1. Design a layered layout in Cloud Storage (bronze/silver/gold pattern), with the bronze layer acting as the landing zone.
2. Implement a streaming Dataflow job to consume from Pub/Sub, apply basic validation, and write to the bronze layer.
3. Use BigQuery as the analytical warehouse. Create an external table or a scheduled query to load data from GCS, handling schema evolution with BigQuery's schema auto-detection or with explicit schema updates (for example, adding new nullable columns).
4. Build and run a Vertex AI training job that reads directly from BigQuery to create the segmentation model, demonstrating the end-to-end flow from raw data to ML consumption.
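One way to make the schema evolution step explicit is a local pre-check that only lets additive changes through before a load is attempted. The sketch below is not a BigQuery API; the `Field` class and `check_additive_evolution` function are hypothetical, and the rules are loosely modeled on the additive changes BigQuery accepts (new columns that are not REQUIRED, and relaxing REQUIRED to NULLABLE).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str
    type: str
    mode: str = "NULLABLE"  # NULLABLE, REQUIRED, or REPEATED


def check_additive_evolution(current: list[Field], incoming: list[Field]) -> list[str]:
    """Return a list of problems; an empty list means the change is additive."""
    problems = []
    incoming_by_name = {f.name: f for f in incoming}
    current_names = {f.name for f in current}

    # Existing fields must survive with the same type; the only mode change
    # allowed here is relaxing REQUIRED to NULLABLE.
    for old in current:
        new = incoming_by_name.get(old.name)
        if new is None:
            problems.append(f"field removed: {old.name}")
        elif new.type != old.type:
            problems.append(f"type changed: {old.name} {old.type} -> {new.type}")
        elif new.mode != old.mode and not (
            old.mode == "REQUIRED" and new.mode == "NULLABLE"
        ):
            problems.append(f"mode changed: {old.name} {old.mode} -> {new.mode}")

    # New fields cannot be REQUIRED, because existing rows have no value for them.
    for new in incoming:
        if new.name not in current_names and new.mode == "REQUIRED":
            problems.append(f"new field must not be REQUIRED: {new.name}")
    return problems
```

A streaming job could run a check like this when it sees a message with an unfamiliar shape and route rejected records to a quarantine location instead of failing the whole load.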

Advanced

Project

Enterprise Data Mesh Architecture on Azure

Scenario

A large manufacturing enterprise wants to decentralize data ownership. Each business unit (Supply Chain, Manufacturing, Sales) must own its domain data products (e.g., 'Supplier Quality Score', 'Factory OEE') on Azure, while ensuring global discoverability, governance, and standardized access via APIs.

How to Execute

1. Establish a central platform team to provision standardized, self-service Azure resources (e.g., per-domain Synapse workspace, ADLS Gen2 storage, Microsoft Purview for cataloging).
2. Define and enforce federated computational governance using Azure Policy and custom RBAC roles.
3. Architect a domain team's data product: ingest source data into the domain's ADLS account, process it with dedicated Spark pools in Synapse, and serve it as an API using Azure Functions or as a certified dataset in Purview.
4. Implement a global data product discovery portal using Purview and a centralized monitoring dashboard for SLAs (latency, freshness, quality).
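The SLA monitoring in the last step comes down to comparing each data product's observed status against the thresholds its domain team has published. A minimal sketch of that comparison follows; the class names, fields, and thresholds are hypothetical, and it covers only freshness and quality, not latency or any Azure monitoring API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class DataProductSla:
    name: str
    max_staleness: timedelta   # how old the latest refresh may be
    min_quality_score: float   # share of rows passing validation, 0.0 to 1.0


@dataclass(frozen=True)
class DataProductStatus:
    last_refreshed: datetime
    quality_score: float


def sla_breaches(sla: DataProductSla, status: DataProductStatus,
                 now: datetime) -> list[str]:
    """Return human-readable breaches; an empty list means the SLA is met."""
    breaches = []
    staleness = now - status.last_refreshed
    if staleness > sla.max_staleness:
        breaches.append(f"freshness: {sla.name} is stale by {staleness - sla.max_staleness}")
    if status.quality_score < sla.min_quality_score:
        breaches.append(
            f"quality: {sla.name} scored {status.quality_score}, "
            f"below {sla.min_quality_score}"
        )
    return breaches


if __name__ == "__main__":
    sla = DataProductSla("Supplier Quality Score", timedelta(hours=24), 0.99)
    status = DataProductStatus(datetime.now(timezone.utc) - timedelta(hours=30), 0.97)
    for breach in sla_breaches(sla, status, datetime.now(timezone.utc)):
        print(breach)
```

Passing `now` in explicitly keeps the check deterministic, so the same function can back both a dashboard and a unit test.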

Tools & Frameworks

Software & Platforms

AWS Well-Architected Framework (Data Analytics Lens), Google Cloud Architecture Framework, Azure Cloud Adoption Framework - Data Landing Zone

Use these as the foundational checklists and design principles for any architecture. They provide cloud-vendor-specific best practices for security, reliability, cost optimization, and operational excellence.

Data Infrastructure as Code (IaC)

Terraform, AWS CDK / CloudFormation, Azure Bicep / ARM

Essential for creating repeatable, version-controlled, and auditable cloud infrastructure. Terraform is the most widely used multi-cloud option, while vendor-specific tools offer deeper integration with their own platform. Use these tools to provision S3 buckets, Glue jobs, BigQuery datasets, and Synapse pools.
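As a small illustration of infrastructure defined as data, the sketch below builds a Terraform configuration in Terraform's JSON syntax (normally saved in a file ending in `.tf.json`) from Python. The logical name `raw_zone` and the bucket name are hypothetical, and a real configuration would also need provider and backend settings that are omitted here.

```python
import json


def s3_bucket_config(logical_name: str, bucket_name: str) -> dict:
    """Build a Terraform JSON document declaring one aws_s3_bucket resource."""
    return {
        "resource": {
            "aws_s3_bucket": {
                # logical_name is Terraform's local label, not the bucket name.
                logical_name: {"bucket": bucket_name}
            }
        }
    }


# Hypothetical names for the raw zone of a data lake.
config = s3_bucket_config("raw_zone", "example-retail-raw-zone")
rendered = json.dumps(config, indent=2)
print(rendered)
```

Generating configuration this way is only one option; hand-written HCL, or the CDK-style tools listed above, serve the same purpose of keeping infrastructure in version control.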

Data Modeling & Governance

dbt (data build tool), Apache Iceberg / Delta Lake, AWS Lake Formation / Microsoft Purview (formerly Azure Purview)

dbt is a widely adopted tool for version-controlled, SQL-based transformation logic. Iceberg and Delta Lake add ACID transactions and time travel to cloud data lakes. Governance tools (Lake Formation, Purview) are used to define fine-grained access policies and to catalog data assets enterprise-wide.