Skill Guide

Basic Cloud Computing (AWS/GCP/Azure) for Data

The operational knowledge of provisioning, managing, and utilizing cloud services (compute, storage, databases) from providers like AWS, GCP, or Azure specifically to build, run, and scale data processing and analytics workloads.

It enables organizations to move from costly, inflexible on-premise data infrastructure to elastic, scalable, and managed services, directly reducing time-to-insight and operational overhead. This skill is fundamental for modern data teams to build robust data pipelines, run scalable analytics, and deploy machine learning models efficiently.
1 Careers
1 Categories
8.5 Avg Demand
20% Avg AI Risk

How to Learn Basic Cloud Computing (AWS/GCP/Azure) for Data

Focus on: 1) Core cloud service models (IaaS, PaaS, SaaS) and the shared responsibility model. 2) Foundational services for data: object storage (S3/GCS/Blob), managed relational databases (RDS/Cloud SQL/Azure SQL), and basic compute instances (EC2/Compute Engine/VMs). 3) Fundamental concepts like regions, availability zones, and basic IAM (Identity and Access Management) policies for security.
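As a small illustration of the "basic IAM policies" item above, the sketch below builds a least-privilege, read-only AWS IAM policy document for a single bucket. The bucket name `example-sales-data`, the `raw/` prefix, and the helper name `read_only_bucket_policy` are hypothetical; only the policy grammar (`Version`, `Statement`, `Effect`, `Action`, `Resource`) follows the AWS format.

```python
import json


def read_only_bucket_policy(bucket, prefix=""):
    """Build an IAM policy document granting read-only access to one bucket.

    The bucket and prefix are placeholders supplied by the caller.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself.
                "Sid": "ListBucket",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
            },
            {
                # Reading applies to objects under the prefix.
                "Sid": "ReadObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}*"],
            },
        ],
    }


policy = read_only_bucket_policy("example-sales-data", prefix="raw/")
print(json.dumps(policy, indent=2))
```

The policy grants no write or delete actions, so a compromised analyst credential could read the data but not alter it.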
Move to practice by building data pipelines. Use managed ETL/ELT services (AWS Glue, Dataflow, Azure Data Factory). Learn to choose between serverless (AWS Lambda, Cloud Functions) and containerized (EKS/GKE/AKS) compute for different workloads. Avoid common mistakes like over-provisioning resources, neglecting cost monitoring (AWS Cost Explorer, GCP Billing), or ignoring data egress fees.
Master multi-cloud/hybrid-cloud data architecture design. Focus on strategic cost optimization (Reserved Instances, Committed Use Discounts), advanced security & governance (AWS Lake Formation, Microsoft Purview (formerly Azure Purview), GCP Dataplex), and designing for high availability and disaster recovery. At this level, you mentor teams on cloud-native patterns (data mesh, data lakehouse) and evaluate new cloud services against strategic business needs.

Practice Projects

Beginner
Project

Deploy a Static Data Dashboard on the Cloud

Scenario

You have a CSV file with sales data and need to host a simple web dashboard (e.g., using Streamlit or Dash) that visualizes it, making it accessible to your team.

How to Execute
1. Upload your CSV and dashboard script to a cloud storage bucket (S3/GCS/Blob).
2. Launch a small compute instance (EC2/Compute Engine/VM) and install necessary packages.
3. Pull the code from storage, run the dashboard app, and configure the instance's security group/firewall to allow web traffic on the app's port.
4. Access the dashboard via the instance's public IP address.
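Step 1 might look roughly like the sketch below on AWS. It assumes the caller passes in a boto3 S3 client, whose `upload_file(Filename, Bucket, Key)` method does the transfer. The helper name `upload_dashboard_files`, the `dashboard/` prefix, and the file names in the comment are hypothetical.

```python
from pathlib import Path


def upload_dashboard_files(s3_client, bucket, paths, prefix="dashboard/"):
    """Upload local files to a bucket and return the object keys used.

    s3_client is expected to behave like boto3.client("s3").
    """
    keys = []
    for path in paths:
        # Store each file under the prefix, keeping only its base name.
        key = prefix + Path(path).name
        s3_client.upload_file(str(path), bucket, key)
        keys.append(key)
    return keys


# Typical call, assuming boto3 is installed and credentials are configured:
#   import boto3
#   upload_dashboard_files(boto3.client("s3"), "my-team-bucket",
#                          ["sales.csv", "app.py"])
```

Passing the client in, rather than creating it inside the function, keeps the helper easy to test with a stand-in object.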
Intermediate
Project

Build an Automated, Scheduled Data Pipeline

Scenario

Your company needs a daily report from a third-party API (e.g., weather data, social media metrics). The data must be cleaned, transformed, and stored in a queryable database for analysis.

How to Execute
1. Use a serverless function (Lambda/Cloud Function) or a managed orchestration service (AWS Step Functions, Airflow on Cloud Composer) to trigger a script daily.
2. The script calls the API and transforms the data (e.g., using Pandas in a managed notebook or within the function).
3. Load the cleaned data into a managed data warehouse (Redshift/BigQuery/Synapse).
4. Set up cloud monitoring and alerting (CloudWatch/Cloud Monitoring (formerly Stackdriver)/Azure Monitor) to notify on pipeline failures.
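A minimal sketch of steps 2 and 3 as a Lambda-style handler follows. The API call and the warehouse load are injected as plain functions (`fetch`, `load`) so the transformation logic stays testable. The record fields `city` and `temp_c` and all function names are assumptions for illustration, not part of any provider API.

```python
def clean_records(raw_records):
    """Drop incomplete rows and normalize field types."""
    cleaned = []
    for row in raw_records:
        if row.get("city") is None or row.get("temp_c") is None:
            continue  # skip rows missing a required field
        cleaned.append({
            "city": str(row["city"]).strip().title(),
            "temp_c": round(float(row["temp_c"]), 1),
        })
    return cleaned


def make_handler(fetch, load):
    """Build a handler(event, context) entry point from two callables.

    fetch() returns a list of raw dicts; load(rows) writes cleaned rows
    to the warehouse. Both are supplied by the caller.
    """
    def handler(event, context):
        raw = fetch()
        rows = clean_records(raw)
        load(rows)
        return {
            "status": "ok",
            "rows_loaded": len(rows),
            "rows_dropped": len(raw) - len(rows),
        }
    return handler
```

Returning the loaded and dropped row counts gives the monitoring in step 4 something concrete to alert on, for example a run that loads zero rows.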
Advanced
Project

Design and Deploy a Multi-Source Data Lakehouse with Governance

Scenario

As a data architect, you must design a system that ingests batch (database extracts) and real-time (clickstream) data into a central repository, enabling both SQL analytics and machine learning, while enforcing fine-grained access control and data cataloging.

How to Execute
1. Architect a lakehouse using cloud storage (S3/GCS/Blob) as the base, with a table format (Delta Lake, Iceberg) for ACID transactions.
2. Implement ingestion: use a managed service for change data capture (AWS DMS) for batch and a streaming service (Kinesis/Pub-Sub/Event Hubs) for real-time, landing data in a raw zone.
3. Use a transformation engine (Spark on EMR/Dataproc/Synapse) to create curated zones.
4. Implement a unified governance layer (AWS Lake Formation + Glue Data Catalog, Microsoft Purview (formerly Azure Purview), GCP Dataplex) to manage metadata, lineage, and column-level security, exposing curated tables to consumers via a SQL engine (Athena, BigQuery, Synapse Serverless).
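One common way to lay out the raw zone from step 2 is Hive-style partitioned object keys, so that downstream engines can prune by source and date. The sketch below is one possible convention, not a required one. The `raw/source=.../dt=.../hour=...` layout and the function name `raw_zone_key` are assumptions.

```python
from datetime import datetime, timezone


def raw_zone_key(source, event_time, filename):
    """Return an object key partitioned by source, UTC date, and UTC hour.

    event_time must be timezone-aware so the conversion to UTC is unambiguous.
    """
    if event_time.tzinfo is None:
        raise ValueError("event_time must be timezone-aware")
    t = event_time.astimezone(timezone.utc)
    return f"raw/source={source}/dt={t:%Y-%m-%d}/hour={t:%H}/{filename}"


example = raw_zone_key(
    "clickstream",
    datetime(2024, 3, 5, 12, 30, tzinfo=timezone.utc),
    "events.json",
)
print(example)
```

Normalizing to UTC before partitioning avoids the same event landing in different date folders depending on which region's producer wrote it.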

Tools & Frameworks

Core Data Services (by Provider)

AWS: S3, RDS, Redshift, Glue, Athena, Kinesis, SageMaker
GCP: Cloud Storage, Cloud SQL, BigQuery, Dataflow, Pub/Sub, Vertex AI
Azure: Blob Storage, Azure SQL, Synapse Analytics, Data Factory, Event Hubs, Azure ML

These are the primary managed services for storage, compute, analytics, and ML. The choice depends on existing provider ecosystem, specific feature needs (e.g., BigQuery's serverless SQL), and team expertise. A practitioner must know equivalent services across providers to evaluate trade-offs.

Infrastructure as Code (IaC) & Orchestration

Terraform
AWS CloudFormation
Azure Resource Manager (ARM) Templates
Google Cloud Deployment Manager
Apache Airflow (Managed: MWAA, Cloud Composer, Azure Data Factory Pipelines)

IaC tools are non-negotiable for reproducible, version-controlled cloud environments. Orchestration tools (like managed Airflow) are critical for scheduling, monitoring, and managing complex data pipelines across multiple services.
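To make "reproducible, version-controlled" concrete, the sketch below generates a small AWS CloudFormation template from Python and serializes it to JSON, which could then be committed and deployed. The logical ID `PipelineDataBucket` and the bucket name are hypothetical. The template keys (`AWSTemplateFormatVersion`, `Resources`, `AWS::S3::Bucket`, `VersioningConfiguration`) follow the CloudFormation format.

```python
import json


def bucket_template(logical_id, bucket_name):
    """Return a CloudFormation template declaring one versioned S3 bucket."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Versioned S3 bucket for pipeline data (illustrative)",
        "Resources": {
            logical_id: {
                "Type": "AWS::S3::Bucket",
                "Properties": {
                    "BucketName": bucket_name,
                    # Keep prior object versions so overwrites are recoverable.
                    "VersioningConfiguration": {"Status": "Enabled"},
                },
            }
        },
    }


template = bucket_template("PipelineDataBucket", "example-pipeline-data")
template_json = json.dumps(template, indent=2, sort_keys=True)
print(template_json)
```

Because the output is deterministic text, a change to the infrastructure shows up as an ordinary diff in code review.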

Cost Management & Monitoring

AWS Cost Explorer & Budgets
GCP Billing Reports & Budget Alerts
Azure Cost Management + Billing
CloudWatch / Cloud Monitoring / Azure Monitor
Datadog / Grafana (for multi-cloud)

Cloud cost is a direct operational expense. These tools are used proactively to set budgets, analyze spending by service/tag, and set alerts. Monitoring tools track resource utilization and application performance, which is essential for optimizing cost and ensuring pipeline reliability.
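The "analyze spending by service/tag and set alerts" workflow can be sketched in a few lines. The example assumes billing line items have already been exported as dictionaries with a `cost_usd` field and a `tags` mapping. That record shape, the sample figures, and the 80% threshold are all illustrative assumptions rather than any provider's export format.

```python
from collections import defaultdict


def spend_by_tag(line_items, tag_key):
    """Sum cost per value of tag_key; items without the tag go to 'untagged'."""
    totals = defaultdict(float)
    for item in line_items:
        owner = item.get("tags", {}).get(tag_key, "untagged")
        totals[owner] += item["cost_usd"]
    return dict(totals)


def budget_alerts(totals, budgets, threshold=0.8):
    """Return owners whose spend has reached threshold * budget, sorted by name."""
    return sorted(
        owner
        for owner, spent in totals.items()
        if owner in budgets and spent >= threshold * budgets[owner]
    )


# Hypothetical month-to-date line items.
items = [
    {"service": "warehouse", "cost_usd": 120.0, "tags": {"team": "analytics"}},
    {"service": "storage", "cost_usd": 50.0, "tags": {"team": "analytics"}},
    {"service": "training", "cost_usd": 40.0, "tags": {"team": "ml"}},
    {"service": "compute", "cost_usd": 10.0},
]
totals = spend_by_tag(items, "team")
alerts = budget_alerts(totals, {"analytics": 200.0, "ml": 100.0})
print(totals, alerts)
```

The `untagged` bucket is worth watching on its own: spend that nobody owns is usually the first place idle resources hide.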

Interview Questions

Answer Strategy

Tests knowledge of NoSQL vs. SQL trade-offs and service selection. Use a decision framework: 1) Identify data model (semi-structured -> NoSQL). 2) Identify access pattern (low latency, key-value -> DynamoDB/Cosmos DB/Bigtable). 3) Discuss scaling: provisioned vs. on-demand capacity, partition key design to avoid hotspots, and indexing strategies for query patterns.

Sample: 'I would choose a managed NoSQL database like DynamoDB. It's optimized for JSON document storage and single-digit millisecond latency. For scaling, the critical factor is partition key design to distribute traffic evenly. I'd start with on-demand capacity mode for unpredictable workloads, then move to provisioned capacity with auto-scaling once patterns are clear, while using Global Tables if multi-region replication is required.'
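The point about partition key design can be illustrated with a small simulation. Managed NoSQL stores do not expose their internal hash function, so the MD5-based `partition_for` below is only a stand-in. The `#<shard>` suffix convention and the function names are assumptions for illustration.

```python
import hashlib
from collections import Counter


def partition_for(key, num_partitions):
    """Map a key to a partition with a stable hash (stand-in for the store's own)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions


def sharded_key(base_key, item_id, num_shards):
    """Append a deterministic shard suffix so one hot key becomes several keys."""
    return f"{base_key}#{partition_for(item_id, num_shards)}"


NUM_PARTITIONS = 8
event_ids = [f"event-{i}" for i in range(1000)]

# Every write uses the date as the partition key: one partition takes all traffic.
hot = Counter(partition_for("2024-03-05", NUM_PARTITIONS) for _ in event_ids)

# The same writes with a shard suffix derived from the item id.
spread = Counter(
    partition_for(sharded_key("2024-03-05", e, 10), NUM_PARTITIONS)
    for e in event_ids
)
print("single key:", dict(hot))
print("sharded key:", dict(spread))
```

The trade-off is on the read side: fetching everything for one date now means querying each shard suffix and merging the results.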

Answer Strategy

Tests systematic debugging and cost-optimization skills. Use a structured approach: 1) **Isolate the bottleneck**: Was it ingestion, transformation, or loading? Use cloud monitoring to identify slow stages. 2) **Diagnose root cause**: For cost, was it idle resources, data scanning in queries, or data egress? For speed, was it undersized compute, poor serialization, or lack of partitioning? 3) **Implement & Validate**: Apply fix (e.g., change file format from CSV to Parquet, right-size instance, add filtering earlier in the pipeline) and measure improvement.

Sample: 'Our nightly Spark job on EMR was taking 6 hours. Using CloudWatch and Spark UI, I found the shuffle stage was the bottleneck due to skewed joins. I implemented salting on the join key to distribute the load evenly and switched the output format from CSV to Parquet with Snappy compression. This reduced runtime to 45 minutes and cut S3 storage costs by 70%.'
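The salting technique mentioned in the sample answer can be shown without a cluster. The pure-Python sketch below is a simplified model, not Spark code. It salts the large, skewed side round-robin (Spark jobs typically use a random salt), replicates the small side once per salt, and checks that the join result is unchanged. The data and function names are hypothetical.

```python
from collections import defaultdict


def hash_join(left, right):
    """Inner-join two lists of (key, value) pairs on key."""
    index = defaultdict(list)
    for key, value in right:
        index[key].append(value)
    return [(k, lv, rv) for k, lv in left for rv in index.get(k, [])]


def salt_keys(facts, num_salts):
    """Spread each key of the large side over num_salts sub-keys."""
    return [((k, i % num_salts), v) for i, (k, v) in enumerate(facts)]


def salted_join(facts, dims, num_salts):
    """Join via salted keys, then strip the salt from the result."""
    salted_facts = salt_keys(facts, num_salts)
    # Replicate each small-side row once per salt so every sub-key finds a match.
    salted_dims = [((k, s), v) for k, v in dims for s in range(num_salts)]
    joined = hash_join(salted_facts, salted_dims)
    return [(k, lv, rv) for (k, _salt), lv, rv in joined]


def largest_group(pairs):
    """Size of the biggest key group, i.e. the work of the slowest task."""
    counts = defaultdict(int)
    for key, _ in pairs:
        counts[key] += 1
    return max(counts.values())


# Skewed input: 90% of rows share one key.
facts = [("US", i) for i in range(90)] + [("NZ", i) for i in range(10)]
dims = [("US", "United States"), ("NZ", "New Zealand")]

assert sorted(salted_join(facts, dims, 4)) == sorted(hash_join(facts, dims))
print(largest_group(facts), largest_group(salt_keys(facts, 4)))
```

The cost of salting is the replication of the small side, so it pays off only when that side is small relative to the skewed one.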
