Skill Guide

Infrastructure as Code (Terraform, Pulumi) for AI infrastructure provisioning

The practice of using declarative or imperative code to automate the provisioning, configuration, and lifecycle management of specialized compute (GPU/TPU clusters), storage, and networking resources required for machine learning model training and inference.

It reduces manual provisioning errors, enables the reproducible environments that ML experiment consistency depends on, and can cut infrastructure deployment time from days to minutes. This accelerates the ML development cycle, reduces cloud cost waste through precise resource management, and lets compliance and security standards be built into every deployment.

1 Careers

1 Categories

9.2 Avg Demand

15% Avg AI Risk

How to Learn Infrastructure as Code (Terraform, Pulumi) for AI infrastructure provisioning

1. Core IaC Concepts: Understand declarative vs. imperative paradigms, state management (Terraform state, Pulumi state), and idempotency.
2. Cloud Provider Fundamentals: Gain proficiency in one major cloud's (AWS, GCP, Azure) resource models for compute (EC2/GCE/Azure VMs, managed instance groups), storage (S3/GCS/Blob), and networking (VPCs, subnets, security groups).
3. Basic Tool Syntax: Write and apply simple configurations in either Terraform HCL or a Pulumi language (Python/TypeScript) to provision a single VM with a public IP.
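To make the ideas of desired state, recorded state, and idempotency concrete, here is a minimal sketch in plain Python. It is a toy model, not how Terraform or Pulumi are implemented; the `plan` and `apply` functions and the resource names are hypothetical.

```python
from typing import Dict, List, Tuple

Resource = Dict[str, str]  # attribute name -> value


def plan(desired: Dict[str, Resource], actual: Dict[str, Resource]) -> List[Tuple[str, str]]:
    """Return the (action, name) steps needed to make `actual` match `desired`."""
    steps = []
    for name, attrs in desired.items():
        if name not in actual:
            steps.append(("create", name))
        elif actual[name] != attrs:
            steps.append(("update", name))
    for name in actual:
        if name not in desired:
            steps.append(("delete", name))
    return steps


def apply(desired: Dict[str, Resource], actual: Dict[str, Resource]) -> Dict[str, Resource]:
    """Carry out the plan and return the new recorded state."""
    new_state = dict(actual)
    for action, name in plan(desired, actual):
        if action == "delete":
            del new_state[name]
        else:
            new_state[name] = dict(desired[name])
    return new_state


# Hypothetical desired configuration: one GPU VM and one bucket.
desired = {
    "gpu_vm": {"machine_type": "g4dn.xlarge"},
    "bucket": {"versioning": "on"},
}

state: Dict[str, Resource] = {}
first_plan = plan(desired, state)   # everything must be created
state = apply(desired, state)
second_plan = plan(desired, state)  # nothing left to do: applying again is a no-op
```

The second plan being empty is the idempotency property: re-running the same configuration against an up-to-date state changes nothing.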

1. Modularization & Reusability: Refactor monolithic scripts into reusable modules (Terraform modules, Pulumi components) for common AI building blocks like a 'GPU Instance Group' or 'Managed Kubernetes Cluster'.
2. State Collaboration & Locking: Implement remote state backends with state locking for team workflows (for example, S3 with DynamoDB-based locking, or GCS, which locks state natively).
3. CI/CD Pipeline Integration: Automate `terraform plan/apply` or `pulumi up` within a GitOps pipeline (GitHub Actions, GitLab CI) with manual approval gates for production environments.
Common Mistake: Keeping unrelated resources in one large state, so that a small change can trigger unintended resource recreation elsewhere. Split state into smaller stacks linked by remote state or stack references, and treat `-target` as an exceptional recovery tool rather than a routine workflow.
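The state locking idea above can be sketched with an atomic "create only if absent" operation, which is the same guarantee a remote backend's lock provides. This is an illustration on the local filesystem only; `state_lock`, `StateLockedError`, and the lock file name are hypothetical and do not reflect any backend's actual implementation.

```python
import os
import tempfile
from contextlib import contextmanager
from typing import List


class StateLockedError(RuntimeError):
    """Raised when another run already holds the state lock."""


@contextmanager
def state_lock(lock_path: str, owner: str):
    """Hold an exclusive lock for the duration of a plan/apply."""
    try:
        # O_CREAT | O_EXCL fails atomically if the file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        with open(lock_path) as f:
            holder = f.read()
        raise StateLockedError(f"state is locked by {holder}")
    with os.fdopen(fd, "w") as f:
        f.write(owner)
    try:
        yield
    finally:
        os.remove(lock_path)  # release the lock even if the run fails


def demo() -> List[str]:
    events = []
    with tempfile.TemporaryDirectory() as workdir:
        lock = os.path.join(workdir, "state.lock")
        with state_lock(lock, "alice"):
            events.append("alice acquired")
            try:
                with state_lock(lock, "bob"):  # concurrent run is rejected
                    events.append("bob acquired")
            except StateLockedError as err:
                events.append(str(err))
        with state_lock(lock, "bob"):  # succeeds once alice has released
            events.append("bob acquired")
    return events
```

A rejected run should fail fast and report who holds the lock, rather than wait silently, so that a stale lock left by a crashed job is noticed.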

1. Complex System Design: Architect multi-environment (dev/stage/prod), multi-region deployments with shared state, using workspaces or stack references. Design cost-optimized clusters using spot/preemptible instances with custom replacement policies.
2. Policy as Code: Enforce organizational standards (security, cost, naming) using tools like Sentinel (Terraform) or CrossGuard (Pulumi) integrated into the CI/CD pipeline.
3. Strategic Tool Selection: Mentor teams on choosing between Terraform (for its ecosystem and declarative nature) and Pulumi (for its use of general-purpose languages and complex logic) based on project requirements and team skillsets.
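One common way to handle multiple environments is a shared baseline plus small per-environment overrides, which is roughly what per-workspace variable files or per-stack configuration give you. The sketch below assumes hypothetical setting names and values (`gpu_nodes`, `capacity`, the region); it is not a recommendation for any particular sizing.

```python
from typing import Dict, Union

Setting = Union[int, str]

# Shared defaults for every environment (illustrative values).
BASE: Dict[str, Setting] = {
    "gpu_nodes": 2,
    "capacity": "spot",  # spot/preemptible capacity by default to save cost
    "region": "us-east-1",
}

# Only the differences from BASE are recorded per environment.
OVERRIDES: Dict[str, Dict[str, Setting]] = {
    "dev": {"gpu_nodes": 1},
    "stage": {},
    "prod": {"gpu_nodes": 8, "capacity": "on-demand"},
}


def settings_for(env: str) -> Dict[str, Setting]:
    """Merge the baseline with one environment's overrides."""
    if env not in OVERRIDES:
        raise ValueError(f"unknown environment: {env}")
    merged = {**BASE, **OVERRIDES[env]}
    merged["name_prefix"] = f"ml-{env}"  # keeps resource names distinct per environment
    return merged
```

Keeping the overrides small makes it easy to review how production differs from staging, which is usually where configuration drift between environments starts.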

Practice Projects

Beginner

Project

Provision a Single-Node ML Training Environment

Scenario

A data scientist needs a repeatable, disposable environment with a specific GPU (e.g., NVIDIA T4), the latest NVIDIA drivers, Docker, and a defined set of firewall rules for SSH access.

How to Execute

1. Write a Terraform/Pulumi config to create a VM with a specified GPU machine type (e.g., `g4dn.xlarge`).
2. Use a provisioner or Pulumi command resource to run a shell script that installs NVIDIA drivers and Docker.
3. Define a security group/firewall rule allowing SSH (port 22) only from a specific IP range.
4. Output the public IP and provide a command to SSH into the instance.
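The steps above can be sketched by generating a Terraform JSON configuration (Terraform reads `*.tf.json` files alongside HCL) from Python. This is a sketch under several assumptions: AWS as the cloud, the default VPC, credentials and region supplied through the environment, and `user_data` standing in for the provisioner step. The resource names, the `ami_id` and `key_name` variables, and the bootstrap script body are placeholders.

```python
import json

# Placeholder bootstrap script: the real driver and Docker install commands
# depend on the chosen image and are intentionally left out.
BOOTSTRAP = "#!/bin/bash\n# Placeholder: install NVIDIA drivers and Docker here.\n"


def single_node_config(instance_type: str, ssh_cidr: str) -> dict:
    """Build a Terraform JSON document for one GPU VM with restricted SSH."""
    sg_id = "${aws_security_group.trainer.id}"
    return {
        "variable": {
            "ami_id": {"type": "string"},    # image to boot, supplied by the user
            "key_name": {"type": "string"},  # existing EC2 key pair for SSH
        },
        "resource": {
            "aws_security_group": {"trainer": {"name": "ml-trainer"}},
            "aws_security_group_rule": {
                "ssh_in": {  # SSH only from the given range
                    "type": "ingress",
                    "from_port": 22,
                    "to_port": 22,
                    "protocol": "tcp",
                    "cidr_blocks": [ssh_cidr],
                    "security_group_id": sg_id,
                },
                "all_out": {  # outbound access so the bootstrap can download packages
                    "type": "egress",
                    "from_port": 0,
                    "to_port": 0,
                    "protocol": "-1",
                    "cidr_blocks": ["0.0.0.0/0"],
                    "security_group_id": sg_id,
                },
            },
            "aws_instance": {
                "trainer": {
                    "ami": "${var.ami_id}",
                    "instance_type": instance_type,
                    "key_name": "${var.key_name}",
                    "vpc_security_group_ids": [sg_id],
                    "user_data": BOOTSTRAP,
                }
            },
        },
        "output": {
            "public_ip": {"value": "${aws_instance.trainer.public_ip}"},
        },
    }


config = single_node_config("g4dn.xlarge", "203.0.113.0/24")
rendered = json.dumps(config, indent=2)
```

Writing `rendered` to a file such as `main.tf.json` and running `terraform init` followed by `terraform plan` would show the resources to be created. The SSH login user depends on the image, so the final SSH command is left to the reader.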

Intermediate

Project

Deploy a Scalable, Managed Kubernetes Cluster for Model Serving

Scenario

Deploy an EKS/GKE/AKS cluster with node pools configured for inference workloads, integrated with a container registry, and a separate node pool for monitoring tools like Prometheus.

How to Execute

1. Create a reusable module for a managed Kubernetes cluster.
2. Define two node pools: one with CPU-optimized instances for monitoring and one with GPU instances for inference pods, using taints and tolerations.
3. Integrate a Helm provider/release to deploy Prometheus and Grafana onto the monitoring node pool.
4. Use a separate stack or module to provision a private container registry (ECR, Google Artifact Registry, ACR) and configure the cluster's service account to pull images from it.
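To see why taints and tolerations keep monitoring pods off the GPU pool, the following sketch models the matching rule in plain Python. It is a simplified rendering of the Kubernetes scheduling rule, not the scheduler's code; the taint key and value (`nvidia.com/gpu=present`) are illustrative.

```python
from typing import Dict, List

Taint = Dict[str, str]       # keys: key, value, effect
Toleration = Dict[str, str]  # keys: key, operator, value, effect (all optional)


def tolerates(toleration: Toleration, taint: Taint) -> bool:
    """Return True if one toleration matches one taint."""
    effect = toleration.get("effect", "")
    if effect and effect != taint["effect"]:
        return False  # an empty effect matches every effect
    if toleration.get("operator", "Equal") == "Exists":
        key = toleration.get("key", "")
        return key == "" or key == taint["key"]  # empty key matches every key
    return (
        toleration.get("key") == taint["key"]
        and toleration.get("value", "") == taint.get("value", "")
    )


def can_schedule(pod_tolerations: List[Toleration], node_taints: List[Taint]) -> bool:
    """A pod fits a node only if every hard taint is tolerated."""
    return all(
        any(tolerates(tol, taint) for tol in pod_tolerations)
        for taint in node_taints
        if taint["effect"] in ("NoSchedule", "NoExecute")  # PreferNoSchedule is soft
    )


gpu_pool_taints = [{"key": "nvidia.com/gpu", "value": "present", "effect": "NoSchedule"}]
monitoring_pool_taints: List[Taint] = []

inference_pod = [{"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}]
prometheus_pod: List[Toleration] = []
```

Taints only repel pods. To make inference pods land on the GPU pool rather than merely be allowed there, the pod spec also needs a node selector or node affinity for that pool.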

Advanced

Project

Implement a Self-Service ML Platform with Cost Governance

Scenario

Build an internal platform where ML engineers can request pre-approved, compliant infrastructure stacks (e.g., 'Training Cluster', 'Batch Inference Pipeline') via a service catalog, with automated cost allocation tagging and budgets.

How to Execute

1. Develop a library of hardened, versioned Terraform/Pulumi modules representing the approved infrastructure patterns.
2. Build a thin API/UI layer that allows users to select a pattern, input parameters (team, project, duration), and trigger a deployment via a CI/CD pipeline.
3. Integrate Policy as Code to automatically inject mandatory tags (cost_center, owner) and enforce limits (e.g., max vCPU count, no public IPs on storage).
4. Implement automated cost reporting by querying cloud billing APIs and correlating with the tags applied during provisioning.
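A minimal sketch of the request-handling and cost-reporting logic described above, in plain Python. Everything here is hypothetical: the pattern names, the vCPU limits, the mapping of the team to `cost_center`, and the shape of the billing rows, which in practice would come from the cloud provider's billing export.

```python
from collections import defaultdict
from typing import Dict, List

# Pre-approved patterns and their limits (illustrative values).
APPROVED_PATTERNS = {
    "training-cluster": {"max_vcpus": 96},
    "batch-inference": {"max_vcpus": 32},
}


def build_deployment(pattern: str, team: str, project: str, requester: str, vcpus: int) -> dict:
    """Validate a self-service request and inject the mandatory tags."""
    if pattern not in APPROVED_PATTERNS:
        raise ValueError(f"pattern {pattern!r} is not in the catalog")
    limit = APPROVED_PATTERNS[pattern]["max_vcpus"]
    if vcpus > limit:
        raise ValueError(f"{vcpus} vCPUs exceeds the limit of {limit} for {pattern}")
    return {
        "pattern": pattern,
        "vcpus": vcpus,
        # Tags are set by the platform, never typed in by the user.
        "tags": {"cost_center": team, "owner": requester, "project": project},
    }


def cost_by_tag(billing_rows: List[dict], tag: str) -> Dict[str, float]:
    """Sum cost per tag value; rows missing the tag are grouped as 'untagged'."""
    totals: Dict[str, float] = defaultdict(float)
    for row in billing_rows:
        totals[row.get("tags", {}).get(tag, "untagged")] += row["cost"]
    return dict(totals)


request = build_deployment("training-cluster", "vision", "detector-v2", "dana", 64)

billing_rows = [
    {"cost": 12.5, "tags": {"cost_center": "vision"}},
    {"cost": 7.5, "tags": {"cost_center": "vision"}},
    {"cost": 3.0, "tags": {"cost_center": "nlp"}},
    {"cost": 1.0, "tags": {}},
]
report = cost_by_tag(billing_rows, "cost_center")
```

Tracking the `untagged` bucket is a useful health metric: if it grows, resources are being created outside the platform.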

Tools & Frameworks

Software & Platforms

Terraform, Pulumi, AWS CloudFormation, Google Cloud Deployment Manager

Terraform is the industry standard for declarative, cloud-agnostic IaC. Pulumi allows using general-purpose languages (Python, Go, TypeScript) for imperative logic. CloudFormation and Deployment Manager are native, deeply integrated alternatives for their respective clouds but lack portability.

State Management & Collaboration

Terraform Cloud/Enterprise, Pulumi Cloud, AWS S3 + DynamoDB, Google Cloud Storage

Essential for storing infrastructure state securely, enabling team collaboration, and providing state locking to prevent concurrent modifications. The cloud-native options (S3 with a DynamoDB lock table, or GCS, which locks state natively) are cost-effective for small teams; managed services (TF Cloud, Pulumi Cloud) offer UI, policy, and RBAC features.

Policy & Security Enforcement

HashiCorp Sentinel, Pulumi CrossGuard, Open Policy Agent (OPA), Checkov

Tools for defining and enforcing compliance rules (e.g., 'no public S3 buckets', 'all VMs must be in specific regions') as code, integrated into the deployment pipeline. Checkov scans IaC templates for misconfigurations pre-deployment.
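As a rough illustration of what such a rule checks, the sketch below scans the JSON form of a Terraform plan (as produced by `terraform show -json` on a saved plan) for resources that are missing mandatory tags. It is a hand-rolled stand-in for a real policy engine, not Sentinel, CrossGuard, OPA, or Checkov syntax. The required tag set and the sample plan are hypothetical.

```python
import json
from typing import List

REQUIRED_TAGS = {"cost_center", "owner"}

# Trimmed, hand-written example of a plan document for illustration.
SAMPLE_PLAN = json.loads("""
{
  "resource_changes": [
    {"address": "aws_instance.trainer", "type": "aws_instance",
     "change": {"actions": ["create"],
                "after": {"instance_type": "g4dn.xlarge",
                          "tags": {"owner": "dana"}}}},
    {"address": "aws_s3_bucket.data", "type": "aws_s3_bucket",
     "change": {"actions": ["create"],
                "after": {"tags": {"owner": "dana", "cost_center": "vision"}}}},
    {"address": "aws_instance.old", "type": "aws_instance",
     "change": {"actions": ["delete"], "after": null}}
  ]
}
""")


def missing_tag_violations(plan: dict) -> List[str]:
    """Return one message per created or updated resource lacking required tags."""
    violations = []
    for rc in plan.get("resource_changes", []):
        actions = rc["change"]["actions"]
        if "create" not in actions and "update" not in actions:
            continue  # deletions and no-ops are not checked
        after = rc["change"].get("after") or {}
        if "tags" not in after:
            continue  # resource type has no tags attribute
        missing = REQUIRED_TAGS - set(after["tags"] or {})
        if missing:
            violations.append(f"{rc['address']}: missing tags {sorted(missing)}")
    return violations


violations = missing_tag_violations(SAMPLE_PLAN)
```

In a pipeline, a non-empty `violations` list would fail the job before `apply` runs. Values only known at apply time do not appear under `after`, so a check like this can miss tags that are computed dynamically.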

CI/CD & GitOps Integration

GitHub Actions, GitLab CI, Jenkins, Spacelift

Platforms to automate the plan, review, and apply lifecycle of IaC changes, triggered by version control events. Spacelift is a specialized IaC-aware CI/CD platform with advanced features like drift detection.
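Whatever platform runs the pipeline, the core decision is usually driven by the exit code of `terraform plan -detailed-exitcode`, which is 0 when there are no changes, 1 on error, and 2 when changes are pending. The sketch below shows one way a pipeline step might act on it; the status strings, the `next_step` policy, and the environment name are assumptions for illustration.

```python
import subprocess


def classify_plan_exit(code: int) -> str:
    """Map the exit code of `terraform plan -detailed-exitcode` to a status."""
    if code == 0:
        return "no-changes"
    if code == 2:
        return "changes-pending"
    return "error"


def run_plan(workdir: str) -> str:
    """Run the plan non-interactively in `workdir` (requires terraform on PATH)."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
    )
    return classify_plan_exit(result.returncode)


def next_step(status: str, environment: str) -> str:
    """Decide what the pipeline does after the plan stage."""
    if status == "error":
        return "fail"
    if status == "no-changes":
        return "skip"
    # Pending changes: production waits for a human, other environments proceed.
    return "await-approval" if environment == "production" else "apply"
```

Running the same plan on a schedule against an unchanged branch doubles as simple drift detection: a `changes-pending` result with no new commits means the real infrastructure no longer matches the code.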

Interview Questions

Answer Strategy

The interviewer is testing system design thinking and understanding of the full IaC lifecycle. Structure the answer around: 1) Analysis & Standardization (inventory needs, create golden modules), 2) Automation & Self-Service (build a portal or API), 3) Governance & Cost Control (implement tagging, budgets, policies). Sample Answer: 'First, I'd conduct an audit to identify the most common infrastructure patterns. Then, I'd build versioned, secure Terraform modules for these patterns, integrating them into a CI/CD pipeline with approval gates. To enable self-service, I'd develop a simple interface (perhaps a CLI or internal web form) that triggers the pipeline with predefined parameters, ensuring every deployment is tagged for cost allocation and compliant by default.'

Answer Strategy

This tests practical experience and decision-making. Focus on technical and team factors. Sample Answer: 'For a project requiring complex conditional logic for environment-specific configurations and integration with a custom API, we chose Pulumi (Python). The key factors were: 1) The team's strong Python proficiency, reducing the learning curve, 2) The need for native loops and conditionals, which are more cumbersome in HCL, and 3) The ability to use standard Python error handling and testing frameworks. The outcome was faster development of complex modules and easier onboarding for our data science team who could read the code.'