Skill Guide

Infrastructure-as-Code for reproducible AI environments (Terraform, Pulumi)

The practice of defining and managing AI infrastructure (compute clusters, storage, networking) through declarative or imperative code to guarantee identical, version-controlled environments for model training, experimentation, and deployment.

It reduces the 'it works on my machine' problem in AI/ML work by limiting environment drift, which in turn cuts debugging time and onboarding friction. This shortens the ML lifecycle, makes cloud spend more predictable because resources are declared and torn down in code, and supports reliable, auditable, and scalable AI operations.

1 Careers

1 Categories

8.7 Avg Demand

25% Avg AI Risk

How to Learn Infrastructure-as-Code for reproducible AI environments (Terraform, Pulumi)

1. Core IaC concepts: Understand state management, declarative vs. imperative paradigms, and resource graphs.
2. Terraform fundamentals: Master HCL syntax, providers (especially AWS, GCP, Azure), and the plan-apply-destroy workflow for basic resources (VPC, compute instance).
3. Git-based workflow: Enforce a strict branching strategy (e.g., trunk-based) for IaC code, treating it as production software.
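The idea of a resource graph can be illustrated without any IaC tool. The following is a minimal sketch in plain Python, not Terraform's actual implementation: the resource names and their dependencies are hypothetical, and the standard-library `graphlib` module stands in for the graph walk that an IaC engine performs when it decides what to create first and what to destroy first.

```python
from graphlib import TopologicalSorter

# Hypothetical resources; each key maps to the set of resources it depends on.
dependencies = {
    "vpc": set(),
    "subnet": {"vpc"},
    "security_group": {"vpc"},
    "gpu_instance": {"subnet", "security_group"},
    "data_volume_attachment": {"gpu_instance"},
}


def creation_order(graph):
    """Return an order in which every resource follows its dependencies."""
    return list(TopologicalSorter(graph).static_order())


def destroy_order(graph):
    """Tear-down runs in reverse so dependents are removed first."""
    return list(reversed(creation_order(graph)))


order = creation_order(dependencies)
print("create:", order)
print("destroy:", destroy_order(dependencies))
```

Because `subnet` and `security_group` do not depend on each other, their relative order is not fixed, which is also why an IaC engine is free to create such resources in parallel.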

1. Modularization: Design reusable Terraform modules/Pulumi components for common AI patterns (e.g., 'gpu-instance-cluster', 'managed-notebook-environment').
2. State management at scale: Implement remote state backends (S3, GCS, Terraform Cloud) with state locking and access controls.
3. CI/CD integration: Build pipelines (GitHub Actions, GitLab CI) for automated `terraform plan` on pull requests and `apply` on merge, with policy checks (e.g., Sentinel).
Pitfalls to avoid: coupling IaC code with application logic, and hardcoding secrets in the IaC code.
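One way to make the pull-request `terraform plan` step actionable is to inspect the machine-readable plan. The sketch below assumes the pipeline has already produced JSON with `terraform show -json <planfile>` and reads only the `resource_changes[].address` and `resource_changes[].change.actions` fields of that output. The sample plan and the function names are illustrative, not part of any tool.

```python
import json

# Illustrative excerpt of `terraform show -json` output; addresses are hypothetical.
SAMPLE_PLAN_JSON = """
{
  "resource_changes": [
    {"address": "aws_instance.gpu", "change": {"actions": ["delete", "create"]}},
    {"address": "aws_ebs_volume.data", "change": {"actions": ["no-op"]}},
    {"address": "aws_security_group.ssh", "change": {"actions": ["update"]}},
    {"address": "aws_s3_bucket.artifacts", "change": {"actions": ["create"]}}
  ]
}
"""


def summarize_plan(plan):
    """Count planned changes by kind; a delete plus create pair is a replacement."""
    counts = {"create": 0, "update": 0, "delete": 0, "replace": 0}
    for resource in plan.get("resource_changes", []):
        actions = set(resource["change"]["actions"])
        if {"create", "delete"} <= actions:
            counts["replace"] += 1
        elif "create" in actions:
            counts["create"] += 1
        elif "update" in actions:
            counts["update"] += 1
        elif "delete" in actions:
            counts["delete"] += 1
        # "no-op" and "read" actions are not counted
    return counts


def destructive_addresses(plan):
    """List resources the plan would delete or replace."""
    return [
        resource["address"]
        for resource in plan.get("resource_changes", [])
        if "delete" in resource["change"]["actions"]
    ]


plan = json.loads(SAMPLE_PLAN_JSON)
summary = summarize_plan(plan)
blocked = destructive_addresses(plan)
print(summary)
print("needs manual approval:", blocked)
```

A pipeline could post `summary` as a pull-request comment and require an explicit approval whenever `blocked` is non-empty, since replacing a GPU instance discards anything stored on its local disks.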

1. Multi-cloud & hybrid orchestration: Design and manage complex, cross-cloud (e.g., AWS for training, GCP for serving) or hybrid (cloud + on-prem GPU cluster) topologies with consistent abstractions.
2. Policy-as-Code & compliance: Integrate Open Policy Agent (OPA) or HashiCorp Sentinel to enforce security, cost, and tagging policies automatically.
3. Platform engineering: Architect an internal developer platform (IDP) where data scientists self-serve approved environments via a UI or API that triggers your IaC backends, abstracting away all cloud complexity.
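To show what a tagging policy checks, here is a plain-Python stand-in. It is not Rego or Sentinel, which is what OPA and HashiCorp's tooling actually evaluate. It assumes the same `terraform show -json` plan format, reading `change.after.tags` for each resource; the required tag names follow the project/cost-center convention used elsewhere in this guide, and the sample resources are hypothetical.

```python
REQUIRED_TAGS = ("project", "cost-center")

# Hypothetical plan excerpt; "after" is null for resources being deleted.
sample_plan = {
    "resource_changes": [
        {"address": "aws_instance.gpu",
         "change": {"actions": ["create"], "after": {"tags": {"project": "vision"}}}},
        {"address": "aws_db_instance.mlflow",
         "change": {"actions": ["create"],
                    "after": {"tags": {"project": "vision", "cost-center": "ml-42"}}}},
        {"address": "aws_ebs_volume.old",
         "change": {"actions": ["delete"], "after": None}},
        {"address": "aws_s3_bucket.artifacts",
         "change": {"actions": ["create"], "after": {"tags": None}}},
    ]
}


def missing_tags(plan, required=REQUIRED_TAGS):
    """Map each planned resource address to the required tags it lacks."""
    violations = {}
    for resource in plan.get("resource_changes", []):
        after = resource["change"].get("after")
        if after is None:  # resource is being deleted, nothing to tag
            continue
        tags = after.get("tags") or {}
        missing = sorted(set(required) - set(tags))
        if missing:
            # Note: resource types that do not support tags would need an allow-list.
            violations[resource["address"]] = missing
    return violations


violations = missing_tags(sample_plan)
for address, tags in violations.items():
    print(f"{address}: missing {', '.join(tags)}")
```

In a real pipeline the same rule would normally live in a policy engine so that it is versioned and enforced independently of the code it checks.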

Practice Projects

Beginner

Project

Reproducible Jupyter Lab on a Cloud GPU Instance

Scenario

A data scientist needs a specific Ubuntu 22.04 environment with CUDA 12.1, PyTorch 2.0, and a mounted 500GB EBS volume for data, identical every time it's provisioned.

How to Execute

1. Write a Terraform configuration for an AWS EC2 p3.2xlarge (NVIDIA V100) instance, a security group allowing SSH (key-based auth only), and an EBS volume attached to the instance.
2. Use a `user_data` script to install pinned versions of the NVIDIA driver and CUDA toolkit, install Docker, and launch a Jupyter container.
3. Parameterize variables (instance_type, volume_size, key_name) in a `variables.tf` file.
4. Execute `terraform init`, `terraform plan`, and `terraform apply` to provision. Destroy with `terraform destroy` after use.
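The variables declared in `variables.tf` need values at plan time, and Terraform automatically loads a file named `terraform.tfvars.json` from the working directory. The sketch below shows one possible way to generate that file from validated inputs. The allow-list, the key name `ds-team-key`, and the helper names are assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical allow-list; a real one would reflect your team's approved sizes.
ALLOWED_INSTANCE_TYPES = {"p3.2xlarge"}


def build_tfvars(instance_type, volume_size, key_name):
    """Validate inputs and return the values for the three declared variables."""
    if instance_type not in ALLOWED_INSTANCE_TYPES:
        raise ValueError(f"instance type not allowed: {instance_type}")
    if type(volume_size) is not int or volume_size <= 0:
        raise ValueError("volume_size must be a positive integer (GiB)")
    if not key_name:
        raise ValueError("key_name must not be empty")
    return {
        "instance_type": instance_type,
        "volume_size": volume_size,
        "key_name": key_name,
    }


def write_tfvars(directory, values):
    """Write terraform.tfvars.json with sorted keys so diffs stay stable."""
    path = Path(directory) / "terraform.tfvars.json"
    path.write_text(json.dumps(values, indent=2, sort_keys=True) + "\n")
    return path


values = build_tfvars("p3.2xlarge", 500, "ds-team-key")
with tempfile.TemporaryDirectory() as workdir:
    tfvars_path = write_tfvars(workdir, values)
    written = json.loads(tfvars_path.read_text())
    print(tfvars_path.name, written)
```

Committing the generated file (it contains no secrets here) keeps the exact inputs of each environment under version control alongside the configuration.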

Intermediate

Project

Pulumi-Driven MLflow Tracking Server with Managed Database

Scenario

An ML team requires a production-grade MLflow server with a PostgreSQL backend, accessible only within a private VPC, and all resources must be tagged with project/cost-center.

How to Execute

1. In Python (using Pulumi AWS SDK), define a VPC with private subnets.
2. Provision a PostgreSQL RDS instance in a private subnet, storing credentials in AWS Secrets Manager.
3. Deploy an MLflow server on an ECS Fargate service within the same VPC, configured to use the RDS backend and pull secrets.
4. Implement a Pulumi component resource that encapsulates this stack, allowing reuse across projects via `pulumi up`.
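One detail of step 3 that is easy to get wrong is turning the stored credentials into the database URI that MLflow receives as its backend store. The sketch below assumes the secret is a JSON string with `username`, `password`, `host`, `port`, and `dbname` keys; that layout, the host name, and the function name are assumptions for illustration. It uses only the standard library and does not call AWS.

```python
import json
from urllib.parse import quote


def backend_store_uri(secret_string):
    """Build a PostgreSQL URI from a JSON secret, escaping special characters."""
    secret = json.loads(secret_string)
    user = quote(secret["username"], safe="")
    password = quote(secret["password"], safe="")
    return (
        f"postgresql://{user}:{password}"
        f"@{secret['host']}:{secret['port']}/{secret['dbname']}"
    )


# Hypothetical secret payload; never commit real credentials.
example_secret = json.dumps({
    "username": "mlflow",
    "password": "p@ss/word",
    "host": "db.internal.example",
    "port": 5432,
    "dbname": "mlflow",
})

uri = backend_store_uri(example_secret)
print(uri)
```

Percent-encoding matters because generated passwords often contain characters such as `@` or `/` that would otherwise be parsed as part of the host or path. The resolved URI should be injected into the container at runtime rather than written into the IaC code or its state.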

Advanced

Project

Multi-Environment, Self-Service AI Platform with Guardrails

Scenario

Your organization needs a system where ML engineers can spin up isolated, GPU-enabled Kubernetes clusters (dev/staging/prod) via a Git commit, while automatically enforcing cost limits, security baselines, and audit trails.

How to Execute

1. Architect Terraform modules for a baseline GKE/AKS cluster with node pools (GPU & CPU), integrated logging, and network policies.
2. Expose these modules through an automation service (e.g., a Python/Go webhook) that reacts to a GitHub repository dispatch event or a GitLab pipeline trigger containing environment specs. Pulumi's Automation API fits this role when the cluster definitions live in Pulumi programs; modules kept in Terraform need a runner that invokes the Terraform CLI or an equivalent pipeline.
3. Integrate OPA policies into the CI/CD pipeline to validate the requested environment against organizational rules (e.g., max GPU count, required tags).
4. Provide a front-end portal or CLI that triggers the dispatch, abstracting the IaC complexity for the end-user data scientist.
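The portal or CLI in the last step can reject bad requests before anything reaches the pipeline. The following sketch validates a hypothetical environment spec and builds the JSON body for a GitHub repository dispatch, which takes an `event_type` and a `client_payload`. The limits, field names, and event name are assumptions, and the sketch deliberately stops short of sending the request.

```python
MAX_GPU_COUNT = 8  # hypothetical organizational limit
REQUIRED_TAGS = ("project", "cost-center")
ALLOWED_ENVIRONMENTS = ("dev", "staging", "prod")


def validate_request(spec):
    """Return a list of human-readable violations; empty means the spec passes."""
    violations = []
    if spec.get("environment") not in ALLOWED_ENVIRONMENTS:
        violations.append("environment must be one of dev, staging, prod")
    if spec.get("gpu_count", 0) > MAX_GPU_COUNT:
        violations.append(f"gpu_count exceeds limit of {MAX_GPU_COUNT}")
    for tag in REQUIRED_TAGS:
        if tag not in spec.get("tags", {}):
            violations.append(f"missing required tag: {tag}")
    return violations


def dispatch_payload(spec):
    """Build the repository dispatch body, refusing specs that fail validation."""
    violations = validate_request(spec)
    if violations:
        raise ValueError("; ".join(violations))
    return {"event_type": "provision-environment", "client_payload": spec}


good_spec = {
    "environment": "dev",
    "gpu_count": 4,
    "tags": {"project": "vision", "cost-center": "ml-42"},
}
bad_spec = {"environment": "dev", "gpu_count": 16, "tags": {"project": "vision"}}

payload = dispatch_payload(good_spec)
problems = validate_request(bad_spec)
print(payload["event_type"], problems)
```

Checks in the client are a convenience only; the authoritative enforcement still belongs in the pipeline's policy step, because a request can bypass the portal.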

Tools & Frameworks

Core IaC Platforms

Terraform (HashiCorp), Pulumi, AWS CloudFormation, Google Cloud Deployment Manager

Terraform is the most widely adopted tool for multi-cloud declarative IaC and uses HCL. Pulumi allows defining infrastructure in general-purpose languages (Python, Go, TypeScript), enabling complex logic and code reuse. CloudFormation and Deployment Manager are native options that integrate tightly with their own cloud but lock you into it. Choose Terraform for a broad multi-cloud strategy; choose Pulumi when infrastructure logic benefits from loops, conditionals, and abstractions in a general-purpose language, or from integration with existing application codebases.

AI/ML Environment Tooling

Docker, NVIDIA Container Toolkit, Kubernetes (K8s), MLflow, Weights & Biases

Docker is fundamental for creating reproducible application and environment images. The NVIDIA Container Toolkit exposes host GPUs to containers. K8s (often consumed as a managed service such as EKS or GKE) orchestrates containerized workloads at scale. MLflow and W&B are platforms that themselves require IaC for their backends (tracking servers, artifact stores) and are often integrated into the environments you provision.

Security, Policy & State Management

Terraform Cloud / Terraform Enterprise, HashiCorp Vault, Open Policy Agent (OPA), AWS Secrets Manager / GCP Secret Manager

Terraform Cloud (renamed HCP Terraform in 2024) and Terraform Enterprise provide remote state, collaboration, and governance. Vault is well suited to dynamic secrets management (e.g., short-lived database credentials). OPA is a widely used, vendor-neutral engine for writing and enforcing fine-grained policies across your IaC pipeline. Use cloud-native secret managers for storing credentials accessed by your provisioned environments.

Interview Questions

Answer Strategy

This question tests for systematic debugging and proactive IaC design.
1) Acknowledge that the issue is environment drift.
2) Immediate fix: Check the IaC code for the instance's `user_data` or provisioners for version pins (CUDA, drivers, OS packages). Compare it to the last applied state.
3) Root cause: Identify whether the scientist installed packages manually (so the running instance no longer matches the code) or whether an unpinned external dependency changed.
4) Long-term prevention: Refactor the IaC to use immutable machine images (Packer-built AMIs) or containerize the entire training environment, managed by the IaC, ensuring the environment is always rebuilt from code, not mutated.
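The comparison in step 2 can be made concrete with a small diff between the versions pinned in code and the versions found on the running instance. This is a minimal sketch in plain Python: the package names and version strings are hypothetical, and collecting the observed versions from the instance is left out.

```python
def detect_drift(pinned, observed):
    """Return {name: (pinned_version, observed_version)} for every mismatch.

    A value of None means the package is absent on that side.
    """
    drift = {}
    for name in sorted(set(pinned) | set(observed)):
        want = pinned.get(name)
        have = observed.get(name)
        if want != have:
            drift[name] = (want, have)
    return drift


# Hypothetical values: what the IaC code pins vs. what the instance reports.
pinned = {"cuda": "12.1", "torch": "2.0.1", "docker": "24.0.7"}
observed = {"cuda": "12.1", "torch": "2.1.0", "docker": "24.0.7", "extra-package": "1.0"}

drift = detect_drift(pinned, observed)
for name, (want, have) in drift.items():
    print(f"{name}: pinned={want} observed={have}")
```

An upgraded version points to a manual change or an unpinned dependency, while a package that is present only on the instance points to a manual install; both are arguments for rebuilding from an image rather than patching in place.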

Answer Strategy

This question tests for migration planning, risk management, and stakeholder alignment. The answer should outline a phased approach with these priorities:
1) **Discovery & State Capture (Weeks 1-2):** Audit existing infrastructure, document all configurations and dependencies, and create a 'baseline' Terraform state via `terraform import`.
2) **Value Delivery & Quick Win (Weeks 3-6):** Target the component that is most painful to reproduce by hand first (e.g., the experiment tracking server) and codify it. Deliver a clear win: a one-click, repeatable deployment.
3) **Foundation & Governance (Weeks 7-12):** Establish the CI/CD pipeline for IaC, implement basic policy-as-code (tagging, instance size limits), and train the first batch of ML engineers on the self-service workflow.
Do not attempt to boil the ocean; show incremental value.
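For the state-capture phase, the audit usually ends in an inventory that maps each existing resource to the address it will have in code. The sketch below turns such an inventory into `terraform import ADDRESS ID` commands for review. The addresses and IDs are hypothetical, and matching resource blocks must already exist in the configuration before the commands are run.

```python
import shlex

# Hypothetical audit result: Terraform address -> existing cloud resource ID.
inventory = [
    {"address": "aws_instance.tracking_server", "id": "i-0123456789abcdef0"},
    {"address": "aws_db_instance.mlflow", "id": "mlflow-db"},
    {"address": "aws_s3_bucket.artifacts", "id": "example-ml-artifacts"},
]


def import_commands(resources):
    """Render one shell-safe `terraform import` command per resource."""
    return [
        shlex.join(["terraform", "import", item["address"], item["id"]])
        for item in resources
    ]


commands = import_commands(inventory)
for command in commands:
    print(command)
```

Printing the commands instead of executing them keeps a human in the loop, which suits a first migration where a wrong address would attach the wrong resource to state.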