Skill Guide

Infrastructure as Code (Terraform, Pulumi) for ML platform resources

The practice of using code (declarative HCL in Terraform, general-purpose languages in Pulumi) to provision, configure, and manage the cloud infrastructure (compute, storage, networking) required to train, deploy, and serve machine learning models at scale.

This skill enables reproducible, scalable, and auditable ML environments, directly accelerating time-to-production and reducing operational risk. It shifts infrastructure from a manual, error-prone cost center to a version-controlled, automated enabler of MLOps and model velocity.

Careers: 1 | Categories: 1 | Avg Demand: 9.2 | Avg AI Risk: 15%

How to Learn Infrastructure as Code (Terraform, Pulumi) for ML platform resources

1. Core IaC Concepts: Understand state management, providers, resources, and modules in Terraform or Pulumi's programmatic model (e.g., Python/TypeScript).
2. Cloud Provider Fundamentals: Master the AWS (SageMaker, EKS, S3), GCP (Vertex AI, GKE, Cloud Storage), or Azure (Azure ML, AKS, Blob) resource hierarchy and IAM.
3. Basic ML Infra: Start with provisioning a single-purpose compute instance (e.g., an EC2 GPU instance) and a storage bucket for data.
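To make the role of state concrete, here is a toy sketch of how a plan can be derived by comparing desired configuration against recorded state. It is not how Terraform or Pulumi are implemented; the function name `plan` and the resource addresses are hypothetical and only illustrate the idea.

```python
from typing import Any

Resources = dict[str, dict[str, Any]]


def plan(desired: Resources, state: Resources) -> dict[str, str]:
    """Return one action per resource address: create, update, delete, or no-op."""
    actions: dict[str, str] = {}
    for address, config in desired.items():
        if address not in state:
            actions[address] = "create"   # declared in code, absent from state
        elif state[address] != config:
            actions[address] = "update"   # declared, but attributes differ
        else:
            actions[address] = "no-op"    # code and state already agree
    for address in state:
        if address not in desired:
            actions[address] = "delete"   # tracked in state, removed from code
    return actions


# Hypothetical addresses and attributes, for illustration only.
desired = {
    "aws_s3_bucket.data": {"bucket": "ml-data"},
    "aws_instance.gpu": {"instance_type": "g4dn.xlarge"},
}
state = {
    "aws_instance.gpu": {"instance_type": "g4dn.2xlarge"},
    "aws_instance.old": {"instance_type": "p3.2xlarge"},
}
result = plan(desired, state)
```

The real tools also refresh state from the provider and order changes by dependency, which this sketch leaves out.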

1. Composable Modules: Build reusable Terraform modules or Pulumi component resources for recurring ML patterns (e.g., a 'training-cluster' module that includes an Auto Scaling Group, security groups, and an S3 mount).
2. CI/CD Integration: Implement a pipeline (GitHub Actions, GitLab CI) that runs `terraform plan` on PR and `terraform apply` on merge to main.
3. Avoid Common Pitfalls: Never hardcode secrets; use a secrets manager (AWS Secrets Manager, HashiCorp Vault). Understand state locking and remote state backends (S3, GCS) to prevent corruption.
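The plan-on-PR, apply-on-merge rule can be sketched as a small dispatch function that a CI job might call. This is a minimal sketch: the event names `pull_request` and `push` and the helper names are assumptions, and the commands are only built here, not executed. The `TF_VAR_<name>` environment variable convention is how Terraform reads input variables from the environment, which keeps secrets out of committed files.

```python
def terraform_steps(event: str, branch: str) -> list[list[str]]:
    """Choose the Terraform commands for a CI event (hypothetical event names)."""
    init = ["terraform", "init", "-input=false"]
    if event == "pull_request":
        # Review-only: show the proposed changes, never modify infrastructure.
        return [init, ["terraform", "plan", "-input=false", "-lock-timeout=60s"]]
    if event == "push" and branch == "main":
        # Merged to main: apply without an interactive prompt.
        return [init, ["terraform", "apply", "-input=false", "-auto-approve"]]
    return []  # any other event does nothing


def tf_var_env(secrets: dict[str, str]) -> dict[str, str]:
    """Expose secrets fetched from a secrets manager as TF_VAR_<name> variables."""
    return {f"TF_VAR_{name}": value for name, value in secrets.items()}


pr_steps = terraform_steps("pull_request", "feature/gpu-pool")
merge_steps = terraform_steps("push", "main")
env = tf_var_env({"db_password": "value-from-secrets-manager"})
```

A stricter pipeline would save the reviewed plan with `-out` and apply exactly that file rather than re-planning on merge.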

1. Multi-Environment Strategy: Design a promotion workflow (dev -> staging -> prod) using workspaces or directory separation, with policy-as-code (Sentinel, OPA) to enforce compliance (e.g., no public buckets, mandatory encryption).
2. Platform Engineering: Architect a self-service platform where data scientists can request resources (e.g., a JupyterHub instance, a serving endpoint) via a simple API or CLI, backed by IaC templates you maintain.
3. Cost & Drift Management: Implement automated cost estimation in CI, schedule non-prod resource teardown, and establish drift detection workflows.
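One common way to build a drift detection workflow is a scheduled job that runs `terraform plan -detailed-exitcode` against an unchanged main branch and inspects the exit code: 0 means no changes, 1 means an error, and 2 means the plan contains changes. The sketch below assumes that approach; the function names and status labels are hypothetical.

```python
import subprocess

# Exit codes documented for `terraform plan -detailed-exitcode`.
PLAN_STATUS = {0: "in-sync", 1: "error", 2: "drift"}


def classify_plan_exit(code: int) -> str:
    """Map a plan exit code to a status; unknown codes are treated as errors."""
    return PLAN_STATUS.get(code, "error")


def detect_drift(workdir: str) -> str:
    """Run a read-only plan in workdir and report whether changes are pending."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-lock=false"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit(result.returncode)
```

Exit code 2 only indicates drift when the checked-out code matches what was last applied; otherwise it may simply reflect unapplied commits.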

Practice Projects

Beginner

Project

Provision a Single ML Training Environment

Scenario

You need a reproducible environment for a data scientist to fine-tune a model on a single GPU instance, with access to a private S3 data bucket.

How to Execute

1. Write Terraform code for: an EC2 instance with a Deep Learning AMI, an IAM instance profile granting read-only S3 access, a security group allowing SSH from your IP only.
2. Define outputs for the public IP and instance ID.
3. Use `terraform init`, `plan`, and `apply`.
4. Connect via SSH, download data, and run a training job.
5. Destroy the stack with `terraform destroy` when done.
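As a starting point for step 1 and step 2, the sketch below generates a Terraform configuration in JSON syntax (a `main.tf.json` file) from Python. It is a partial, untested-against-AWS sketch: the AMI ID is a placeholder, the resource names and the `g4dn.xlarge` default are assumptions, and the IAM instance profile and egress rules are deliberately omitted for brevity.

```python
import json


def training_stack(ami_id: str, ssh_cidr: str, instance_type: str = "g4dn.xlarge") -> dict:
    """Build a Terraform JSON-syntax document for one GPU instance with SSH access."""
    sg_ref = "${aws_security_group.trainer_ssh.id}"  # Terraform interpolation
    return {
        "resource": {
            "aws_security_group": {
                "trainer_ssh": {"name": "trainer-ssh", "description": "SSH from one address"}
            },
            "aws_security_group_rule": {
                "ssh_in": {
                    "type": "ingress",
                    "from_port": 22,
                    "to_port": 22,
                    "protocol": "tcp",
                    "cidr_blocks": [ssh_cidr],  # your own IP as a /32
                    "security_group_id": sg_ref,
                }
            },
            "aws_instance": {
                "trainer": {
                    "ami": ami_id,  # look up the Deep Learning AMI for your region
                    "instance_type": instance_type,
                    "vpc_security_group_ids": [sg_ref],
                }
            },
        },
        "output": {
            "public_ip": {"value": "${aws_instance.trainer.public_ip}"},
            "instance_id": {"value": "${aws_instance.trainer.id}"},
        },
    }


stack = training_stack("ami-PLACEHOLDER", "203.0.113.7/32")
rendered = json.dumps(stack, indent=2)  # write this string to main.tf.json
```

Writing the same resources directly in HCL is the more usual choice; the JSON form is used here only so the example stays in one language.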

Intermediate

Project

Build an Auto-Scaling Inference Service

Scenario

Deploy a containerized ML model behind a load balancer that scales based on CPU utilization, with zero-downtime deployments.

How to Execute

1. Use Terraform to provision: an EKS (AWS) or GKE (GCP) cluster, a managed node group with GPU nodes, and a Kubernetes namespace.
2. Write Terraform code to deploy a Kubernetes Deployment and Horizontal Pod Autoscaler (HPA) via the `kubernetes` provider.
3. Use a Helm release or a raw YAML manifest for the model serving container (e.g., TensorFlow Serving).
4. Set up a CI/CD pipeline that builds a new container image, updates the Kubernetes manifest, and applies the change via `terraform apply`.
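When choosing the HPA target and replica bounds, it helps to know how the autoscaler computes its answer. The Kubernetes documentation gives the core rule as desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue), skipped when the ratio is within a tolerance (0.1 by default). The sketch below applies that rule; it ignores stabilization windows and pod readiness handling, and the function name is hypothetical.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_cpu: float,
    target_cpu: float,
    min_replicas: int,
    max_replicas: int,
    tolerance: float = 0.1,
) -> int:
    """Apply the documented HPA scaling rule, then clamp to the configured bounds."""
    ratio = current_cpu / target_cpu
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas  # close enough to target: no scaling
    else:
        desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# 4 pods averaging 90% CPU against a 60% target scale out to 6 pods.
scaled_out = desired_replicas(4, 90, 60, min_replicas=2, max_replicas=10)
```

For GPU-bound model servers, CPU utilization can be a weak signal, so the target may need tuning or replacing with a custom metric.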

Advanced

Project

Design a Multi-Tenant ML Platform with Self-Service

Scenario

Create an internal platform where different teams can provision isolated ML workspaces (JupyterHub), training clusters, and model endpoints with guardrails on cost and security.

How to Execute

1. Architect a Terraform/Pulumi codebase with a shared module library and per-team root modules.
2. Implement a backend service (e.g., in Python) that wraps `terraform` commands, accepting a team ID and desired resource types, then orchestrates the apply in a dedicated workspace.
3. Integrate policy-as-code (OPA) to validate that all requests adhere to organization standards (e.g., `max_cpu_per_team` constraints).
4. Build a simple UI or CLI for teams to trigger provisions and view their state.
5. Set up monitoring (Prometheus) and per-team cost alerts (e.g., AWS Budgets filtered by cost allocation tags).
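Steps 2 and 3 can be sketched as two small functions: one that validates a request and one that builds the Terraform command sequence for the team's workspace. This is a sketch under stated assumptions: the plain-Python check stands in for an OPA policy, the limit value and the `team_id` and `resource_type` variable names are hypothetical, and the commands are returned rather than executed.

```python
import re

TEAM_ID = re.compile(r"^[a-z][a-z0-9-]{1,30}$")
MAX_CPU_PER_TEAM = 64  # hypothetical organization limit


def check_request(team_id: str, requested_cpu: int, cpu_in_use: int) -> list[str]:
    """Return policy violations; an empty list means the request may proceed."""
    violations = []
    if not TEAM_ID.match(team_id):
        violations.append("team_id must be lowercase letters, digits, and hyphens")
    if cpu_in_use + requested_cpu > MAX_CPU_PER_TEAM:
        violations.append(f"max_cpu_per_team exceeded ({MAX_CPU_PER_TEAM})")
    return violations


def build_commands(team_id: str, resource_type: str) -> list[list[str]]:
    """Build argv lists (no shell) for a plan and apply in the team's workspace."""
    return [
        ["terraform", "workspace", "select", team_id],
        ["terraform", "plan", "-input=false", "-out=tfplan",
         "-var", f"team_id={team_id}", "-var", f"resource_type={resource_type}"],
        ["terraform", "apply", "-input=false", "tfplan"],
    ]


ok = check_request("vision-team", requested_cpu=16, cpu_in_use=32)
rejected = check_request("vision-team", requested_cpu=48, cpu_in_use=32)
commands = build_commands("vision-team", "jupyterhub")
```

Passing argv lists to `subprocess.run` instead of a shell string, together with the team ID pattern, keeps user input from being interpreted as shell syntax.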

Tools & Frameworks

Software & Platforms

Terraform, Pulumi, AWS CloudFormation, Google Cloud Deployment Manager

Terraform is the most widely adopted option, with a declarative HCL syntax and a vast provider ecosystem. Pulumi lets you use general-purpose programming languages (Python, TypeScript) for more complex logic. CloudFormation and Deployment Manager are native alternatives, but each is tied to a single cloud.

MLOps & Orchestration

Kubernetes (EKS/GKE/AKS), SageMaker Pipelines, MLflow, Kubeflow

Kubernetes is the foundational runtime for containerized ML workloads, often provisioned and managed via IaC. SageMaker Pipelines, MLflow, and Kubeflow are higher-level ML workflow orchestrators that themselves require underlying infrastructure (nodes, storage) managed by IaC.

Testing & Security

Terratest, Checkov, HashiCorp Sentinel, Open Policy Agent (OPA)

Terratest (Go) is used for integration testing of Terraform modules. Checkov performs static analysis of IaC for security misconfigurations. Sentinel and OPA are policy-as-code frameworks to enforce governance rules before apply.
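To show what a pre-apply policy check does, the sketch below scans the JSON form of a plan (as produced by `terraform show -json`) for S3 buckets with a public ACL. It is a hand-rolled stand-in, not Checkov, Sentinel, or OPA; it relies only on the plan's `resource_changes` list, and the function name and sample plan are hypothetical.

```python
PUBLIC_ACLS = {"public-read", "public-read-write"}
BUCKET_TYPES = {"aws_s3_bucket", "aws_s3_bucket_acl"}


def public_bucket_violations(plan: dict) -> list[str]:
    """Return addresses of planned S3 resources whose ACL would be public."""
    violations = []
    for change in plan.get("resource_changes", []):
        # "after" is null for deletions, so fall back to an empty dict.
        after = (change.get("change") or {}).get("after") or {}
        if change.get("type") in BUCKET_TYPES and after.get("acl") in PUBLIC_ACLS:
            violations.append(change["address"])
    return violations


# Trimmed, hypothetical plan document for illustration.
sample_plan = {
    "resource_changes": [
        {"address": "aws_s3_bucket_acl.models", "type": "aws_s3_bucket_acl",
         "change": {"actions": ["create"], "after": {"acl": "public-read"}}},
        {"address": "aws_s3_bucket_acl.data", "type": "aws_s3_bucket_acl",
         "change": {"actions": ["create"], "after": {"acl": "private"}}},
        {"address": "aws_instance.old", "type": "aws_instance",
         "change": {"actions": ["delete"], "after": None}},
    ]
}
found = public_bucket_violations(sample_plan)
```

A CI job would fail the build when the returned list is non-empty, before `terraform apply` runs.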

Interview Questions

Answer Strategy

Demonstrate understanding of separation of concerns, module composition, and Kubernetes RBAC. Sample Answer: "I would use a two-layer module approach. The first, a 'cluster' module, provisions the EKS/GKE cluster, node pools, and core monitoring. The second, a 'team-namespace' module, is called once per team to create a Kubernetes namespace, ResourceQuota, LimitRange, and a RoleBinding that grants the team's IAM group edit access only within its namespace. The cluster module is applied from a central CI/CD pipeline, while the team-namespace modules can be applied from a separate pipeline or through a self-service API that triggers the Terraform apply in a dedicated workspace per team."
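The objects a 'team-namespace' module would create can be sketched as plain Kubernetes manifests built in Python. This is an illustration only: the function name, object names, and quota values are hypothetical, the LimitRange is omitted for brevity, and how an IAM group maps to a Kubernetes group name depends on the cluster's authentication setup.

```python
def team_namespace_manifests(team: str, group: str, cpu: str, gpus: str) -> list[dict]:
    """Build Namespace, ResourceQuota, and RoleBinding manifests for one team."""
    rbac = "rbac.authorization.k8s.io"
    return [
        {"apiVersion": "v1", "kind": "Namespace", "metadata": {"name": team}},
        {
            "apiVersion": "v1",
            "kind": "ResourceQuota",
            "metadata": {"name": "team-quota", "namespace": team},
            "spec": {"hard": {"requests.cpu": cpu, "requests.nvidia.com/gpu": gpus}},
        },
        {
            "apiVersion": f"{rbac}/v1",
            "kind": "RoleBinding",  # namespaced, so access stops at this namespace
            "metadata": {"name": "team-edit", "namespace": team},
            # Binds the built-in "edit" ClusterRole within this namespace only.
            "roleRef": {"apiGroup": rbac, "kind": "ClusterRole", "name": "edit"},
            "subjects": [{"apiGroup": rbac, "kind": "Group", "name": group}],
        },
    ]


manifests = team_namespace_manifests("vision", "vision-engineers", cpu="64", gpus="4")
```

In Terraform, the same three objects would typically be resources from the `kubernetes` provider inside the module, with the team name as an input variable.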

Answer Strategy

Show a debugging methodology, knowledge of IAM, and proactive prevention measures. Sample Answer: 'First, I would inspect the Terraform state (`terraform state show`) to verify the exact IAM policy attached to the instance role and the bucket's resource policy. I would check any `aws_iam_policy_document` data sources for explicit denies or missing `s3:PutObject` permissions. To prevent recurrence, I would enhance our Terraform modules to include a default least-privilege policy for ML workloads, integrate a pre-apply policy scanner like Checkov to catch overly permissive policies, and add an integration test using Terratest that attempts a write operation as part of the module validation.'
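The first debugging step, checking a policy for an explicit deny or a missing `s3:PutObject` allow, can be sketched as a simplified evaluator over an IAM policy document. This is a sketch, not AWS's evaluation logic: it handles only `Effect`, `Action`, and `Resource` with wildcards, and ignores `Condition`, `NotAction`, principals, and permission boundaries. The function names and bucket ARN are hypothetical.

```python
from fnmatch import fnmatchcase


def _as_list(value):
    """IAM allows a single string or a list for Action and Resource."""
    return value if isinstance(value, list) else [value]


def action_decision(policy: dict, action: str, resource: str) -> str:
    """Return 'explicit-deny', 'allow', or 'implicit-deny' for one action."""
    decision = "implicit-deny"  # nothing matched yet
    for stmt in _as_list(policy.get("Statement", [])):
        actions = _as_list(stmt.get("Action", []))
        resources = _as_list(stmt.get("Resource", []))
        # Action names are case-insensitive; resource ARNs are not.
        if not any(fnmatchcase(action.lower(), a.lower()) for a in actions):
            continue
        if not any(fnmatchcase(resource, r) for r in resources):
            continue
        if stmt.get("Effect") == "Deny":
            return "explicit-deny"  # an explicit deny always wins
        if stmt.get("Effect") == "Allow":
            decision = "allow"
    return decision


read_only = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": ["s3:GetObject", "s3:ListBucket"],
         "Resource": ["arn:aws:s3:::ml-data", "arn:aws:s3:::ml-data/*"]},
    ],
}
write_check = action_decision(read_only, "s3:PutObject", "arn:aws:s3:::ml-data/out.bin")
```

Here the write fails because no statement allows `s3:PutObject`, which is the missing-permission case rather than an explicit deny.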