Skill Guide

Infrastructure-as-code security review for ML serving (Terraform, Kubernetes manifests for model endpoints)

The systematic audit and analysis of Infrastructure-as-Code (IaC) templates, primarily Terraform configurations and Kubernetes manifests, used to provision and configure ML model serving infrastructure, with a focus on identifying security misconfigurations, excessive privileges, and attack surface exposure.

This skill is critical for preventing ML platform compromises that can lead to model theft, data exfiltration, or adversarial attacks. It directly reduces organizational risk and operational overhead by shifting security left, ensuring production ML systems are secure-by-design and compliant before deployment.

1 Careers

1 Categories

9.1 Avg Demand

18% Avg AI Risk

How to Learn Infrastructure-as-code security review for ML serving (Terraform, Kubernetes manifests for model endpoints)

1. Core IaC & K8s Fundamentals: Understand Terraform resource blocks, provider configurations, and basic Kubernetes objects (Deployment, Service, Ingress).
2. Security Principles: Learn the principle of least privilege, network segmentation, and the CIS Benchmarks for Kubernetes and for your cloud provider.
3. Tooling Basics: Get hands-on with validation tools such as `terraform validate` for Terraform configurations and `kubeval` for Kubernetes manifests. These check syntax and schema validity, not security, so treat them as a first gate before dedicated security scanners.

1. Deep Dive into ML-Specific Resources: Review Terraform modules for services like AWS SageMaker Endpoints, Google Vertex AI Endpoints, or Azure ML Managed Online Endpoints. Audit K8s manifests for custom model servers (e.g., TensorFlow Serving, Triton), including resource requests/limits, security contexts, and Pod Security Standards enforcement (the older PodSecurityPolicy API was removed in Kubernetes 1.25).
2. Common Pitfalls: Identify over-permissive IAM roles attached to model endpoints, model artifacts exposed in public storage buckets, and missing network policies that allow pod-to-pod lateral movement.
3. Shift-Left Integration: Integrate security scanners (Checkov, tfsec, Polaris) into CI/CD pipelines for Terraform and Kubernetes.
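
The manifest checks named above (security contexts, resource limits) can be sketched as a small script. This is a minimal illustration, not a replacement for Checkov or Polaris: it assumes the Deployment has already been parsed into a Python dict, and the function name `audit_deployment` and the example container name are hypothetical.

```python
from typing import Any, Dict, List


def audit_deployment(manifest: Dict[str, Any]) -> List[str]:
    """Return findings for one Deployment manifest parsed into a dict."""
    findings: List[str] = []
    pod_spec = manifest.get("spec", {}).get("template", {}).get("spec", {})
    pod_sc = pod_spec.get("securityContext") or {}

    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        sc = container.get("securityContext") or {}

        if sc.get("privileged") is True:
            findings.append(f"{name}: privileged container")

        # runAsNonRoot can be set per container or inherited from the pod.
        if sc.get("runAsNonRoot", pod_sc.get("runAsNonRoot")) is not True:
            findings.append(f"{name}: runAsNonRoot is not enforced")

        # Require an explicit false instead of relying on the default.
        if sc.get("allowPrivilegeEscalation") is not False:
            findings.append(f"{name}: allowPrivilegeEscalation is not false")

        limits = (container.get("resources") or {}).get("limits") or {}
        for resource in ("cpu", "memory"):
            if resource not in limits:
                findings.append(f"{name}: no {resource} limit")

    return findings


# Hypothetical manifest fragment for a model server.
example = {
    "kind": "Deployment",
    "spec": {"template": {"spec": {"containers": [
        {"name": "model-server", "securityContext": {"privileged": True}},
    ]}}},
}

for finding in audit_deployment(example):
    print(finding)
```

A script like this is mainly useful for learning what the scanners look for; in a pipeline, the dedicated tools cover far more rules.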

1. Threat Modeling for ML Infrastructure: Conduct structured threat modeling (e.g., using STRIDE) on the ML serving stack, focusing on data poisoning via model endpoint compromise, model extraction attacks, and denial-of-service via resource exhaustion.
2. Policy-as-Code: Develop and enforce custom security policies using Open Policy Agent (OPA) Gatekeeper or Kyverno for Kubernetes, and Sentinel or custom Terraform policies, to codify organizational security standards.
3. Architecture & Governance: Design secure, multi-tenant ML platform blueprints. Mentor engineering teams on secure deployment patterns and lead incident response drills for ML infrastructure breaches.

Practice Projects

Beginner

Project

Audit a Simple TensorFlow Serving Deployment

Scenario

You are given a set of Terraform and Kubernetes YAML files to deploy a TensorFlow Serving model endpoint on a managed Kubernetes cluster (e.g., GKE, EKS). The setup includes a deployment, service, and a basic ingress.

How to Execute

1. Run `terraform plan` and `kubectl apply --dry-run=client` to preview the resources that would be created.
2. Scan the IaC files for known misconfigurations (e.g., container running as root, missing resource limits): use `tfsec` for the Terraform and a manifest scanner such as Checkov or Polaris for the Kubernetes YAML. `kube-bench` complements these by checking a running cluster against the CIS Kubernetes Benchmark; it does not scan manifest files.
3. Manually review the IAM roles and policies attached to the underlying cloud nodes for overly permissive access (e.g., full S3 access instead of read-only on the model bucket).
4. Document and remediate the top 3 findings, prioritizing those with direct impact on model security.
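
The manual IAM review in step 3 can be supported by a quick wildcard check. The sketch below assumes the policy is available as a standard IAM policy JSON document; the function names and the bucket ARN are hypothetical, and the check only catches the most obvious over-grants (`*` or `service:*` actions, `*` resources).

```python
import json
from typing import Any, List


def as_list(value: Any) -> List[Any]:
    """IAM allows a single value or a list; normalize to a list."""
    return value if isinstance(value, list) else [value]


def overly_broad_statements(policy_json: str) -> List[str]:
    """Flag Allow statements that use wildcard actions or resources."""
    policy = json.loads(policy_json)
    findings: List[str] = []
    for index, statement in enumerate(as_list(policy.get("Statement", []))):
        if statement.get("Effect") != "Allow":
            continue
        for action in as_list(statement.get("Action", [])):
            if action == "*" or action.endswith(":*"):
                findings.append(f"statement {index}: wildcard action {action!r}")
        for resource in as_list(statement.get("Resource", [])):
            if resource == "*":
                findings.append(f"statement {index}: wildcard resource")
    return findings


# Hypothetical node role policy: full S3 access instead of read-only
# access to the model bucket.
node_policy = json.dumps({
    "Version": "2012-10-17",
    "Statement": [{"Effect": "Allow", "Action": "s3:*", "Resource": "*"}],
})

for finding in overly_broad_statements(node_policy):
    print(finding)
```

A least-privilege alternative for this scenario would grant only `s3:GetObject` on the model prefix, which this check passes without findings.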

Intermediate

Project

Secure a Multi-Model Endpoint with Canary Deployment

Scenario

Your team uses a single Kubernetes Deployment and Ingress to serve multiple ML models via a custom Python server. You need to implement a canary deployment strategy while ensuring the new model version has no more privileges than necessary and cannot access the other models' artifacts.

How to Execute

1. Refactor the manifest to use separate Deployments and Services for each model version to limit the blast radius.
2. Implement a Kubernetes `NetworkPolicy` to restrict traffic: only the ingress controller can talk to model pods, and model pods cannot talk to each other.
3. Define separate, least-privilege IAM roles (via IRSA on EKS or Workload Identity on GKE) for each Deployment, scoping each to its specific model artifact storage path.
4. Use a canary mechanism (e.g., Istio, NGINX Ingress canary annotations) and validate that the security context and network policies apply correctly to both primary and canary pods.
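
The `NetworkPolicy` requirement in step 2 can be verified with a small check like the one below. It is a conservative sketch that assumes the policy is parsed into a dict and that the ingress controller lives in a namespace identified by `matchLabels`; the function name and the `ingress-nginx` namespace are hypothetical. Peers that use `matchExpressions` or `ipBlock` are flagged for manual review rather than evaluated.

```python
from typing import Any, Dict, List


def ingress_findings(policy: Dict[str, Any],
                     allowed_ns_labels: Dict[str, str]) -> List[str]:
    """Flag ingress rules that admit traffic from anywhere other than
    the namespace selected by allowed_ns_labels."""
    findings: List[str] = []
    spec = policy.get("spec", {})

    # When policyTypes is omitted, Ingress is included by default.
    if "Ingress" not in spec.get("policyTypes", ["Ingress"]):
        findings.append("policy does not restrict ingress")
        return findings

    for index, rule in enumerate(spec.get("ingress") or []):
        peers = rule.get("from") or []
        if not peers:
            # A rule without 'from' matches every source.
            findings.append(f"ingress rule {index}: all sources allowed")
        for peer in peers:
            ns_labels = (peer.get("namespaceSelector") or {}).get("matchLabels")
            if ns_labels != allowed_ns_labels:
                findings.append(
                    f"ingress rule {index}: peer is not the ingress namespace"
                )
    return findings


# Hypothetical policy: model pods accept traffic only from the
# ingress controller's namespace.
controller_ns = {"kubernetes.io/metadata.name": "ingress-nginx"}
policy = {
    "kind": "NetworkPolicy",
    "spec": {
        "podSelector": {"matchLabels": {"app": "model"}},
        "policyTypes": ["Ingress"],
        "ingress": [{"from": [{"namespaceSelector": {"matchLabels": controller_ns}}]}],
    },
}

print(ingress_findings(policy, controller_ns))
```

A peer that selects pods in the same namespace (a bare `podSelector`) is reported, which matches the goal that model pods cannot talk to each other.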

Advanced

Project

Design and Enforce a Secure ML Platform Template

Scenario

As a platform engineer, you must create a reusable Terraform module and set of OPA policies that any data science team can use to deploy a model endpoint securely. The template must support multiple cloud providers, enforce network segmentation, mandate logging and monitoring, and prevent common ML-specific vulnerabilities.

How to Execute

1. Architect a Terraform module with inputs for model location, compute type, and security level. The module outputs a set of K8s manifests.
2. Ship OPA/Rego policies alongside the module and evaluate them against the JSON form of the plan (produced by `terraform plan -out=<file>` followed by `terraform show -json <file>`), for example with Conftest. The policies enforce rules like: 'All S3 buckets must be private', 'K8s containers must not run as root', 'All endpoints must have an ingress rate limit'.
3. Integrate the module with a CI/CD system that only applies configurations that pass the policy checks.
4. Conduct a tabletop exercise simulating a model compromise, tracing the attack path through your hardened template to test its effectiveness.
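
To illustrate the shape of a plan-time gate, here is a Python stand-in for one of the Rego rules ('All S3 buckets must be private'). It is not OPA: it only shows the logic such a policy would encode. It assumes the plan has been exported as JSON and loaded into a dict, and that the AWS provider's `aws_s3_bucket_acl` and `aws_s3_bucket_public_access_block` resources are in use; the function names are hypothetical, and the rule is an approximation that does not prove every bucket has an access block.

```python
from typing import Any, Dict, List

PUBLIC_ACLS = {"public-read", "public-read-write"}
BLOCK_FLAGS = (
    "block_public_acls",
    "block_public_policy",
    "ignore_public_acls",
    "restrict_public_buckets",
)


def s3_violations(plan: Dict[str, Any]) -> List[str]:
    """Scan the plan's resource_changes for public S3 settings."""
    violations: List[str] = []
    for rc in plan.get("resource_changes", []):
        change = rc.get("change") or {}
        if change.get("actions") == ["delete"]:
            continue  # nothing remains after a delete
        after = change.get("after") or {}
        address = rc.get("address", "<unknown>")

        if rc.get("type") == "aws_s3_bucket_acl" and after.get("acl") in PUBLIC_ACLS:
            violations.append(f"{address}: public ACL {after['acl']!r}")

        if rc.get("type") == "aws_s3_bucket_public_access_block":
            unset = [flag for flag in BLOCK_FLAGS if after.get(flag) is not True]
            if unset:
                violations.append(f"{address}: not enabled: {', '.join(unset)}")
    return violations


def gate(plan: Dict[str, Any]) -> int:
    """Return a process exit code: 0 to allow the apply, 1 to block it."""
    violations = s3_violations(plan)
    for violation in violations:
        print("DENY", violation)
    return 1 if violations else 0
```

In a pipeline, the CI job would load the exported plan JSON with `json.load` and pass the result of `gate` to `sys.exit`, so that the apply step runs only on a zero exit code.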

Tools & Frameworks

Static Analysis & Policy Engines

Checkov, tfsec, Polaris/Kyverno, Open Policy Agent (OPA)

These tools scan IaC and K8s manifests for security misconfigurations and allow you to define custom, enforceable security policies. Use them in CI/CD pipelines to block insecure deployments.

ML Platform & Runtime Tools

AWS SageMaker / Vertex AI / Azure ML, Istio Service Mesh, Model Servers (TensorFlow Serving, Triton, TorchServe)

Deep knowledge of the IaC and security features of managed ML platforms is essential. Service meshes like Istio provide fine-grained traffic control and mTLS for model endpoints.

Security Frameworks & Standards

CIS Benchmarks (Kubernetes, Cloud Providers), STRIDE Threat Modeling, NIST SP 800-53

These provide structured, industry-vetted lists of security controls and threat categories to systematically evaluate your ML infrastructure against.

Interview Questions

Answer Strategy

The answer should demonstrate a structured, layered approach. Start by describing the use of static analysis tools (`tfsec`) to catch low-hanging fruit. Then, move to manual review focusing on IAM: ensuring the SageMaker execution role has minimal permissions (e.g., only `s3:GetObject` on the specific model artifact prefix, no broad admin policies). Next, discuss network configuration: verifying the endpoint is deployed within a VPC with no public IP, and security groups restrict traffic to only the application backend. Finally, mention logging and encryption: ensuring CloudWatch Logs are enabled and encrypted, and that data is protected both at rest and in transit.

Sample Answer: 'First, I'd run tfsec to identify any flagged resources. Manually, I'd scrutinize the IAM policy attached to the SageMaker execution role, ensuring it follows least privilege, for example scoped to a single S3 model bucket. I'd verify the endpoint is VPC-isolated with security groups allowing ingress only from the internal service network. Finally, I'd check that logs and model input/output data are encrypted at rest with KMS and in transit with TLS.'
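
Part of the manual review described above can be backed by a scripted check on the Terraform plan. The sketch below assumes the AWS provider's `aws_sagemaker_model` and `aws_sagemaker_endpoint_configuration` resources and a plan exported as JSON and loaded into a dict; the function name is hypothetical, and the two rules shown (a `vpc_config` block is present, a `kms_key_arn` is set) are examples rather than a complete review.

```python
from typing import Any, Dict, List


def sagemaker_findings(plan: Dict[str, Any]) -> List[str]:
    """Flag SageMaker resources without VPC attachment or a KMS key."""
    findings: List[str] = []
    for rc in plan.get("resource_changes", []):
        change = rc.get("change") or {}
        if change.get("actions") == ["delete"]:
            continue  # resource is being removed
        after = change.get("after") or {}
        # Values only known after apply are listed under after_unknown.
        unknown = change.get("after_unknown") or {}
        address = rc.get("address", "<unknown>")

        if rc.get("type") == "aws_sagemaker_model":
            if not after.get("vpc_config"):
                findings.append(f"{address}: no vpc_config block")

        elif rc.get("type") == "aws_sagemaker_endpoint_configuration":
            key_set = after.get("kms_key_arn") or unknown.get("kms_key_arn")
            if not key_set:
                findings.append(f"{address}: no kms_key_arn")
    return findings
```

Treating a value listed in `after_unknown` as set avoids false positives when the key is created in the same plan. IAM scoping and security group rules still need the manual review the answer describes.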

Answer Strategy

This question tests collaboration, communication, and technical depth. The strategy is to explain the *why* behind the policy, provide a concrete fix, and focus on enabling the team.

Sample Answer: 'I'd schedule a quick call to walk through the report, explaining that the `securityContext: {privileged: true}` flag they used, while convenient for debugging, gives the container full host access, which is a critical risk for a model server. I'd provide a modified manifest showing how to achieve their goal (e.g., accessing a GPU) by requesting the device under `resources.limits` (for example `nvidia.com/gpu`) and running as a non-root user with only the specific `capabilities` that are actually needed. I'd emphasize that our goal is to enable their work securely, not block it.'