Skill Guide

Container and Kubernetes security for model serving environments (pod security policies, network policies)

The practice of hardening the runtime environment for machine learning models deployed on Kubernetes by implementing controls that restrict pod capabilities and enforce micro-segmented, least-privilege network communication.

This skill is critical for protecting high-value intellectual property (model weights) and sensitive training and inference data from lateral movement and container escapes. It directly mitigates operational risk in MLOps pipelines and supports compliance and availability for revenue-critical AI services.

1 Careers

1 Categories

9.1 Avg Demand

15% Avg AI Risk

How to Learn Container and Kubernetes security for model serving environments (pod security policies, network policies)

1. Core Kubernetes Security Primitives: Understand the shift from PodSecurityPolicy (PSP, deprecated in Kubernetes 1.21 and removed in 1.25) to Pod Security Admission (PSA) and its three policy levels (privileged, baseline, restricted).
2. Container Runtime Fundamentals: Grasp Linux namespaces, cgroups, and seccomp profiles to understand the isolation boundaries you are securing.
3. Default Network Policies: Learn to write a basic NetworkPolicy that denies all ingress and egress by default.
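As a minimal sketch of items 1 and 3, the Python snippet below builds a Namespace carrying the Pod Security Admission labels and a default-deny NetworkPolicy, then prints both as JSON that can be piped to 'kubectl apply -f -'. The namespace name 'model-serving' and the helper function names are illustrative assumptions, not part of any library.

```python
import json


def psa_namespace(name, level="restricted"):
    """Namespace manifest whose labels turn on Pod Security Admission."""
    return {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {
            "name": name,
            "labels": {
                # enforce rejects violating pods; warn and audit only report.
                "pod-security.kubernetes.io/enforce": level,
                "pod-security.kubernetes.io/warn": level,
                "pod-security.kubernetes.io/audit": level,
            },
        },
    }


def default_deny(namespace):
    """NetworkPolicy that selects every pod and allows no traffic."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        "spec": {
            "podSelector": {},  # empty selector = all pods in the namespace
            # Listing both types with no rules denies both directions.
            "policyTypes": ["Ingress", "Egress"],
        },
    }


def as_list(*items):
    """Wrap manifests in a v1 List so they can be applied in one call."""
    return {"apiVersion": "v1", "kind": "List", "items": list(items)}


if __name__ == "__main__":
    ns = "model-serving"  # hypothetical namespace name
    print(json.dumps(as_list(psa_namespace(ns), default_deny(ns)), indent=2))
```

Note that denying all egress also blocks DNS lookups, so a real deployment usually needs an additional policy that allows traffic to the cluster DNS service. The policy only takes effect when the cluster's CNI plugin enforces NetworkPolicy.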

1. Implementing a Security Baseline: Deploy a model serving stack (e.g., KServe, formerly KFServing, or Seldon Core) into a namespace enforcing the 'restricted' PSA policy.
2. Debugging Breakages: Identify and fix permission errors when a model container fails to start under strict policy (e.g., a missing NET_BIND_SERVICE capability for binding ports below 1024, or an image that expects to run as root).
3. Advanced Network Segmentation: Create explicit NetworkPolicies that allow traffic only from a specific model gateway or controller to a model-serving pod, blocking all other namespaces.
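For the debugging step, it can help to check a pod spec locally before the API server rejects it. The sketch below covers only a subset of the 'restricted' profile (non-root, privilege escalation, capabilities, seccomp) and is not a replacement for the real admission check; the function name and the sample specs are hypothetical.

```python
def restricted_violations(pod_spec):
    """List problems against a SUBSET of the 'restricted' PSA profile."""
    problems = []
    pod_sc = pod_spec.get("securityContext", {})
    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        sc = container.get("securityContext", {})

        # A container-level value overrides the pod-level one.
        if sc.get("runAsNonRoot", pod_sc.get("runAsNonRoot")) is not True:
            problems.append(f"{name}: runAsNonRoot must be true")

        if sc.get("allowPrivilegeEscalation") is not False:
            problems.append(f"{name}: allowPrivilegeEscalation must be false")

        caps = sc.get("capabilities", {})
        if "ALL" not in caps.get("drop", []):
            problems.append(f"{name}: capabilities.drop must include ALL")
        # Manifests name capabilities without the CAP_ prefix.
        extra = sorted(set(caps.get("add", [])) - {"NET_BIND_SERVICE"})
        if extra:
            problems.append(f"{name}: disallowed added capabilities {extra}")

        profile = sc.get("seccompProfile") or pod_sc.get("seccompProfile") or {}
        if profile.get("type") not in ("RuntimeDefault", "Localhost"):
            problems.append(f"{name}: seccompProfile.type must be RuntimeDefault or Localhost")
    return problems


# Hypothetical specs: one with no hardening, one hardened.
ROOT_SPEC = {"containers": [{"name": "model", "image": "example/model:1"}]}
HARDENED_SPEC = {
    "securityContext": {
        "runAsNonRoot": True,
        "seccompProfile": {"type": "RuntimeDefault"},
    },
    "containers": [{
        "name": "model",
        "image": "example/model:1",
        "securityContext": {
            "allowPrivilegeEscalation": False,
            "capabilities": {"drop": ["ALL"]},
        },
    }],
}

if __name__ == "__main__":
    for problem in restricted_violations(ROOT_SPEC):
        print(problem)
    print(restricted_violations(HARDENED_SPEC))
```

The real profile also restricts volume types, host namespaces, and a root 'runAsUser', so a spec that passes this check can still be rejected by the cluster.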

1. Policy-as-Code & Drift Management: Use tools like Kyverno or OPA/Gatekeeper to write custom admission webhook policies that, for example, block any model container image not signed by Cosign.
2. Runtime Threat Detection: Integrate Falco to monitor for suspicious syscalls (e.g., writing to /proc) inside a model-serving pod.
3. Architectural-Level Zero Trust: Design a service mesh (Istio, Linkerd) configuration with strict mTLS and authorization policies for model A/B testing and canary deployments.
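For the service mesh item, the following sketch assumes Istio and builds two resources: a PeerAuthentication that requires mTLS for a namespace, and an AuthorizationPolicy that lets only one service account call the model pods. The namespace, labels, and service account names are hypothetical, and the 'security.istio.io/v1beta1' API version should be checked against what your mesh actually serves.

```python
import json


def strict_mtls(namespace):
    """Require mutual TLS for every workload in the namespace."""
    return {
        "apiVersion": "security.istio.io/v1beta1",
        "kind": "PeerAuthentication",
        "metadata": {"name": "default", "namespace": namespace},
        "spec": {"mtls": {"mode": "STRICT"}},
    }


def allow_only_gateway(namespace, model_labels, gateway_ns, gateway_sa):
    """Allow POST requests to the model pods from one service account only."""
    # Istio identifies a caller by its SPIFFE-style principal.
    principal = f"cluster.local/ns/{gateway_ns}/sa/{gateway_sa}"
    return {
        "apiVersion": "security.istio.io/v1beta1",
        "kind": "AuthorizationPolicy",
        "metadata": {"name": "model-callers", "namespace": namespace},
        "spec": {
            "selector": {"matchLabels": model_labels},
            "action": "ALLOW",
            "rules": [{
                "from": [{"source": {"principals": [principal]}}],
                "to": [{"operation": {"methods": ["POST"]}}],
            }],
        },
    }


if __name__ == "__main__":
    # All names below are illustrative assumptions.
    manifests = [
        strict_mtls("model-serving"),
        allow_only_gateway(
            "model-serving",
            {"app": "fraud-model", "track": "canary"},
            gateway_ns="model-gateway",
            gateway_sa="gateway",
        ),
    ]
    print(json.dumps({"apiVersion": "v1", "kind": "List", "items": manifests}, indent=2))
```

Once an ALLOW policy selects a workload, requests that match none of its rules are denied, so the canary pods in this sketch accept traffic only from the gateway identity.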

Practice Projects

Beginner

Project

Harden a Basic TensorFlow Serving Deployment

Scenario

You have a single TensorFlow Serving container running as root in the default namespace. Your task is to secure it for a staging environment.

How to Execute

1. Create a new namespace 'model-staging' and label it with 'pod-security.kubernetes.io/enforce: restricted'.
2. Modify the Deployment YAML: set 'securityContext.runAsNonRoot: true' and 'allowPrivilegeEscalation: false', drop all capabilities, add a read-only root filesystem, and define a seccomp profile (for example 'RuntimeDefault').
3. Deploy a NetworkPolicy in that namespace that only allows ingress traffic on port 8501 from the 'ingress-controller' namespace.
4. Test the deployment and fix any startup issues caused by the strict policies.
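Steps 2 and 3 could look roughly like the sketch below, which emits a hardened Deployment and an ingress NetworkPolicy as JSON. It assumes the public 'tensorflow/serving' image, an arbitrary non-root UID of 1000, and a writable emptyDir at /tmp to compensate for the read-only root filesystem; model storage is left out, and whether the image runs cleanly under these settings is something step 4 has to confirm.

```python
import json

APP_LABELS = {"app": "tf-serving"}  # hypothetical label


def hardened_deployment(namespace="model-staging", image="tensorflow/serving"):
    """Deployment whose pod template targets the 'restricted' profile."""
    container = {
        "name": "tf-serving",
        "image": image,
        "ports": [{"containerPort": 8501}],  # REST port from the scenario
        "securityContext": {
            "allowPrivilegeEscalation": False,
            "readOnlyRootFilesystem": True,
            "capabilities": {"drop": ["ALL"]},
        },
        # Scratch space, since the root filesystem is read-only.
        "volumeMounts": [{"name": "tmp", "mountPath": "/tmp"}],
    }
    pod_spec = {
        "securityContext": {
            "runAsNonRoot": True,
            "runAsUser": 1000,  # assumed UID; any non-zero UID the image supports
            "seccompProfile": {"type": "RuntimeDefault"},
        },
        "containers": [container],
        "volumes": [{"name": "tmp", "emptyDir": {}}],
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": "tf-serving", "namespace": namespace},
        "spec": {
            "replicas": 1,
            "selector": {"matchLabels": APP_LABELS},
            "template": {"metadata": {"labels": APP_LABELS}, "spec": pod_spec},
        },
    }


def ingress_from_controller(namespace="model-staging"):
    """Allow TCP 8501 only from pods in the 'ingress-controller' namespace."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "allow-ingress-controller", "namespace": namespace},
        "spec": {
            "podSelector": {"matchLabels": APP_LABELS},
            "policyTypes": ["Ingress"],
            "ingress": [{
                "from": [{"namespaceSelector": {"matchLabels": {
                    # Label that Kubernetes sets on every namespace.
                    "kubernetes.io/metadata.name": "ingress-controller",
                }}}],
                "ports": [{"protocol": "TCP", "port": 8501}],
            }],
        },
    }


if __name__ == "__main__":
    items = [hardened_deployment(), ingress_from_controller()]
    print(json.dumps({"apiVersion": "v1", "kind": "List", "items": items}, indent=2))
```

If the pods never appear after applying this, the events on the Deployment's ReplicaSet usually name the field that the restricted policy rejected.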

Intermediate

Project

Implement a Multi-Tenant Model Serving Platform with Network Isolation

Scenario

Your platform team must host models for two different product teams, 'Team-A' and 'Team-B', on the same cluster, ensuring neither can access the other's model endpoints or underlying storage.

How to Execute

1. Create dedicated namespaces 'team-a-servings' and 'team-b-servings', each with the 'restricted' PSA label.
2. For each namespace, deploy a default-deny NetworkPolicy.
3. Create specific ingress NetworkPolicies for each model service, allowing traffic only from the shared ingress gateway pod (by pod label).
4. Define RBAC roles for each team, granting them permissions only within their own namespace.
5. Use a service mesh or explicit egress policies to control access to external object storage (e.g., S3 buckets) for each team.
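Steps 3 and 4 can be sketched per tenant as below. The gateway namespace, the gateway pod label, the model label, the port, and the group naming scheme are all hypothetical; the point of the example is that selecting a pod in another namespace needs a namespaceSelector and a podSelector in the same 'from' entry, and that a namespaced Role plus RoleBinding keeps each team inside its own namespace.

```python
import json

GATEWAY_NAMESPACE = "ingress-gateway"           # hypothetical
GATEWAY_POD_LABELS = {"app": "shared-gateway"}  # hypothetical


def gateway_only_ingress(namespace, model_labels, port):
    """Allow ingress to the model pods from the shared gateway pods only."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "allow-shared-gateway", "namespace": namespace},
        "spec": {
            "podSelector": {"matchLabels": model_labels},
            "policyTypes": ["Ingress"],
            "ingress": [{
                # One entry holding both selectors means "gateway pods AND
                # gateway namespace". Two separate entries would mean OR.
                "from": [{
                    "namespaceSelector": {"matchLabels": {
                        "kubernetes.io/metadata.name": GATEWAY_NAMESPACE}},
                    "podSelector": {"matchLabels": GATEWAY_POD_LABELS},
                }],
                "ports": [{"protocol": "TCP", "port": port}],
            }],
        },
    }


def team_role(namespace):
    """Role limited to workload resources in a single namespace."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": "model-deployer", "namespace": namespace},
        "rules": [{
            "apiGroups": ["", "apps"],
            "resources": ["pods", "services", "configmaps", "deployments"],
            "verbs": ["get", "list", "watch", "create", "update", "patch", "delete"],
        }],
    }


def team_binding(namespace, group):
    """Bind the Role to the team's group, in that namespace only."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {"name": "model-deployer", "namespace": namespace},
        "subjects": [{"kind": "Group", "name": group,
                      "apiGroup": "rbac.authorization.k8s.io"}],
        "roleRef": {"apiGroup": "rbac.authorization.k8s.io",
                    "kind": "Role", "name": "model-deployer"},
    }


def tenant_manifests(team):
    """All isolation manifests for one team, e.g. team='team-a'."""
    namespace = f"{team}-servings"
    return [
        gateway_only_ingress(namespace, {"role": "model-server"}, 8080),
        team_role(namespace),
        team_binding(namespace, f"{team}-devs"),
    ]


if __name__ == "__main__":
    items = tenant_manifests("team-a") + tenant_manifests("team-b")
    print(json.dumps({"apiVersion": "v1", "kind": "List", "items": items}, indent=2))
```

Because the Role and RoleBinding are namespaced, neither team receives any permission in the other team's namespace unless a separate binding grants it.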

Advanced

Project

Audit and Enforce Supply Chain Security for Model Containers

Scenario

Your security audit reveals that developers are deploying unverified container images containing model code and weights from public registries. You must enforce a policy that only allows images signed by your internal CI system.

How to Execute

1. Deploy OPA/Gatekeeper with the 'k8sallowedrepos' constraint template to block images from untrusted registries.
2. Configure a Cosign-based admission webhook (or use Kyverno's 'verifyImages' rule) to require a signature that verifies against your company's Cosign public key.
3. Update your CI/CD pipeline to sign images upon build.
4. Write a Gatekeeper 'ConstraintTemplate' that inspects container resources to ensure model pods declare CPU and memory requests and limits (preventing noisy-neighbor issues).
5. Test by attempting to deploy an unsigned image and verifying the webhook blocks it.
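A rough sketch of steps 1 and 2 follows: a Gatekeeper constraint of kind K8sAllowedRepos and a Kyverno ClusterPolicy with a 'verifyImages' rule, both emitted as JSON. The registry prefix and the public key are placeholders, the constraint assumes the 'k8sallowedrepos' template from the Gatekeeper library is already installed, and the Kyverno field names (in particular 'validationFailureAction') follow its published examples but vary between releases, so validate them against your installed version.

```python
import json

TRUSTED_REGISTRY = "registry.example.com/ml/"  # placeholder prefix
# Placeholder only; substitute the real Cosign public key from CI.
COSIGN_PUBLIC_KEY = (
    "-----BEGIN PUBLIC KEY-----\n<your key here>\n-----END PUBLIC KEY-----"
)


def allowed_repos_constraint(repos):
    """Gatekeeper constraint: pod images must start with one of `repos`."""
    return {
        "apiVersion": "constraints.gatekeeper.sh/v1beta1",
        "kind": "K8sAllowedRepos",
        "metadata": {"name": "model-images-from-trusted-registry"},
        "spec": {
            "match": {"kinds": [{"apiGroups": [""], "kinds": ["Pod"]}]},
            "parameters": {"repos": list(repos)},
        },
    }


def signed_images_policy(registry, public_key_pem):
    """Kyverno policy: images under `registry` need a valid Cosign signature."""
    return {
        "apiVersion": "kyverno.io/v1",
        "kind": "ClusterPolicy",
        "metadata": {"name": "require-signed-model-images"},
        "spec": {
            # Assumed field; newer releases may expect a per-rule setting.
            "validationFailureAction": "Enforce",
            "rules": [{
                "name": "verify-cosign-signature",
                "match": {"any": [{"resources": {"kinds": ["Pod"]}}]},
                "verifyImages": [{
                    "imageReferences": [f"{registry}*"],
                    "attestors": [{"entries": [
                        {"keys": {"publicKeys": public_key_pem}},
                    ]}],
                }],
            }],
        },
    }


if __name__ == "__main__":
    items = [
        allowed_repos_constraint([TRUSTED_REGISTRY]),
        signed_images_policy(TRUSTED_REGISTRY, COSIGN_PUBLIC_KEY),
    ]
    print(json.dumps({"apiVersion": "v1", "kind": "List", "items": items}, indent=2))
```

Keeping the trailing slash in the registry prefix matters for prefix-based matching, because 'registry.example.com/ml' without it would also match a repository such as 'registry.example.com/ml-untrusted'.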

Tools & Frameworks

Policy & Admission Control

Kubernetes Pod Security Admission (PSA), OPA/Gatekeeper, Kyverno

PSA is the built-in, label-based baseline for pod security. Gatekeeper and Kyverno are advanced policy engines for writing custom, context-aware rules (e.g., image signing, label enforcement) as Kubernetes admission webhooks.

Network Security & Observability

CNI Plugins with NetworkPolicy support (Calico, Cilium), Service Mesh (Istio, Linkerd), Falco (Runtime Threat Detection)

Calico/Cilium provide the underlying engine to enforce NetworkPolicies. Service Meshes add mTLS and fine-grained L7 authorization. Falco monitors system calls for real-time anomaly detection inside containers.

Supply Chain & Runtime

Cosign/Sigstore (Image Signing), Seccomp (System Call Filtering), Trivy/Grype (Vulnerability Scanning)

Cosign signs images and verifies those signatures, which establishes image provenance. Seccomp profiles restrict the syscalls a container can make. Scanners find CVEs in base images and dependencies before deployment.

Interview Questions

Answer Strategy

The interviewer is testing hands-on debugging skills and knowledge of the policy's components. Focus on a systematic, checklist-based approach. Sample answer: 'I would first check the events on the Deployment's ReplicaSet, because a pod rejected at admission is never created; the event message typically names the missing or disallowed field. I'd then inspect the Deployment YAML against the restricted policy checklist: runAsNonRoot is true, allowPrivilegeEscalation is false, all capabilities are dropped, and seccompProfile is RuntimeDefault or Localhost. A common fix is adding an explicit securityContext and updating the Dockerfile to run as a non-root user.'

Answer Strategy

This tests risk assessment and the ability to apply principles of least privilege and isolation. Sample answer: 'I would not grant this in a production namespace. Instead, I'd create a dedicated namespace labeled with the 'privileged' PSA policy for this specific workload. I'd then apply aggressive network segmentation: a NetworkPolicy that completely blocks all ingress/egress to/from this namespace except for a specific, monitored management jump-box. I would also schedule a review date to retire this exception. This contains the blast radius while meeting the immediate need.'