Skill Guide

Container and Kubernetes security for ML model serving (image scanning, network policies, Pod Security Standards)

The application of Kubernetes-native security primitives (specifically image vulnerability scanning, microsegmentation via network policies, and workload hardening with Pod Security Standards) to protect the deployment and runtime of ML models from supply chain threats, lateral movement, and privilege escalation.

This skill is critical because ML model serving often handles sensitive data and runs on expensive compute, which makes it a high-value target. Securing the serving stack helps prevent data breaches, model theft, and service disruption, protecting revenue and intellectual property. It also lets organizations deploy models at production scale with confidence, meet compliance requirements (such as GDPR and HIPAA), and reduce operational risk.

1 Careers

1 Categories

9.2 Avg Demand

15% Avg AI Risk

How to Learn Container and Kubernetes security for ML model serving (image scanning, network policies, Pod Security Standards)

Focus 1: Understand the Kubernetes security model: control plane components (API server, etcd), node components (kubelet, container runtime), and the pod lifecycle.

Focus 2: Master the core security primitives: Pod Security Standards (Privileged, Baseline, Restricted) and their implications.

Focus 3: Learn the basics of container image security: what a CVE is, how to read a vulnerability report, and the difference between base image vulnerabilities and application layer vulnerabilities.

Move from theory to practice by integrating security into the ML deployment pipeline. Scenario: You have a TensorFlow Serving image. You need to scan it for vulnerabilities, enforce that only your approved serving images run in the `ml-serving` namespace, and block all unnecessary network traffic to and from the serving pods. Common Mistake: Applying a blanket 'Restricted' policy without adjusting the `securityContext` of your specific model server image (e.g., setting `runAsNonRoot: true` and `allowPrivilegeEscalation: false`, and dropping ALL capabilities), which causes pods to be rejected or fail to start. Avoid this by iteratively testing policies in a staging namespace.
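One way to test iteratively is to let Pod Security Admission report 'Restricted' violations without blocking pods yet. The sketch below assumes a hypothetical staging namespace named `ml-serving-staging`; the label keys are the standard Pod Security Admission labels, while the choice of levels is only an illustration.

```yaml
# Hypothetical staging namespace; the name is an assumption for illustration.
apiVersion: v1
kind: Namespace
metadata:
  name: ml-serving-staging
  labels:
    # Reject only pods that violate the looser Baseline level.
    pod-security.kubernetes.io/enforce: baseline
    # Report Restricted violations as client warnings and audit annotations
    # so they can be fixed before enforcement is tightened.
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted
```

Once the model server deploys without warnings, the `enforce` label can be raised to `restricted`.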

Mastery involves designing and automating a comprehensive security posture for a multi-tenant ML platform. Focus on creating policy-as-code guardrails (e.g., using OPA/Gatekeeper or Kyverno) that automatically enforce image scanning requirements, network segmentation rules, and PSS levels across all clusters. Architect for zero-trust principles: every pod-to-pod flow must be explicitly allowed via Calico or Cilium policies, and model endpoints should be exposed only via a service mesh (like Istio) with strict mTLS and authorization policies. Mentor teams on threat modeling for ML systems (e.g., adversarial inputs exploiting a model server vulnerability).
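As a rough sketch of the "explicitly allowed" principle, a Calico GlobalNetworkPolicy can deny traffic by default while still permitting DNS. This assumes Calico is the cluster's CNI and that DNS pods carry the common `k8s-app: kube-dns` label; the policy name and the list of excluded system namespaces are assumptions to adapt.

```yaml
# Default-deny for all non-system namespaces (Calico-specific resource).
apiVersion: projectcalico.org/v3
kind: GlobalNetworkPolicy
metadata:
  name: default-deny   # hypothetical name
spec:
  # Skip system namespaces so cluster components keep working.
  namespaceSelector: 'has(projectcalico.org/name) && projectcalico.org/name not in {"kube-system", "calico-system", "calico-apiserver"}'
  types:
    - Ingress
    - Egress
  # No ingress rules: all ingress to selected pods is denied.
  # The only egress allowed is DNS, so service names still resolve.
  egress:
    - action: Allow
      protocol: UDP
      destination:
        selector: 'k8s-app == "kube-dns"'
        ports:
          - 53
    - action: Allow
      protocol: TCP
      destination:
        selector: 'k8s-app == "kube-dns"'
        ports:
          - 53
```

Per-team allow rules are then layered on top as namespaced policies, so every permitted flow is visible in version control.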

Practice Projects

Beginner

Project

Harden a Single Model Serving Pod

Scenario

You have a pre-built container image for a scikit-learn model serving API built with Flask. Deploy it to a local Kind or Minikube cluster and apply the 'Restricted' Pod Security Standard.

How to Execute

1. Pull the image and scan it locally with Trivy: `trivy image your-flask-sklearn-image:latest`. Document the high/critical vulnerabilities.
2. Create a deployment YAML. Under `spec.template.spec.securityContext`, set `runAsNonRoot: true`, `runAsUser: 1000`, and `seccompProfile.type: RuntimeDefault`. Under the container's `securityContext`, set `allowPrivilegeEscalation: false` and `capabilities.drop: ['ALL']`.
3. Apply the deployment and use `kubectl describe pod` and `kubectl logs` to debug and fix any permission errors (e.g., if the app needs write access to a specific directory, mount an `emptyDir` volume there).
4. Verify the pod runs successfully with the hardened context.
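The steps above could translate into a manifest along these lines. This is a minimal sketch: the resource names, the port (Flask's default 5000), and the `/tmp` mount path are assumptions, and the right writable directory depends on what your app actually writes to.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sklearn-serving            # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sklearn-serving
  template:
    metadata:
      labels:
        app: sklearn-serving
    spec:
      # Pod-level settings required by the Restricted standard.
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        seccompProfile:
          type: RuntimeDefault
      containers:
        - name: model-server
          image: your-flask-sklearn-image:latest
          # Use the locally loaded image on Kind/Minikube instead of pulling.
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 5000  # assumption: Flask default port
          # Container-level settings required by the Restricted standard.
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
          volumeMounts:
            - name: scratch
              mountPath: /tmp      # assumption: the app writes temp files here
      volumes:
        # Writable scratch space that does not require root-owned paths.
        - name: scratch
          emptyDir: {}
```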

Intermediate

Project

Implement Image Scanning & Network Segmentation for an ML Pipeline

Scenario

Your team deploys models via a CI/CD pipeline (e.g., GitHub Actions) into a GKE cluster. You need to block deployments of images with critical CVEs and ensure the model serving pods can only receive traffic from the upstream API gateway and only talk to a specific Redis cache, not the internet or other services.

How to Execute

1. **Pipeline Integration:** Add a step in your GitHub Actions workflow to build the model image, then scan it with `trivy` or a cloud-native scanner (e.g., the vulnerability scanning built into Google Artifact Registry). Fail the pipeline if any CRITICAL or HIGH CVEs are found.
2. **Admission Control:** Deploy a policy agent like OPA/Gatekeeper with a constraint template that denies Pod creation when the image lacks evidence of a recent scan (for example, when it does not come from the trusted registry that only accepts scanned images).
3. **NetworkPolicy Creation:** Write a NetworkPolicy YAML for the `model-serving` namespace: define an `ingress` rule allowing traffic only from the namespace/pod selector of your API gateway on the specific port (e.g., 8501 for TF Serving). Define an `egress` rule allowing traffic only to the Redis pods on port 6379; NetworkPolicy selects pods or CIDR ranges rather than Service IPs, and a restrictive egress rule also blocks DNS unless you allow it. Apply it with `kubectl apply -f`.
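The NetworkPolicy from step 3 might look like the sketch below. All label values (`app: tf-serving`, `app: api-gateway`, `app: redis`) and the `api-gateway` namespace name are assumptions, as is Redis running in the same namespace as the model server; the cluster also needs network policy enforcement enabled for the policy to take effect.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: model-serving-restrict     # hypothetical name
  namespace: model-serving
spec:
  podSelector:
    matchLabels:
      app: tf-serving              # assumption: label on the serving pods
  policyTypes:
    - Ingress
    - Egress
  ingress:
    # Only the API gateway pods may reach the REST port.
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: api-gateway
          podSelector:
            matchLabels:
              app: api-gateway
      ports:
        - protocol: TCP
          port: 8501
  egress:
    # Redis pods in the same namespace.
    - to:
        - podSelector:
            matchLabels:
              app: redis
      ports:
        - protocol: TCP
          port: 6379
    # DNS, so the Redis service name can still be resolved.
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```

Because `namespaceSelector` and `podSelector` sit in the same `from` entry, both must match; listing them as separate entries would allow either one, which is a much wider rule.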

Advanced

Project

Build a Secure, Multi-Tenant ML Serving Platform with Policy as Code

Scenario

You are the platform engineer for a company where multiple data science teams deploy models. You must ensure that no team can run an unscanned image, all workloads are isolated (no cross-team network access), and all pods adhere to a strict security baseline, without each team needing to configure this manually.

How to Execute

1. **Define Baselines:** Create a Git repository containing all your security policies as code (Kyverno ClusterPolicies, Calico GlobalNetworkPolicies). Define a 'Restricted' PSS policy as a Kyverno policy that automatically mutates and validates pods.
2. **Enforce Image Provenance:** Use Kyverno with cosign signatures: create a policy that only allows images signed by your build system's key to be deployed. Integrate this with your image registry (e.g., Harbor) and scanning pipeline.
3. **Automate Network Segmentation:** Use Calico's GlobalNetworkPolicy resources with label selectors. For example, create a policy that denies all ingress/egress traffic in the cluster by default. Then, create namespaced policies that are applied via a GitOps tool (like Argo CD) to each team's namespace, explicitly allowing only the required flows (e.g., from their own web frontend to their model server, and from their model server to their data cache).
4. **Audit & Runtime Security:** Deploy a runtime security tool like Falco. Create rules to alert if a model serving container unexpectedly spawns a shell, makes an outbound connection to a crypto-mining pool, or reads sensitive files. Feed alerts into your SIEM.
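For the runtime step, a custom Falco rule for the "unexpected shell" case could be sketched as follows. It assumes Falco's default ruleset is loaded first (it defines the `spawned_process` and `container` macros used here) and that Kubernetes metadata enrichment is enabled; the namespace `team-a-serving` is hypothetical.

```yaml
# Custom rules file loaded after Falco's default rules.
- rule: Shell spawned in model serving container
  desc: Detect a shell process started inside a pod of a model serving namespace.
  condition: >
    spawned_process
    and container
    and proc.name in (bash, sh, zsh, dash)
    and k8s.ns.name = "team-a-serving"
  output: >
    Shell spawned in model serving container
    (user=%user.name pod=%k8s.pod.name container=%container.name
    cmdline=%proc.cmdline)
  priority: WARNING
  tags: [container, shell]
```

Model servers rarely need a shell at runtime, so this rule tends to have few false positives once images are built without debugging tools.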

Tools & Frameworks

Image Scanning & Supply Chain Security

Trivy, Snyk Container, Harbor (with built-in scanning), Cosign / Sigstore

Use Trivy for fast, local, and CI-integrated vulnerability scanning. Snyk Container provides developer-friendly fix advice. Harbor acts as a secure, private registry with integrated scanning and content trust (Cosign). Use Cosign to sign images and Kyverno/OPA to enforce that only signed images are deployed.
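A CI-integrated scan could look like the GitHub Actions workflow below. This is a sketch under assumptions: the workflow name, trigger branch, and image tag are illustrative, and `aquasecurity/trivy-action` should be pinned to a specific release tag or commit SHA rather than `master` in real pipelines.

```yaml
name: build-and-scan               # hypothetical workflow name
on:
  push:
    branches: [main]

jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Build model server image
        run: docker build -t model-server:${{ github.sha }} .

      - name: Scan image with Trivy
        uses: aquasecurity/trivy-action@master   # pin to a release in practice
        with:
          image-ref: model-server:${{ github.sha }}
          format: table
          # A non-zero exit code fails the job when findings match the filter.
          exit-code: "1"
          # Skip vulnerabilities that have no fixed version available yet.
          ignore-unfixed: true
          severity: CRITICAL,HIGH
```

Ignoring unfixed vulnerabilities is a trade-off: it keeps the pipeline actionable, but those findings should still be tracked elsewhere.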

Policy Enforcement & Admission Control

Kyverno, Open Policy Agent (OPA) / Gatekeeper, Pod Security Standards (built-in)

Kyverno and OPA/Gatekeeper are Kubernetes-native policy engines. Use them to define complex, declarative policies (e.g., 'All images must have a scan report', 'All pods must have a specific label'). They work alongside the built-in Pod Security Admission, which is simpler for enforcing the predefined PSS levels (Privileged/Baseline/Restricted).
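As an illustration of a policy that goes beyond the built-in PSS levels, the Kyverno sketch below admits only images carrying a valid Cosign signature. The registry path `registry.example.com/ml/*` and the key are placeholders, and field names such as `validationFailureAction` vary across Kyverno releases, so check the version you run.

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-signed-model-images    # hypothetical name
spec:
  # Block non-compliant pods instead of only reporting them.
  validationFailureAction: Enforce
  webhookTimeoutSeconds: 30
  rules:
    - name: verify-cosign-signature
      match:
        any:
          - resources:
              kinds:
                - Pod
      verifyImages:
        # Placeholder registry path; images outside it are not checked by this rule.
        - imageReferences:
            - "registry.example.com/ml/*"
          attestors:
            - entries:
                - keys:
                    # Replace with the public half of your build system's Cosign key.
                    publicKeys: |-
                      -----BEGIN PUBLIC KEY-----
                      <your cosign public key>
                      -----END PUBLIC KEY-----
```

Since the rule only covers the listed image references, it is usually paired with a second policy that rejects images from any other registry.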

Network Security & Service Mesh

Calico, Cilium, Istio, Linkerd

Calico and Cilium provide rich, high-performance NetworkPolicy implementations and additional features like global network policies and encryption. Istio and Linkerd (service meshes) add a layer of security on top, offering automatic mTLS for encrypted pod-to-pod traffic, fine-grained L7 authorization policies (e.g., allow POST requests to `/v1/models/iris:predict` only from service A), and robust observability.
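With Istio, strict mTLS and the L7 rule from the example above might be expressed as follows. This is a sketch: the `ml-serving` namespace, the `app: model-server` label, and the service account identity for "service A" are assumptions, and newer Istio releases also accept `security.istio.io/v1` for both resources.

```yaml
# Require mTLS for every workload in the namespace.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: ml-serving
spec:
  mtls:
    mode: STRICT
---
# Allow only service A to call the iris predict endpoint with POST.
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: allow-predict-from-service-a   # hypothetical name
  namespace: ml-serving
spec:
  selector:
    matchLabels:
      app: model-server                # assumption: label on the model server pods
  action: ALLOW
  rules:
    - from:
        - source:
            # Identity taken from the caller's mTLS certificate (service account).
            principals: ["cluster.local/ns/service-a/sa/service-a"]
      to:
        - operation:
            methods: ["POST"]
            paths: ["/v1/models/iris:predict"]
```

Once an ALLOW policy selects a workload, requests that match no rule are denied, so other callers and other paths are rejected without a separate deny policy.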

Interview Questions

Answer Strategy

Structure your answer around the 'defense in depth' model, covering build, deploy, and runtime. Sample Answer: 'First, in the build stage, I'd implement a CI pipeline that builds the model server image using a minimal, distroless base image and runs Trivy to fail the build on any critical CVEs. The image is then signed with Cosign. For deployment, I'd configure Kyverno to act as a validating admission webhook, with policies that deny any pod whose image isn't signed by our trusted key or hasn't passed a scan. Finally, for runtime, I'd enforce the 'Restricted' Pod Security Standard via namespace labels and use a NetworkPolicy to restrict the pod's communication to only the approved API gateway and internal monitoring endpoints.'

Answer Strategy

This question tests hands-on debugging experience with security contexts. The core competency is systematic troubleshooting. Sample Answer: 'I'd start by checking the pod events with `kubectl describe pod <pod-name>`, which often shows the exact reason, such as the image trying to run as root or the process lacking permission to write to a directory. Next, I'd inspect the pod's spec: does the container image have a non-root `USER` instruction? If not, I need to set `runAsUser` in the securityContext. If it's a write error, I'd check if the app needs a writable filesystem and provide an `emptyDir` volume, or adjust the `readOnlyRootFilesystem` setting. I'd also check if any required Linux capabilities were dropped. The goal is to iteratively adjust the securityContext to meet the policy without breaking functionality.'