AI Infrastructure Engineer
AI Infrastructure Engineers design, build, and maintain the foundational systems that power machine learning workloads at scale - …
Skill Guide
Container optimization for ML is the systematic engineering of Dockerfiles and container images to minimize size, maximize build speed, and ensure correct execution of GPU-accelerated machine learning workloads through efficient CUDA integration, intelligent layer reuse, and reliable model artifact delivery.
Scenario
You have a pre-trained PyTorch model file (`.pt`) that requires CUDA 11.8 and Python 3.9 to run. Your goal is to create a production-ready Docker image under 2GB that can serve the model via a simple FastAPI endpoint.
Scenario
A team's ML training image is 12GB, takes 45 minutes to build on CI, and often fails due to network timeouts during `pip install`. The image includes CUDA, conda, Jupyter, and many scientific packages. You need to reduce build time and image size without breaking the existing training script.
Scenario
You are tasked with creating a system where a model trained in a SageMaker/VertexAI training job is automatically versioned, packaged into a secure, minimal container image, and deployed to a Kubernetes-based inference cluster (e.g., Seldon Core, KServe) with one command. The container must have zero runtime network dependencies to pull the model.
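One way to approach the "zero runtime network dependencies" requirement is to bake the model into the image together with a small manifest, then verify the file against that manifest when the container starts. The sketch below assumes this layout; the function names, the manifest fields, and the `.manifest.json` suffix are hypothetical choices for illustration, not part of SageMaker, Vertex AI, Seldon Core, or KServe.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large model files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(model_path: Path, version: str) -> Path:
    """Build-time step: record the model's version, size, and digest."""
    manifest = {
        "file": model_path.name,
        "version": version,
        "size_bytes": model_path.stat().st_size,
        "sha256": sha256_of(model_path),
    }
    # Hypothetical naming convention: <model file>.manifest.json next to the model.
    manifest_path = model_path.parent / (model_path.name + ".manifest.json")
    manifest_path.write_text(json.dumps(manifest, indent=2))
    return manifest_path


def verify_model(manifest_path: Path) -> dict:
    """Startup step: confirm the baked-in model matches its manifest."""
    manifest = json.loads(manifest_path.read_text())
    model_path = manifest_path.parent / manifest["file"]
    if sha256_of(model_path) != manifest["sha256"]:
        raise RuntimeError(f"model digest mismatch for {model_path}")
    return manifest
```

Under these assumptions, `write_manifest` would run in the packaging step and `verify_model` in the server's startup code, so a corrupted or mismatched model fails fast without any network call.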
Docker is the core build tool; enable BuildKit to use features such as cache mounts. Podman is useful for rootless, daemonless builds in locked-down environments. The NVIDIA Container Toolkit is required on the host to expose GPUs to containers.
DVC and Weights & Biases (W&B) version large files (models, datasets) outside Git. Use `conda-pack` to archive a conda environment so it can be unpacked on another machine for reproducible, offline deployments. `pip-tools` compiles a fully pinned requirements file with exact dependency versions for deterministic builds.
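Exact pinning can be checked mechanically before a build starts. The following is a minimal sketch, assuming a requirements file in the usual `name==version` style; the helper name `unpinned_requirements` and the sample packages are illustrative and are not part of `pip-tools`.

```python
import re

# Matches "name==version", with optional extras such as "uvicorn[standard]==0.23.2".
_PINNED = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*(\[[^\]]+\])?==[^=\s;]+")


def unpinned_requirements(text: str) -> list[str]:
    """Return requirement lines that are not pinned to an exact version."""
    problems = []
    for raw in text.splitlines():
        # Drop trailing comments, line-continuation backslashes, and whitespace.
        line = raw.split(" #", 1)[0].strip().rstrip("\\").strip()
        if not line or line.startswith("#"):
            continue  # blank line or full-line comment
        if line.startswith("-"):
            continue  # option lines such as --hash=..., -r other.txt
        if not _PINNED.match(line):
            problems.append(line)
    return problems
```

A CI step could call this on the contents of `requirements.txt` and fail the job when the returned list is not empty.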
CI platforms run automated builds. Use `dive` to analyze image layers for bloat. `hadolint` enforces best practices in Dockerfiles. Integrate `trivy` scans into CI to fail builds on critical vulnerabilities.
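As a rough sketch of the CI gate described above, the snippet below reads a JSON report and returns a non-zero exit code when any critical finding is present. It assumes the report layout of `trivy image --format json`, with a top-level `Results` list whose entries carry a `Vulnerabilities` list; the function names `critical_findings` and `gate` are hypothetical.

```python
import json


def critical_findings(report_json: str) -> list[tuple[str, str, str]]:
    """Return (target, package, vulnerability id) for each CRITICAL entry."""
    report = json.loads(report_json)
    findings = []
    # Either key may be missing or null when nothing was found.
    for result in report.get("Results") or []:
        for vuln in result.get("Vulnerabilities") or []:
            if vuln.get("Severity") == "CRITICAL":
                findings.append((
                    result.get("Target", "?"),
                    vuln.get("PkgName", "?"),
                    vuln.get("VulnerabilityID", "?"),
                ))
    return findings


def gate(report_json: str) -> int:
    """Return a process exit code: 1 if any CRITICAL finding exists, else 0."""
    findings = critical_findings(report_json)
    for target, package, vuln_id in findings:
        print(f"CRITICAL {vuln_id} in {package} ({target})")
    return 1 if findings else 0
```

Trivy can also enforce this directly with its `--severity` and `--exit-code` options; a separate script like this is mainly useful when the pipeline needs custom reporting or allow-lists.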
Answer Strategy
The interviewer is testing practical knowledge of production constraints vs. development environments, image layer optimization, and the CUDA stack. The answer must distinguish between runtime and compile-time dependencies. Sample Answer: 'First, I'd switch the base image to a CUDA runtime tag such as `nvidia/cuda:11.8.0-runtime-ubuntu22.04`, which excludes the CUDA compiler and headers. Then, I'd implement a multi-stage build: Stage 1 uses the matching `devel` image to compile any custom ops if needed. Stage 2, the final image, starts from `runtime`, installs only a minimal Python interpreter and the serving dependencies, and copies in just the compiled artifacts. I'd also use a `.dockerignore` to exclude all training data, notebooks, and git history from the build context. Finally, I'd use `dive` to audit each layer and remove any redundant caches left by `apt-get` or `pip`.'
Answer Strategy
This is a behavioral question testing problem-solving and systems thinking. Focus on a structured approach: observe, hypothesize, test, implement. Sample Answer: 'In a CI pipeline for a model serving image, builds suddenly took 30+ minutes instead of 5. I observed that the `COPY requirements.txt .` layer, and the `pip install` after it, was being rebuilt on every run even though the file had not changed. I hypothesized that the cause was the CI environment rather than the Dockerfile: either the runner's checkout was changing file metadata, or the runner was starting without the previous build's layer cache. I tested by building twice locally with no changes; the second build reused the cached layers, which confirmed the Dockerfile was fine and the cache was being lost in CI. The fix was to switch our CI to BuildKit with explicit cache mounts (`--mount=type=cache,target=/root/.cache/pip`) and to make the job environment consistent between runs. Even when the layer is rebuilt, pip now reuses previously downloaded packages, which restored build times.'
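The reasoning in this answer rests on the idea that a cache should be keyed on what a file contains, not on when it was last touched. The sketch below illustrates that idea only; it is not how Docker or BuildKit compute layer cache keys internally, and `content_cache_key` is a hypothetical helper that a CI job might use to name a saved dependency cache.

```python
import hashlib
import os
from pathlib import Path


def content_cache_key(path: Path, prefix: str = "pip") -> str:
    """Derive a cache key from file bytes only, ignoring timestamps."""
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    return f"{prefix}-{digest[:16]}"


def key_survives_touch(path: Path) -> bool:
    """Change the file's modification time and check that the key is unchanged."""
    before = content_cache_key(path)
    stat = path.stat()
    os.utime(path, (stat.st_atime, stat.st_mtime + 3600))  # shift mtime by an hour
    return content_cache_key(path) == before
```

With a key like this, a fresh checkout that resets timestamps still maps to the same cache entry, while any real change to `requirements.txt` produces a new key.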