Learning Roadmap
How to Become an AI Platform Engineer
A step-by-step, phase-based learning path from beginner to job-ready AI Platform Engineer. Estimated completion: 9 months across 6 phases.
Cloud Infrastructure & Containers Fundamentals
6 weeks
Goals
- Master Docker containerization including multi-stage builds and GPU-enabled containers
- Build proficiency in Kubernetes fundamentals: pods, services, deployments, persistent volumes
- Gain hands-on experience with one major cloud provider's compute and networking services (AWS preferred)
- Learn Infrastructure as Code basics with Terraform
Resources
- Kelsey Hightower's 'Kubernetes the Hard Way'
- AWS / GCP / Azure free-tier ML services documentation
- HashiCorp Terraform Associate certification materials
- Docker official documentation and tutorials
Milestone: Deploy a containerized application to a managed Kubernetes cluster provisioned via Terraform
ML Engineering Foundations
6 weeks
Goals
- Understand the ML lifecycle: data preparation, training, evaluation, deployment, monitoring
- Learn Python ML ecosystem basics (scikit-learn, pandas, numpy) at a working-proficiency level
- Familiarize yourself with MLflow or Weights & Biases for experiment tracking
- Understand model serialization formats (ONNX, TorchScript, SafeTensors)
Resources
- Andrew Ng's 'Machine Learning Specialization' (Coursera)
- Made With ML by Goku Mohandas
- MLflow official tutorials
- FastAPI documentation for building model serving endpoints
Milestone: Train a model, track experiments in MLflow, and serve it via a REST API
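The serving half of this milestone can be sketched with the standard library alone. This is a minimal sketch, not a FastAPI app: it swaps in `http.server` so it runs without dependencies, and the "model" is a hypothetical hard-coded linear function standing in for a trained, deserialized model. The `/predict` path, the `features` field, and the weights are all assumptions made for illustration.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical stand-in for a trained model: y = w . x + b.
WEIGHTS = [0.5, -1.0, 2.0]
BIAS = 0.1


def predict(features):
    """Return the linear model's output for one feature vector."""
    if len(features) != len(WEIGHTS):
        raise ValueError(f"expected {len(WEIGHTS)} features, got {len(features)}")
    return sum(w * x for w, x in zip(WEIGHTS, features)) + BIAS


def handle_predict(body):
    """Turn a raw JSON request body into (status_code, response_dict)."""
    try:
        payload = json.loads(body)
        return 200, {"prediction": predict(payload["features"])}
    except (ValueError, KeyError, TypeError) as exc:
        # Malformed JSON, a missing field, or a wrong shape is a client error.
        return 400, {"error": str(exc)}


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        status, response = handle_predict(self.rfile.read(length))
        data = json.dumps(response).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(port=8000):
    """Block and serve predictions on the given port."""
    HTTPServer(("0.0.0.0", port), PredictHandler).serve_forever()
```

Calling `serve()` starts the endpoint. Keeping `handle_predict` separate from the HTTP handler means the request logic can be unit-tested without opening a socket, and the same split carries over when you move to FastAPI.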
MLOps & Model Serving Infrastructure
6 weeks
Goals
- Deploy and operate KServe or Seldon Core on Kubernetes for model inference
- Build CI/CD pipelines for model artifacts using GitHub Actions or GitLab CI
- Learn GPU scheduling in Kubernetes (node selectors, tolerations, device plugins, and NVIDIA Multi-Instance GPU, or MIG)
- Implement model monitoring with Prometheus, Grafana, and custom metrics
Resources
- KServe documentation and examples
- NVIDIA GPU Operator documentation
- Coursera 'Machine Learning Engineering for Production (MLOps) Specialization' by DeepLearning.AI
- Prometheus + Grafana official docs
Milestone: Deploy a multi-model serving platform on Kubernetes with automated CI/CD, GPU scheduling, and observability dashboards
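The GPU scheduling pieces from this phase come together in a single Pod spec. The sketch below builds one as a plain dictionary. It assumes the NVIDIA device plugin is installed (it exposes the `nvidia.com/gpu` resource) and that GPU nodes are tainted with the key `nvidia.com/gpu`; the node label, image, and function names are hypothetical.

```python
import json


def gpu_pod_manifest(name, image, gpus=1, node_labels=None):
    """Build a Pod manifest that requests GPUs and targets tainted GPU nodes."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Node selector: only schedule onto nodes carrying these labels.
            "nodeSelector": dict(node_labels or {}),
            # Toleration: allow scheduling onto nodes tainted for GPU workloads.
            # Assumes the taint key is "nvidia.com/gpu"; adjust to your cluster.
            "tolerations": [
                {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}
            ],
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # Extended resources such as GPUs are requested via limits.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


def render(manifest):
    """Serialize the manifest as JSON, which kubectl accepts alongside YAML."""
    return json.dumps(manifest, indent=2)
```

For example, `render(gpu_pod_manifest("trainer", "example.com/trainer:latest", gpus=2, node_labels={"gpu-pool": "a100"}))` produces a manifest you could pipe to `kubectl apply -f -`.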
LLM Infrastructure & RAG Platforms
8 weeks
Goals
- Deploy and manage vector databases (Qdrant, Weaviate, or pgvector) for RAG workloads
- Operate vLLM or Hugging Face Text Generation Inference (TGI) for efficient LLM inference with quantization and batching
- Build RAG pipelines integrating embedding models, vector stores, and LLM endpoints
- Implement LLMOps practices: prompt management, token cost tracking, guardrails, and evaluation
Resources
- vLLM documentation and benchmarks
- LangChain and LlamaIndex documentation
- Vector database provider documentation (Qdrant, Weaviate, Pinecone)
- Anthropic / OpenAI API documentation and best practices guides
Milestone: Build and operate a production RAG platform with vector search, LLM serving, prompt management, and cost/quality monitoring
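Before wiring up real components, it helps to see the retrieve-then-prompt flow in miniature. The sketch below is a toy, not a production design: word counts stand in for an embedding model, an in-memory list stands in for the vector database, and the assembled prompt is what you would send to an LLM endpoint. All function names are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, documents, k=2):
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(query, documents, k=2):
    """Assemble the retrieved context and the question into one prompt."""
    context = "\n".join(f"- {d}" for d in retrieve(query, documents, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

In a real platform each function becomes a network call (embedding service, vector store query, LLM endpoint), which is where latency and token cost monitoring attach.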
Platform Engineering & Developer Experience
6 weeks
Goals
- Design self-service platform APIs and CLIs that abstract infrastructure complexity
- Implement multi-tenancy patterns with resource quotas, namespace isolation, and billing
- Build internal developer portals (Backstage or custom) for ML platform users
- Master cost optimization strategies for GPU-heavy workloads (spot, reserved, right-sizing)
Resources
- Platform Engineering community resources (platformengineering.org)
- Spotify Backstage documentation
- Cloud provider cost management tools documentation
- Internal Developer Platforms (IDP) architecture patterns
Milestone: Design and document a complete AI platform architecture with self-service workflows, multi-tenancy, and cost governance
Advanced Topics & Job Preparation
4 weeks
Goals
- Study agent orchestration infrastructure (LangGraph, CrewAI, AutoGen) and tool-calling platforms
- Learn advanced networking for ML (RDMA, InfiniBand, high-bandwidth interconnects for distributed training)
- Build a portfolio project demonstrating end-to-end AI platform capabilities
- Prepare for system design interviews focused on AI/ML infrastructure
Resources
- LangGraph and CrewAI documentation
- NVIDIA NCCL and multi-node training documentation
- System design interview resources adapted for ML infrastructure
- Open-source AI platform projects (MLRun, Flyte, Metaflow) for architectural inspiration
Milestone: Confidently design and defend an AI platform architecture in a senior-level system design interview
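System design interviews often ask why interconnect bandwidth matters for distributed training. One back-of-the-envelope tool is the bandwidth term of a ring all-reduce, where each GPU sends roughly 2(N-1)/N times the payload. The sketch below is an estimate only: it ignores latency, protocol overhead, and overlap with compute, and the function name is hypothetical.

```python
def ring_allreduce_seconds(payload_bytes, num_gpus, bandwidth_bytes_per_s):
    """Bandwidth-only estimate of ring all-reduce time for one payload."""
    if num_gpus < 2:
        return 0.0  # nothing to exchange with a single GPU
    # Each GPU sends (and receives) 2 * (N - 1) / N of the payload in total.
    traffic = 2 * (num_gpus - 1) / num_gpus * payload_bytes
    return traffic / bandwidth_bytes_per_s
```

As an illustration with assumed numbers, synchronizing 14 GB of gradients (7 billion parameters at 2 bytes each) across 8 GPUs takes about 1.96 s per step at 12.5 GB/s (100 Gb/s) and about 0.49 s at 50 GB/s (400 Gb/s) under this model.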
Practice Projects
Apply your skills with hands-on projects. Each is labeled with a difficulty level.
ML Model Serving Platform on Kubernetes
Difficulty: Intermediate
Deploy KServe on a local (kind/minikube) or cloud Kubernetes cluster. Configure model serving for a scikit-learn model and a Hugging Face transformer model, implement autoscaling based on request rate, set up basic monitoring with Prometheus and Grafana, and create a CI/CD pipeline that automatically deploys model updates from a Git repository.
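For the autoscaling step, it helps to know the calculation a request-rate autoscaler performs. The sketch below follows the formula documented for the Kubernetes Horizontal Pod Autoscaler, including its default 10% tolerance band. Which autoscaler actually runs depends on how KServe is installed, so treat this as a model for reasoning, not as KServe's implementation. The function and parameter names are hypothetical.

```python
import math


def desired_replicas(current_replicas, current_rps_per_pod, target_rps_per_pod,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Replica count for a request-rate target, HPA-style."""
    ratio = current_rps_per_pod / target_rps_per_pod
    # Within the tolerance band, leave the replica count alone to avoid flapping.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))
```

With 2 replicas each seeing 150 requests per second against a target of 100, the function returns 3.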
RAG Platform with Vector Search and LLM Serving
Difficulty: Advanced
Build a complete RAG infrastructure stack: deploy Qdrant or Weaviate on Kubernetes, set up an embedding model service, implement document ingestion pipelines with chunking strategies, deploy a vLLM instance serving an open-source LLM, and create a unified API gateway that orchestrates retrieval and generation. Include observability for token costs, retrieval quality, and end-to-end latency.
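One common starting point for the chunking step is a fixed-size window with overlap, so that text cut at a boundary still appears whole in a neighboring chunk. The sketch below splits on whitespace for simplicity. A real pipeline would more likely count tokens with the embedding model's tokenizer, and the default sizes here are assumptions.

```python
def chunk_words(text, chunk_size=200, overlap=40):
    """Split text into word windows of chunk_size that overlap by `overlap`."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks
```

Larger overlap improves recall near chunk boundaries at the cost of more vectors to store and search, which is one of the trade-offs worth measuring in the retrieval quality dashboards.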
GPU Cluster Cost Optimization Dashboard
Difficulty: Intermediate
Build a cost monitoring and optimization tool for GPU workloads. Collect metrics on GPU utilization, memory usage, and job duration across namespaces. Implement alerts for idle GPUs, recommendations for right-sizing, and a report comparing spot vs. on-demand costs. Visualize in Grafana and integrate with cloud billing APIs.
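The core logic of this tool is small once the metrics are collected. The sketch below assumes utilization samples have already been aggregated per GPU (for example from a Prometheus query). The `GpuSample` type, the 5% idle threshold, and the hourly prices passed in are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class GpuSample:
    namespace: str
    gpu_id: str
    utilization_pct: float  # average over the sampling window
    hours: float            # GPU-hours covered by the window


def idle_gpus(samples, threshold_pct=5.0):
    """Return samples whose average utilization is below the threshold."""
    return [s for s in samples if s.utilization_pct < threshold_pct]


def spot_savings(samples, on_demand_per_hour, spot_per_hour):
    """Per-namespace cost under each pricing model and the difference."""
    report = {}
    for s in samples:
        row = report.setdefault(s.namespace, {"gpu_hours": 0.0})
        row["gpu_hours"] += s.hours
    for row in report.values():
        row["on_demand"] = row["gpu_hours"] * on_demand_per_hour
        row["spot"] = row["gpu_hours"] * spot_per_hour
        row["savings"] = row["on_demand"] - row["spot"]
    return report
```

The report assumes every workload could run on spot capacity, which overstates savings for jobs that cannot tolerate interruption; a fuller version would tag workloads by interruptibility.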
Self-Service ML Deployment CLI
Difficulty: Intermediate
Build a Python CLI tool (using Click or Typer) that allows ML engineers to deploy models with a single command (e.g., 'ml-deploy deploy --model ./model.pkl --framework sklearn'). The CLI should handle container image building, Kubernetes manifest generation, secret injection, and deployment verification. Include support for rollback and status checking.
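A skeleton of the command surface might look like the following. To stay dependency-free, this sketch uses the standard library's `argparse` instead of Click or Typer. The `--name` flag, the framework choices beyond `sklearn`, and the step names returned by `plan` are assumptions, and the steps are only listed, not executed.

```python
import argparse


def build_parser():
    """Define the ml-deploy command line: deploy, status, rollback."""
    parser = argparse.ArgumentParser(prog="ml-deploy")
    sub = parser.add_subparsers(dest="command", required=True)

    deploy = sub.add_parser("deploy", help="build and deploy a model")
    deploy.add_argument("--model", required=True, help="path to the model artifact")
    deploy.add_argument("--framework", required=True,
                        choices=["sklearn", "pytorch", "onnx"])

    for name in ("status", "rollback"):
        cmd = sub.add_parser(name)
        cmd.add_argument("--name", required=True, help="deployment name")
    return parser


def plan(args):
    """Return the ordered steps the CLI would run (no side effects)."""
    if args.command == "deploy":
        return ["build-image", "generate-manifests", "inject-secrets",
                "apply", "verify"]
    if args.command == "rollback":
        return ["restore-previous-revision", "verify"]
    return ["read-status"]
```

Separating `plan` from execution makes a `--dry-run` mode nearly free and lets you test the command logic without a cluster.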
Multi-Tenant AI Platform with Quota Management
Difficulty: Advanced
Design and implement a multi-tenant AI platform on Kubernetes with namespace-based isolation, ResourceQuotas for GPU/CPU/memory per team, a simple web portal (or API) for tenant onboarding and quota requests, per-namespace monitoring dashboards, and a basic billing/metering system that tracks resource consumption by tenant.
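Tenant onboarding usually ends in generating a ResourceQuota for the new namespace. The sketch below builds one as a dictionary. Kubernetes quotas for extended resources such as GPUs use the `requests.` prefix, hence `requests.nvidia.com/gpu`. The naming scheme and the `within_quota` pre-check helper are hypothetical; the API server enforces the quota itself at admission time.

```python
def tenant_quota(namespace, cpu, memory, gpus):
    """Build a ResourceQuota manifest capping a tenant's total requests."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": f"{namespace}-quota", "namespace": namespace},
        "spec": {
            "hard": {
                "requests.cpu": str(cpu),
                "requests.memory": memory,
                # Extended resources are quota-limited via the requests. prefix.
                "requests.nvidia.com/gpu": str(gpus),
            }
        },
    }


def within_quota(quota, requested_gpus, used_gpus):
    """Portal-side pre-check: would this request fit in the GPU quota?"""
    hard = int(quota["spec"]["hard"]["requests.nvidia.com/gpu"])
    return used_gpus + requested_gpus <= hard
```

A pre-check like this lets the portal reject an oversized request with a helpful message before anything is submitted to the cluster.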
LLM Gateway with Provider Abstraction and Caching
Difficulty: Advanced
Build an API gateway that abstracts multiple LLM providers (OpenAI, Anthropic, self-hosted vLLM) behind a unified interface. Implement intelligent routing (by model capability, cost, latency), semantic caching for repeated queries, rate limiting per team, token usage tracking, and automatic failover between providers. Deploy on Kubernetes with proper observability.
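The failover and caching behavior can be prototyped without calling any real provider. In the sketch below, providers are plain callables tried in priority order, and the cache is a simplification: it matches on the normalized prompt text, not on semantic similarity, which would require an embedding lookup. The `Gateway` class and both stub providers are hypothetical.

```python
class Gateway:
    def __init__(self, providers):
        # providers: list of (name, callable) pairs in priority order.
        self.providers = providers
        self.cache = {}
        self.usage = {}  # team -> number of upstream calls made

    def complete(self, team, prompt):
        """Return (provider_name, text), using the cache or first healthy provider."""
        key = " ".join(prompt.lower().split())  # normalize case and whitespace
        if key in self.cache:
            return self.cache[key]
        errors = []
        for name, call in self.providers:
            try:
                text = call(prompt)
            except Exception as exc:
                # Record the failure and fall through to the next provider.
                errors.append(f"{name}: {exc}")
                continue
            self.usage[team] = self.usage.get(team, 0) + 1
            self.cache[key] = (name, text)
            return name, text
        raise RuntimeError("all providers failed: " + "; ".join(errors))


def flaky_provider(prompt):
    """Stub provider that always times out."""
    raise TimeoutError("upstream timeout")


def echo_provider(prompt):
    """Stub provider that echoes the prompt back."""
    return "echo: " + prompt
```

A gateway built as `Gateway([("primary", flaky_provider), ("fallback", echo_provider)])` falls through to the second provider on the first call and serves later equivalent prompts from the cache without another upstream call.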
MLOps Pipeline with Automated Evaluation Gates
Difficulty: Beginner
Create an end-to-end ML pipeline using Argo Workflows or Kubeflow Pipelines that automates data validation, model training, evaluation against holdout metrics, conditional deployment (only if quality thresholds are met), and post-deployment smoke testing. Track all runs in MLflow.
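The conditional deployment step reduces to comparing holdout metrics against minimum thresholds. A minimal sketch of such a gate follows; the metric names used in the example are illustrative, and it assumes every metric is "higher is better".

```python
def evaluation_gate(metrics, thresholds):
    """Return (passed, failures) comparing metrics against minimum thresholds."""
    failures = []
    for name, minimum in thresholds.items():
        value = metrics.get(name)
        if value is None:
            # A missing metric fails the gate; it never passes silently.
            failures.append(f"{name}: missing")
        elif value < minimum:
            failures.append(f"{name}: {value} < {minimum}")
    return (not failures, failures)
```

Inside a pipeline step, you would typically exit with a non-zero status when `passed` is false so that the workflow engine skips the deployment stage, and log `failures` to the MLflow run for later inspection.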
Infrastructure as Code for a Complete AI Platform
Difficulty: Advanced
Write comprehensive Terraform or Pulumi modules that provision a complete AI platform from scratch: Kubernetes cluster with GPU node pools, VPC/networking, vector database, model registry, monitoring stack, CI/CD runners, and secret management. Include environment separation (dev/staging/prod) and documentation for team onboarding.