Learning Roadmap
How to Become an AI AIOps Engineer
A step-by-step, phase-based learning path from beginner to job-ready AI AIOps Engineer. Estimated completion: 8 months across 5 phases.
-
Foundations: Linux, Networking, and Cloud Infrastructure
6 weeks
Goals
- Gain fluency in Linux systems administration, networking fundamentals, and one major cloud provider (AWS recommended)
- Understand the pillars of observability: metrics, logs, traces, and profiling
- Deploy and manage a basic Kubernetes cluster and understand pod lifecycle, services, and ingress
Resources
- Linux Foundation LFS201: Linux System Administration
- AWS Solutions Architect Associate certification path
- Kubernetes.io official tutorials and CKA preparation materials
- Book: 'Site Reliability Engineering' by Google (free online)
Milestone: You can deploy a containerized microservice on Kubernetes with basic Prometheus monitoring and Grafana dashboards.
-
Programming, Data Engineering, and Observability Pipelines
6 weeks
Goals
- Build proficiency in Python and Go for operational tooling and data pipeline development
- Design and operate end-to-end observability pipelines using OpenTelemetry, Kafka, and Elasticsearch
- Understand time-series data structures, storage engines (Prometheus TSDB, InfluxDB), and query optimization
Resources
- Book: 'Designing Data-Intensive Applications' by Martin Kleppmann
- OpenTelemetry official documentation and Collector configuration guides
- Confluent Kafka developer certification materials
- FastAPI and asyncio tutorials for building operational APIs
Milestone: You can build a multi-source telemetry ingestion pipeline that collects, transforms, and stores operational data from distributed services.
-
Machine Learning for Operational Data
8 weeks
Goals
- Master time-series forecasting and anomaly detection techniques including statistical methods and deep learning approaches
- Train, evaluate, and deploy ML models for pattern recognition in logs, metrics, and traces
- Understand MLOps fundamentals: experiment tracking, model versioning, feature stores, and serving infrastructure
Resources
- Coursera: 'Machine Learning Specialization' by Andrew Ng
- Google Cloud: 'Machine Learning for Time Series' course
- MLflow documentation and tutorials
- PyTorch Forecasting library and Darts time-series library documentation
- Papers: 'Opprentice' (supervised anomaly detection trained on operator labels), 'LogRobust' (log-based anomaly detection)
Milestone: You can train and deploy an anomaly detection model on production telemetry data with proper evaluation metrics and monitoring.
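As a rough illustration of the statistical end of this phase, the sketch below flags points that deviate sharply from a trailing window and scores the result against labeled incidents. It is a minimal baseline, not a production detector: the window size, threshold, sample latencies, and the `labeled_incidents` list are all assumptions made for the example.

```python
from statistics import mean, stdev

def rolling_zscore_anomalies(values, window=5, threshold=3.0):
    """Return indices whose value deviates from the trailing-window mean
    by more than `threshold` standard deviations."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat history: treat any change at all as anomalous.
            is_anomaly = values[i] != mu
        else:
            is_anomaly = abs(values[i] - mu) / sigma > threshold
        if is_anomaly:
            anomalies.append(i)
    return anomalies

def precision_recall(predicted, actual):
    """Compare detected indices against labeled incident indices."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall

# Hypothetical latency samples (ms) with one labeled incident at index 7.
latency_ms = [100, 102, 99, 101, 100, 103, 98, 400, 101, 100, 99, 102]
labeled_incidents = [7]

detected = rolling_zscore_anomalies(latency_ms, window=5, threshold=3.0)
precision, recall = precision_recall(detected, labeled_incidents)
print(detected, precision, recall)
```

Note that the spike inflates the window's standard deviation for the next few points, which is one reason a baseline like this is usually a starting point for comparison rather than the final model.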
-
LLMs, RAG, and Conversational Operations
6 weeks
Goals
- Build retrieval-augmented generation pipelines that index incident histories, runbooks, and postmortems for intelligent retrieval
- Fine-tune or prompt-engineer LLMs for operational domain tasks like incident summarization, root cause hypothesis generation, and runbook recommendation
- Design conversational interfaces that allow natural-language querying of operational data
Resources
- LangChain documentation and cookbook examples
- Hugging Face NLP course and model hub
- OpenAI API documentation and function-calling guides
- Papers: 'Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks' (Lewis et al.)
- Pinecone or Weaviate vector database documentation
Milestone: You can build an LLM-powered assistant that ingests postmortems and runbooks, then provides context-aware remediation suggestions during incidents.
-
Self-Healing Systems, Automation, and Production Deployment
8 weeks
Goals
- Design closed-loop automation systems that detect, diagnose, and remediate routine, well-understood operational incidents without human intervention
- Implement chaos engineering experiments validated by AI-driven observability
- Build production-grade AIOps pipelines with proper CI/CD, monitoring of ML models, and guardrails to prevent autonomous actions from causing harm
Resources
- Gremlin or LitmusChaos for chaos engineering experiments
- Argo Workflows and Temporal for durable workflow orchestration
- Book: 'Chaos Engineering' by Casey Rosenthal and Nora Jones
- AWS Step Functions or Google Cloud Workflows for serverless automation
- PagerDuty Rundeck for automated runbook execution
Milestone: You can architect and ship a production AIOps system that autonomously detects, correlates, and remediates a class of infrastructure incidents with appropriate human-in-the-loop safeguards.
Practice Projects
Apply your skills with hands-on projects. Each project is labeled with its difficulty.
Intelligent Alert Correlation Engine
Difficulty: Intermediate
Build a stream-processing pipeline that ingests alerts from multiple monitoring sources (Prometheus Alertmanager, CloudWatch, PagerDuty), correlates them using temporal and topological proximity, and produces consolidated incident summaries. Implement clustering using DBSCAN on alert feature vectors and a service dependency graph derived from OpenTelemetry traces.
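The correlation step could be sketched roughly as follows. This is a small pure-Python DBSCAN over a custom distance that adds a time gap to a topology penalty; in a real build you would more likely use a library implementation. The dependency edges, the alert records, the distance weighting, and the `eps` and `min_samples` values are all hypothetical.

```python
from collections import deque

# Hypothetical dependency edges; in the project these come from traces.
DEPENDENCIES = {("checkout", "payments"), ("payments", "db"), ("search", "index")}

def topo_distance(a, b):
    """0 for the same service, 1 for direct neighbours, 2 otherwise."""
    if a == b:
        return 0
    if (a, b) in DEPENDENCIES or (b, a) in DEPENDENCIES:
        return 1
    return 2

def alert_distance(x, y, time_scale=60.0):
    """Time gap (in units of `time_scale` seconds) plus topology penalty."""
    gap = abs(x["ts"] - y["ts"]) / time_scale
    return gap + topo_distance(x["service"], y["service"])

def dbscan(points, eps, min_samples, distance):
    """Minimal DBSCAN; returns one label per point, -1 meaning noise."""
    labels = [None] * len(points)

    def neighbours(i):
        return [j for j in range(len(points))
                if distance(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_samples:
            labels[i] = -1  # provisionally noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = deque(seeds)
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster  # noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = neighbours(j)
            if len(j_neighbours) >= min_samples:
                queue.extend(j_neighbours)  # core point: keep expanding
    return labels

alerts = [
    {"id": "a1", "ts": 0, "service": "db"},
    {"id": "a2", "ts": 20, "service": "payments"},
    {"id": "a3", "ts": 45, "service": "checkout"},
    {"id": "a4", "ts": 30, "service": "search"},
    {"id": "a5", "ts": 900, "service": "db"},
]
labels = dbscan(alerts, eps=1.5, min_samples=2, distance=alert_distance)
print(labels)
```

Here the three alerts along the checkout, payments, and db chain fall into one cluster, while the unrelated search alert and the much later db alert are left as noise.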
RAG-Powered Intelligent Runbook Assistant
Difficulty: Intermediate
Build a retrieval-augmented generation system that indexes historical postmortems, runbooks, and incident records into a vector database (Chroma or Pinecone). Create a conversational interface using LangChain and OpenAI that answers operational questions, suggests remediation steps, and cites specific historical incidents as evidence.
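To make the retrieve-then-cite flow concrete, here is a minimal sketch that swaps the vector database and embeddings for plain bag-of-words cosine similarity, and stops at building the prompt rather than calling an LLM. The documents, their IDs, and the prompt wording are invented for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical knowledge base entries.
DOCS = [
    {"id": "INC-101", "text": "Payments API latency spike caused by database "
                              "connection pool exhaustion; fixed by raising pool size."},
    {"id": "INC-102", "text": "Certificate expiry on ingress controller caused TLS "
                              "handshake failures; fixed by rotating the certificate."},
    {"id": "RB-7", "text": "Runbook: restart pods stuck in CrashLoopBackOff after "
                           "checking recent deploys."},
]

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query, documents, k=2):
    """Return up to k documents with non-zero similarity, best first."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(d["text"]))), d) for d in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for score, d in scored[:k] if score > 0]

def build_prompt(question, retrieved):
    """Put retrieved records in the prompt, tagged with IDs the model can cite."""
    context = "\n".join(f"[{d['id']}] {d['text']}" for d in retrieved)
    return (
        "Answer using only the incidents below and cite their IDs.\n"
        f"{context}\n"
        f"Question: {question}"
    )

question = "database connection pool exhaustion on payments"
top = retrieve(question, DOCS, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

Keeping the record IDs in the context is what lets the assistant cite specific historical incidents, and lets you check afterwards that every cited ID was actually retrieved.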
Predictive Auto-Scaling System
Difficulty: Advanced
Design an ML-driven auto-scaling system that forecasts service load 30-60 minutes ahead using time-series models (Prophet, NeuralProphet) and triggers Kubernetes HPA adjustments proactively. Include capacity planning dashboards, cost impact analysis, and a feedback loop that compares predicted vs actual load to continuously improve forecast accuracy.
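The forecast-to-replicas step and the feedback metric can be sketched without any ML library. The example below uses a seasonal-naive forecast as a deliberately simple stand-in for Prophet or NeuralProphet, and does not talk to Kubernetes at all; the per-pod capacity, headroom factor, replica bounds, and load numbers are assumptions.

```python
import math

def seasonal_naive_forecast(history, season_length, horizon):
    """Forecast by repeating the last full season (stand-in for a real model)."""
    last_season = history[-season_length:]
    return [last_season[i % season_length] for i in range(horizon)]

def replicas_for_load(forecast_rps, rps_per_pod, min_replicas, max_replicas,
                      headroom=1.2):
    """Pods needed to serve the forecast load with some headroom, clamped."""
    needed = math.ceil(forecast_rps * headroom / rps_per_pod)
    return max(min_replicas, min(max_replicas, needed))

def mape(predicted, actual):
    """Mean absolute percentage error for the predicted-vs-actual feedback loop."""
    return sum(abs(p - a) / a for p, a in zip(predicted, actual)) / len(actual)

# Hypothetical requests-per-second history with a repeating 4-step pattern.
history = [100, 200, 400, 200] * 3
forecast = seasonal_naive_forecast(history, season_length=4, horizon=3)

# Scale for the peak of the forecast window, assuming 50 rps per pod.
replicas = replicas_for_load(max(forecast), rps_per_pod=50,
                             min_replicas=2, max_replicas=20)

# Later, compare the forecast against what actually happened.
error = mape(forecast, [110, 190, 400])
print(forecast, replicas, round(error, 4))
```

Tracking an error metric like this per service is one way to decide when a forecast is trustworthy enough to drive scaling and when to fall back to reactive scaling.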
Automated Root Cause Analysis Pipeline
Difficulty: Advanced
Build an end-to-end RCA system that automatically ingests metrics, logs, and traces during an incident, applies causal discovery algorithms to identify the most likely root cause, and generates a structured incident report. Use Granger causality tests on metric time series and trace-based dependency analysis to construct causal graphs.
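Before reaching for a full Granger causality test, the underlying intuition (does one metric's past line up with another metric's present?) can be explored with lagged correlation. The sketch below is that simpler technique, not a Granger test: it finds the lag at which a candidate cause best correlates with a later effect, and it carries no statistical significance. The two metric series are fabricated for the example.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5

def best_lead(cause, effect, max_lag=3):
    """Return (lag, r) where cause[t] correlates most strongly with effect[t + lag]."""
    best = (0, 0.0)
    for lag in range(1, max_lag + 1):
        r = pearson(cause[:-lag], effect[lag:])
        if abs(r) > abs(best[1]):
            best = (lag, r)
    return best

# Hypothetical metrics: API errors echo database latency two steps later.
db_latency = [1, 1, 5, 1, 1, 1, 1, 1, 1, 7, 1, 1, 1, 1]
api_errors = [1, 1] + db_latency[:-2]

lag, r = best_lead(db_latency, api_errors, max_lag=3)
print(lag, round(r, 3))
```

A strong lead at some lag only nominates an edge for the causal graph. A Granger test additionally asks whether the candidate's past values improve a prediction of the effect beyond the effect's own history, and trace-based dependency data helps rule out edges between services that never call each other.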
Self-Healing Infrastructure with Guardrails
Difficulty: Advanced
Design a closed-loop automation system that detects known failure patterns (e.g., OOMKilled pods, certificate expiry, disk pressure) and executes pre-approved remediation actions (restart, rotate, expand). Implement a safety framework with blast radius limits, dry-run modes, mandatory approvals for high-risk actions, and comprehensive audit logging. Test with chaos engineering experiments.
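One possible shape for the safety framework is a single gate that every proposed action passes through before anything executes. The sketch below is an assumption about how that gate might look, covering the four controls named above; the class names, risk levels, and decision strings are hypothetical, and actual execution is left out.

```python
from dataclasses import dataclass, field

@dataclass
class Action:
    name: str      # e.g. "restart_pod"
    targets: list  # resources the action would touch
    risk: str      # "low" or "high"

@dataclass
class Guardrail:
    max_targets: int = 3   # blast radius limit
    dry_run: bool = True   # report instead of executing
    audit_log: list = field(default_factory=list)

    def evaluate(self, action, approved_by=None):
        """Decide what may happen to an action and record the decision."""
        if len(action.targets) > self.max_targets:
            decision = "rejected: blast radius exceeded"
        elif action.risk == "high" and approved_by is None:
            decision = "pending: approval required"
        elif self.dry_run:
            decision = "dry-run: would execute"
        else:
            decision = "execute"
        # Every decision is logged, including rejections.
        self.audit_log.append({
            "action": action.name,
            "targets": list(action.targets),
            "approved_by": approved_by,
            "decision": decision,
        })
        return decision

guard = Guardrail(max_targets=3, dry_run=False)
restart = Action("restart_pod", ["pod-a"], risk="low")
rotate = Action("rotate_certificate", ["ingress-1"], risk="high")
mass_restart = Action("restart_pod", ["p1", "p2", "p3", "p4"], risk="low")

print(guard.evaluate(restart))
print(guard.evaluate(rotate))
print(guard.evaluate(rotate, approved_by="oncall"))
print(guard.evaluate(mass_restart))
```

Checking the blast radius first means an approval cannot override the size limit, which is a design choice you may or may not want for your own system.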
Multi-Cloud Observability Normalization Layer
Difficulty: Intermediate
Build a unified observability abstraction layer using OpenTelemetry Collectors that normalizes metrics, logs, and traces from AWS CloudWatch, GCP Cloud Operations, and Azure Monitor into a common schema. Implement a federated query interface that allows searching and correlating data across all three clouds through a single Grafana dashboard.
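The core of the normalization idea is a per-source field mapping plus unit conversion into one schema. The sketch below illustrates only that idea in plain Python: the input record shapes and field names are invented placeholders and do not match the real CloudWatch, Cloud Operations, or Azure Monitor payloads, and in the project this logic would live in Collector processors rather than application code.

```python
# Hypothetical, simplified record shapes; real provider payloads differ.
FIELD_MAPS = {
    "aws": {"name": "metric", "value": "reading", "unit": "unit", "ts": "time"},
    "gcp": {"name": "type", "value": "point", "unit": "units", "ts": "end_time"},
    "azure": {"name": "metric_name", "value": "average", "unit": "unit", "ts": "stamp"},
}

UNIT_TO_SECONDS = {"s": 1.0, "ms": 0.001, "us": 0.000001}

def normalize(record, cloud):
    """Map a source-specific duration metric into the common schema."""
    fields = FIELD_MAPS[cloud]
    unit = record[fields["unit"]]
    return {
        "cloud": cloud,
        "name": record[fields["name"]].lower(),
        "value_seconds": record[fields["value"]] * UNIT_TO_SECONDS[unit],
        "timestamp": record[fields["ts"]],
    }

raw = [
    ("aws", {"metric": "Latency", "reading": 250, "unit": "ms", "time": 1700000000}),
    ("gcp", {"type": "latency", "point": 0.3, "units": "s", "end_time": 1700000005}),
    ("azure", {"metric_name": "LATENCY", "average": 200000, "unit": "us", "stamp": 1700000010}),
]
normalized = [normalize(record, cloud) for cloud, record in raw]
for row in normalized:
    print(row)
```

Once every record shares a name, a unit, and a timestamp field, a single dashboard query can compare the same metric across all three sources.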
LLM-Powered Incident Chatbot with Action Execution
Difficulty: Beginner
Build a Slack bot powered by an LLM that can answer questions about current system health by querying Prometheus and Grafana APIs, summarize recent incidents from PagerDuty, and execute safe read-only diagnostic commands. Focus on grounding LLM responses in real telemetry data and preventing hallucination through structured output parsing.
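Structured output parsing here can be as simple as requiring the model to answer with JSON and refusing anything outside an allowlist of read-only commands. The sketch below assumes a JSON shape of `{"command": ..., "args": {...}}` and three made-up command names; neither comes from a specific LLM API.

```python
import json

# Hypothetical read-only diagnostics the bot is allowed to run.
ALLOWED_COMMANDS = {"get_pod_status", "query_metric", "list_recent_incidents"}

def parse_tool_call(llm_output):
    """Validate raw LLM text and return (command, args), or raise ValueError."""
    try:
        payload = json.loads(llm_output)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(payload, dict):
        raise ValueError("expected a JSON object")
    command = payload.get("command")
    args = payload.get("args", {})
    if not isinstance(command, str) or command not in ALLOWED_COMMANDS:
        raise ValueError(f"command not allowed: {command!r}")
    if not isinstance(args, dict):
        raise ValueError("args must be a JSON object")
    return command, args

# A well-formed, allowed request passes through unchanged.
command, args = parse_tool_call(
    '{"command": "query_metric", "args": {"name": "http_5xx_rate"}}'
)
print(command, args)

# Free text and unlisted commands are rejected before anything runs.
for bad in ['restart everything please', '{"command": "delete_pod"}']:
    try:
        parse_tool_call(bad)
    except ValueError as err:
        print("rejected:", err)
```

Because the bot only acts on output that survives this check, a hallucinated command name fails closed instead of reaching the cluster.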