AI Work Order Automation Specialist
An AI Work Order Automation Specialist designs, deploys, and optimizes intelligent systems that automatically generate, classify, …
Skill Guide
The systematic practice of collecting, analyzing, and acting on operational data from automated systems to measure their performance, diagnose issues, and iteratively enhance their reliability, efficiency, and business impact.
Scenario
You have a Python Flask or Node.js Express API endpoint (`/api/users`). You need to monitor its health and performance.
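One way to start is to wrap the handler so every call records its outcome and latency. The sketch below uses only the standard library and is framework-agnostic. The names `monitored`, `REQUEST_COUNT`, `LATENCIES_MS`, `list_users`, and `health` are hypothetical, and the in-memory store stands in for whatever metrics backend the service would actually export to.

```python
import time
from collections import defaultdict
from functools import wraps

# Hypothetical in-process metric store (assumption: a real service would
# export these values to a metrics backend instead of keeping them in memory).
REQUEST_COUNT = defaultdict(int)   # (endpoint, outcome) -> number of calls
LATENCIES_MS = defaultdict(list)   # endpoint -> observed durations in ms


def monitored(endpoint):
    """Record the outcome and latency of every call to the wrapped handler."""
    def decorator(handler):
        @wraps(handler)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            outcome = "error"  # assume failure until the handler returns
            try:
                result = handler(*args, **kwargs)
                outcome = "success"
                return result
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1000.0
                REQUEST_COUNT[(endpoint, outcome)] += 1
                LATENCIES_MS[endpoint].append(elapsed_ms)
        return wrapper
    return decorator


@monitored("/api/users")
def list_users(fail=False):
    # Stand-in for the real handler body (hypothetical).
    if fail:
        raise RuntimeError("database unavailable")
    return [{"id": 1, "name": "Ada"}]


def health(endpoint):
    """Summarize request count and error rate for one endpoint."""
    ok = REQUEST_COUNT[(endpoint, "success")]
    failed = REQUEST_COUNT[(endpoint, "error")]
    total = ok + failed
    return {
        "requests": total,
        "error_rate": failed / total if total else 0.0,
    }
```

In a Flask or Express service the same idea would usually live in middleware or request hooks rather than a per-handler decorator, so that every route is covered by default.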
Scenario
Your 'Payment Service' has a business-defined SLO: 99.9% of requests must complete successfully within 200ms over a 30-day rolling window.
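To make the SLO concrete, it helps to turn it into an error budget: with a 99.9% target, 0.1% of requests in the window may be bad. The following is a minimal sketch, assuming a request is "good" only if it succeeded and finished in at most 200 ms (treating the boundary as inclusive is an assumption). `Request`, `is_good`, and `error_budget_report` are illustrative names, not part of any library.

```python
from dataclasses import dataclass

SLO_TARGET = 0.999           # 99.9% of requests, from the scenario
LATENCY_THRESHOLD_MS = 200   # assumption: exactly 200 ms still counts as good


@dataclass
class Request:
    succeeded: bool
    latency_ms: float


def is_good(req):
    """A request is good only if it succeeded AND met the latency threshold."""
    return req.succeeded and req.latency_ms <= LATENCY_THRESHOLD_MS


def error_budget_report(requests):
    """Compute the SLI and error-budget consumption for one window of requests."""
    total = len(requests)
    good = sum(1 for r in requests if is_good(r))
    bad = total - good
    allowed_bad = total * (1 - SLO_TARGET)  # the error budget, in requests
    return {
        "sli": good / total if total else 1.0,
        "allowed_bad": allowed_bad,
        "bad": bad,
        "budget_consumed": bad / allowed_bad if allowed_bad else 0.0,
    }
```

For example, a window of 10,000 requests allows about 10 bad ones. If 3 failed and 3 were slower than 200 ms, 6 of those 10 are spent, so roughly 60% of the budget is consumed. In practice the input would be the requests from the trailing 30 days, re-evaluated as the window rolls.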
Scenario
Your organization runs a high-traffic e-commerce platform. A common incident is the 'Product Catalog Service' becoming overloaded during flash sales, causing checkout failures.
Used for collecting, storing, querying, and visualizing the three pillars of observability: metrics, logs, and traces. Choose based on scale, cost model (per-host vs. per-GB), and ecosystem integration. The Grafana stack is an open-source alternative to commercial platforms.
OpenTelemetry (OTel) is the vendor-neutral standard for generating and exporting telemetry data. Prometheus is the de facto standard for metric scraping and storage, especially in Kubernetes environments.
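Prometheus scrapes metrics by fetching a plain-text page from each target. As a rough illustration of what that page contains, the sketch below renders one counter family in the text exposition format. `render_prometheus` is a hypothetical helper written for this example; real services would normally rely on an official client library, and this sketch does not escape special characters in label values.

```python
def render_prometheus(name, help_text, samples):
    """Render one counter family in the Prometheus text exposition format.

    `samples` maps a tuple of (label_name, label_value) pairs to a number.
    Simplification: label values are not escaped, so they must not contain
    backslashes, double quotes, or newlines.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in sorted(samples.items()):
        if labels:
            label_str = ",".join(f'{k}="{v}"' for k, v in labels)
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            # A sample without labels is written without braces.
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Example data (made up for illustration).
page = render_prometheus(
    "http_requests_total",
    "Total HTTP requests.",
    {
        (("path", "/api/users"), ("status", "200")): 42,
        (("path", "/api/users"), ("status", "500")): 1,
    },
)
```

The `path` and `status` labels are what later make it possible to query error rates per endpoint.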
Used to manage alerts, escalate incidents, run post-mortems, and track the 'reliability debt' work items that the feedback loop produces. These tools integrate with observability platforms.
These are mental models for knowing *what* to measure. The USE method (Utilization, Saturation, Errors) applies to resources such as CPU and memory. The RED method (Rate, Errors, Duration) and the Four Golden Signals (latency, traffic, errors, saturation) apply to request-driven services. The SRE Workbook provides the playbook for implementing the feedback loop.
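As an illustration of the RED method, the sketch below derives rate, error ratio, and duration percentiles from a window of request records. The function names and the `(succeeded, duration_ms)` record shape are assumptions made for this example, and the nearest-rank percentile is just one of several common definitions.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list; pct is in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def red_summary(requests, window_seconds):
    """Summarize Rate, Errors, and Duration for one observation window.

    `requests` is a list of (succeeded, duration_ms) tuples (assumed shape).
    """
    durations = [duration for _, duration in requests]
    errors = sum(1 for ok, _ in requests if not ok)
    return {
        "rate_per_s": len(requests) / window_seconds,                   # Rate
        "error_ratio": errors / len(requests) if requests else 0.0,     # Errors
        "p50_ms": percentile(durations, 50) if durations else None,     # Duration
        "p95_ms": percentile(durations, 95) if durations else None,
    }
```

Percentiles are used for duration rather than the mean because a small number of very slow requests can hide behind a healthy-looking average.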