AI Logging & Monitoring Engineer
An AI Logging & Monitoring Engineer designs, implements, and maintains the critical observability infrastructure for AI/ML systems…
Skill Guide
The systematic process of creating clear, actionable, and version-controlled written artifacts that explain systems, procedures, and incident response steps to enable operational efficiency and knowledge transfer.
Scenario
You have built a simple REST API for a todo list application. The project repository lacks any documentation.
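A practical first step is a README whose endpoint table is generated from a short list of routes, so the table can be regenerated whenever routes change instead of drifting out of date. This is a minimal sketch, assuming a hypothetical `Endpoint` record and example todo routes. The run command `python app.py` is also a placeholder for however the API is actually started.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    """One API route, described by hand (hypothetical example data)."""
    method: str
    path: str
    summary: str


def render_readme(project: str, run_cmd: str, endpoints: list[Endpoint]) -> str:
    """Build a Markdown README skeleton with the sections a newcomer needs first."""
    lines = [
        f"# {project}",
        "",
        "## Quick start",
        "",
        f"Run locally: `{run_cmd}`",
        "",
        "## Endpoints",
        "",
        "| Method | Path | Description |",
        "| --- | --- | --- |",
    ]
    # One table row per route keeps the reference scannable.
    for ep in endpoints:
        lines.append(f"| {ep.method} | `{ep.path}` | {ep.summary} |")
    lines += ["", "## Troubleshooting", "", "_Document known failure modes here._", ""]
    return "\n".join(lines)


if __name__ == "__main__":
    todo_routes = [
        Endpoint("GET", "/todos", "List all todo items"),
        Endpoint("POST", "/todos", "Create a todo item"),
        Endpoint("DELETE", "/todos/{id}", "Delete a todo item"),
    ]
    print(render_readme("Todo API", "python app.py", todo_routes))
```

Committing the generated README alongside the code means documentation changes show up in the same diff as route changes.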
Scenario
Your team manages a PostgreSQL database cluster with a primary and two read replicas. An automated alert fires: 'Primary database node unresponsive.'
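A runbook for this alert should confirm the failure before anyone fails over. The sketch below probes the primary's TCP port and maps the result, plus per-replica replication lag, to a next step. The function names, the 30-second lag budget, and the action wording are hypothetical policy, not PostgreSQL defaults. Lag could be measured on each replica with a query such as `SELECT now() - pg_last_xact_replay_timestamp();`, and the promotion itself should follow your cluster's own tooling.

```python
import socket


def tcp_reachable(host: str, port: int = 5432, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    Reachability is only a first signal: an open port does not prove the
    database is accepting queries.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timeouts, and DNS failures all subclass OSError.
        return False


def triage(primary_up: bool, replica_lag_s: dict[str, float], max_lag_s: float = 30.0) -> str:
    """Map probe results to the runbook's next step (hypothetical policy)."""
    if primary_up:
        return "CHECK: primary accepts TCP connections; inspect load, locks, and logs before considering failover."
    candidates = {name: lag for name, lag in replica_lag_s.items() if lag <= max_lag_s}
    if not candidates:
        return "ESCALATE: no replica within the lag budget; page the on-call DBA and do not promote."
    # Promote the replica that has lost the least data.
    best = min(candidates, key=candidates.get)
    return f"FAILOVER: promote {best} (lag {candidates[best]:.0f}s) following the promotion runbook."
```

Encoding the decision in code makes the runbook reviewable and testable, and the escalation branch stops a tired responder from promoting a badly lagged replica.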
Scenario
As a platform lead, you need to ensure critical operational procedures are not only documented but also executable and actively maintained across 50+ microservices.
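Keeping runbooks actively maintained across 50+ services needs a mechanical check, not goodwill. This is a minimal sketch, assuming each runbook's last-review date can be collected into a dictionary (from git history, front matter, or a service catalog; the source is left open) and assuming a hypothetical 90-day review policy. The function name `audit_runbooks` is illustrative.

```python
from datetime import date, timedelta


def audit_runbooks(
    services: list[str],
    last_reviewed: dict[str, date],
    today: date,
    max_age_days: int = 90,
) -> dict[str, list[str]]:
    """Split services into those with no runbook and those overdue for review.

    `last_reviewed` maps service name -> date its runbook was last reviewed.
    A runbook reviewed exactly `max_age_days` ago still counts as current.
    """
    cutoff = today - timedelta(days=max_age_days)
    missing = sorted(s for s in services if s not in last_reviewed)
    stale = sorted(s for s in services if s in last_reviewed and last_reviewed[s] < cutoff)
    return {"missing": missing, "stale": stale}


if __name__ == "__main__":
    report = audit_runbooks(
        ["billing", "auth", "search"],
        {"billing": date(2024, 5, 20), "auth": date(2023, 12, 1)},
        today=date.today(),
    )
    print(report)
```

Run on a schedule, the job can fail or open tickets whenever either list is non-empty, which turns "maintained" into something measurable per service.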
Use Git-based wikis for version-controlled documentation that lives next to the code. Use Confluence or Notion for broader team collaboration. Use Swagger (OpenAPI) tooling to generate API reference docs from code annotations. Use Read the Docs to build, version, and host documentation from Sphinx or MkDocs projects.
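To illustrate what generating reference docs from code annotations means, here is a toy, stdlib-only sketch that turns handler docstrings and type hints into an OpenAPI 3.0 document, the format Swagger UI renders. The `route` decorator, `_paths` registry, and type mapping are hypothetical. In a real project you would rely on your framework's OpenAPI integration (FastAPI, for example, builds this document from type hints automatically) rather than hand-rolling one.

```python
import inspect
import json
from typing import Callable

# Registry of path -> {method: operation}; a toy stand-in for what
# annotation-driven tools build. This is NOT the Swagger tooling itself.
_paths: dict[str, dict] = {}
_PY_TO_OPENAPI = {int: "integer", str: "string", bool: "boolean", float: "number"}


def route(method: str, path: str) -> Callable:
    """Record a handler's docstring and typed parameters as an OpenAPI operation."""
    def register(func: Callable) -> Callable:
        doc = inspect.getdoc(func) or ""
        params = []
        for name, p in inspect.signature(func).parameters.items():
            # Names that appear in the path template are path params; others are query params.
            location = "path" if "{" + name + "}" in path else "query"
            params.append({
                "name": name,
                "in": location,
                # OpenAPI requires path params to be required; query params are
                # required only when the handler gives them no default.
                "required": location == "path" or p.default is inspect.Parameter.empty,
                "schema": {"type": _PY_TO_OPENAPI.get(p.annotation, "string")},
            })
        _paths.setdefault(path, {})[method.lower()] = {
            "summary": doc.splitlines()[0] if doc else "",
            "parameters": params,
            "responses": {"200": {"description": "OK"}},
        }
        return func
    return register


def openapi_document(title: str, version: str) -> dict:
    """Assemble a minimal OpenAPI 3.0 document from every registered route."""
    return {"openapi": "3.0.3", "info": {"title": title, "version": version}, "paths": _paths}


@route("GET", "/todos/{todo_id}")
def get_todo(todo_id: int):
    """Fetch a single todo item by id."""


@route("GET", "/todos")
def list_todos(done: bool = False):
    """List todo items, optionally filtered by completion state."""


if __name__ == "__main__":
    print(json.dumps(openapi_document("Todo API", "1.0.0"), indent=2))
```

Because the summary comes from the docstring and the schema from the type hint, changing a handler's signature changes the published reference in the same commit.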
Rundeck and StackStorm are dedicated platforms for defining, scheduling, and running operational workflows, with role-based access control (RBAC) and audit trails. AWS Systems Manager (SSM) Documents run remediation scripts across your fleet without dedicated execution servers, provided the SSM Agent is running on each managed node. Ansible offers agentless, playbook-driven execution across heterogeneous infrastructure.
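The idea these tools share (ordered steps that run unattended, stop on failure, and leave an audit trail) fits in a few lines. This is a hypothetical, tool-agnostic sketch, not the API of Rundeck, StackStorm, SSM, or Ansible: `Step`, `AuditEntry`, and `run_runbook` are illustrative names, and a `subprocess.TimeoutExpired` from a hung step is left to propagate to the caller.

```python
import shlex
import subprocess
import sys
import time
from dataclasses import dataclass


@dataclass
class Step:
    name: str
    command: list[str]


@dataclass
class AuditEntry:
    step: str
    command: str
    exit_code: int
    duration_s: float
    output: str


def run_runbook(steps: list[Step], dry_run: bool = False, timeout_s: float = 60.0) -> list[AuditEntry]:
    """Run steps in order, stop at the first failure, and return an audit trail.

    With dry_run=True nothing executes; each step is logged with exit code 0
    so reviewers can see exactly what would run.
    """
    trail: list[AuditEntry] = []
    for step in steps:
        printable = shlex.join(step.command)
        if dry_run:
            trail.append(AuditEntry(step.name, printable, 0, 0.0, "(dry run)"))
            continue
        start = time.monotonic()
        proc = subprocess.run(step.command, capture_output=True, text=True, timeout=timeout_s)
        trail.append(AuditEntry(step.name, printable, proc.returncode,
                                round(time.monotonic() - start, 3), proc.stdout + proc.stderr))
        if proc.returncode != 0:
            break  # fail fast: later steps may assume earlier ones succeeded
    return trail


if __name__ == "__main__":
    demo = [
        Step("check interpreter", [sys.executable, "--version"]),
        Step("show disk usage", [sys.executable, "-c", "import shutil; print(shutil.disk_usage('/'))"]),
    ]
    for entry in run_runbook(demo):
        print(f"{entry.step}: exit={entry.exit_code} ({entry.duration_s}s)")
```

The dry-run mode is what lets a runbook be exercised in CI without touching production; the dedicated platforms add scheduling, RBAC, and durable audit storage on top of this pattern.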
Diátaxis provides a taxonomy for structuring docs into four types: Tutorials, How-to guides, Explanation, and Reference. Architecture Decision Records (ADRs) capture the 'why' behind technical decisions. Treat docs as code: require pull-request review for documentation changes to ensure technical accuracy and clarity.
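Treating docs as code means a PR check can enforce the taxonomy before a human reviewer even looks. This is a minimal lint, assuming each Markdown page starts with a front-matter block containing a hypothetical `doc_type` key set to one of the four Diátaxis types. The key name, the `docs` default directory, and the function names are conventions invented for this sketch, not part of Diátaxis itself.

```python
import pathlib
import re
import sys

# The four Diátaxis documentation types, spelled as this sketch expects them.
DIATAXIS_TYPES = {"tutorial", "how-to", "explanation", "reference"}
_FRONT_MATTER = re.compile(r"\A---\n(.*?)\n---[ \t]*(?:\n|\Z)", re.DOTALL)


def lint_doc(path: str, text: str) -> list[str]:
    """Return human-readable errors for one Markdown page; an empty list means it passes."""
    match = _FRONT_MATTER.match(text)
    if not match:
        return [f"{path}: missing front matter block"]
    fields = {}
    for line in match.group(1).splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    doc_type = fields.get("doc_type", "").lower()
    if doc_type not in DIATAXIS_TYPES:
        return [f"{path}: doc_type must be one of {sorted(DIATAXIS_TYPES)}, got {doc_type!r}"]
    return []


def lint_tree(root: str) -> list[str]:
    """Lint every Markdown file under `root`, in a stable order."""
    errors: list[str] = []
    for path in sorted(pathlib.Path(root).rglob("*.md")):
        errors += lint_doc(str(path), path.read_text(encoding="utf-8"))
    return errors


if __name__ == "__main__":
    problems = lint_tree(sys.argv[1] if len(sys.argv) > 1 else "docs")
    print("\n".join(problems))
    sys.exit(1 if problems else 0)
```

Wired in as a required status check, the non-zero exit blocks the merge until every page declares its type, so review time goes to technical accuracy rather than structure.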