AI ETL Automation Engineer
An AI ETL Automation Engineer designs, builds, and maintains intelligent data pipelines that leverage large language models, embed…
Skill Guide
The systematic design and implementation of fault tolerance for non-deterministic AI model invocations, ensuring workflow continuity through error classification, controlled retry mechanisms, and fallback strategies that maintain core functionality.
Scenario
Build a Python function that calls an LLM API to summarize text, handling 429 (rate limit) and 503 (service unavailable) errors with retries.
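One way to approach this scenario is sketched below, using only the standard library. `ApiError` and the `call_llm` callable are hypothetical stand-ins for whatever exception type and client a real SDK provides, and the retry limit and delays are illustrative defaults rather than recommended values.

```python
import random
import time
from typing import Callable


class ApiError(Exception):
    """Hypothetical error type; real SDKs expose their own exception classes."""

    def __init__(self, status_code: int, message: str = ""):
        super().__init__(message or f"HTTP {status_code}")
        self.status_code = status_code


RETRYABLE_STATUS = {429, 503}  # rate limit, service unavailable


def summarize_with_retries(
    text: str,
    call_llm: Callable[[str], str],  # assumed: takes text, returns a summary
    max_retries: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,  # injectable for tests
) -> str:
    """Call call_llm(text), retrying 429/503 with exponential backoff and jitter."""
    for attempt in range(max_retries + 1):
        try:
            return call_llm(text)
        except ApiError as err:
            # Non-retryable errors and exhausted retries propagate to the caller.
            if err.status_code not in RETRYABLE_STATUS or attempt == max_retries:
                raise
            # Full jitter: wait a random time in [0, min(max_delay, base * 2^attempt)].
            backoff = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, backoff))
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the function testable without real waits. Any status other than 429 or 503 is treated as permanent here and raised immediately.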
Scenario
An AI document processing pipeline has three sequential steps: OCR extraction, entity recognition, and summary generation. A failure in any step should not block the entire pipeline for other requests.
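A minimal sketch of one reading of this scenario: each request runs through the steps on its own, and a failure is recorded on that request's result instead of being raised into the batch loop. The `RequestResult` type, the step names, and the `process_batch` helper are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Step = Tuple[str, Callable[[str], str]]  # (step name, step function)


@dataclass
class RequestResult:
    request_id: str
    output: Optional[str] = None
    failed_step: Optional[str] = None
    error: Optional[str] = None


def run_pipeline(request_id: str, document: str, steps: List[Step]) -> RequestResult:
    """Run steps in order; stop at the first failure and record where it happened."""
    value = document
    for name, step in steps:
        try:
            value = step(value)
        except Exception as exc:  # isolate the failure to this request
            return RequestResult(request_id, failed_step=name, error=str(exc))
    return RequestResult(request_id, output=value)


def process_batch(documents: Dict[str, str], steps: List[Step]) -> List[RequestResult]:
    """Process every request; one failed request does not stop the others."""
    return [run_pipeline(rid, doc, steps) for rid, doc in documents.items()]
```

In a production system the same isolation is usually enforced with separate queues or worker pools per step; this version only shows the error-containment idea.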
Scenario
Build a system that uses GPT-4 for high-stakes content generation but falls back to GPT-3.5 when GPT-4 latency exceeds 5 seconds or when output validation detects low-confidence results.
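The fallback rule in this scenario can be sketched as follows. `primary` and `fallback` are hypothetical callables wrapping the two models, and `is_valid` is an assumed output check; none of them correspond to a specific SDK.

```python
import concurrent.futures
from typing import Callable, Tuple


def generate_with_fallback(
    prompt: str,
    primary: Callable[[str], str],    # assumed wrapper around the stronger model
    fallback: Callable[[str], str],   # assumed wrapper around the cheaper model
    is_valid: Callable[[str], bool],  # assumed output validation check
    latency_budget_s: float = 5.0,
) -> Tuple[str, str]:
    """Return (output, source), where source says which path produced the output."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(primary, prompt)
    try:
        output = future.result(timeout=latency_budget_s)
        if is_valid(output):
            return output, "primary"
        reason = "validation"
    except concurrent.futures.TimeoutError:
        # The slow call keeps running in its worker thread; we stop waiting for it.
        reason = "latency"
    finally:
        executor.shutdown(wait=False)
    return fallback(prompt), f"fallback:{reason}"
```

A thread timeout stops the caller from waiting but does not cancel the in-flight request, so a real client would also set its own request timeout. Exceptions raised by `primary` are left to propagate in this sketch.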
Use `tenacity` for sophisticated retry decorators with exponential backoff and jitter. `pybreaker` implements circuit breaker patterns. Pydantic is critical for validating LLM output structure and content rules before accepting responses.
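To illustrate the validation step, here is a dependency-free sketch of the kind of structure and content check that Pydantic would normally handle. It deliberately uses only the standard library instead of Pydantic, and the `summary` and `confidence` fields and the length limit are assumed for the example.

```python
import json
from dataclasses import dataclass


@dataclass
class Summary:
    summary: str
    confidence: float


def parse_summary(raw: str, max_chars: int = 500) -> Summary:
    """Parse raw LLM output; raise ValueError if structure or content rules fail."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")

    summary = data.get("summary")
    confidence = data.get("confidence")

    # Structure rules: required fields with the right types.
    if not isinstance(summary, str) or not summary.strip():
        raise ValueError("summary must be a non-empty string")
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValueError("confidence must be a number")

    # Content rules: assumed limits for this example.
    if len(summary) > max_chars:
        raise ValueError(f"summary longer than {max_chars} characters")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")

    return Summary(summary=summary.strip(), confidence=float(confidence))
```

A rejected response can then feed the retry or fallback path instead of being passed downstream.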
Implement OpenTelemetry traces to track request flow through AI steps and fallbacks. Use Prometheus to track metrics like `llm_retry_count` and `circuit_breaker_state`. Structure logs with request IDs to trace error propagation.
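The request-ID logging idea can be sketched with the standard `logging` module alone. This is not an OpenTelemetry or Prometheus integration; the logger name and the `request_id`, `step`, and `event` fields are assumptions for the example.

```python
import json
import logging
from typing import TextIO


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object so logs can be filtered by request ID."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "request_id": getattr(record, "request_id", None),
            "step": getattr(record, "step", None),
            "event": getattr(record, "event", None),
        })


def build_logger(stream: TextIO) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("ai_pipeline")  # hypothetical logger name
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def log_event(logger: logging.Logger, request_id: str, step: str,
              event: str, message: str) -> None:
    """Attach the request ID and step to every log line via `extra`."""
    logger.info(message, extra={"request_id": request_id, "step": step, "event": event})
```

Filtering the output on a single `request_id` then shows the retries, fallbacks, and errors for that request in order.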
Circuit breakers prevent cascading failures by failing fast. Bulkheads isolate failures to specific AI components. Fallback chains define ordered degradation paths. Exponential backoff with jitter prevents synchronized retry storms during outages.
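The fail-fast behaviour of a circuit breaker can be sketched in a few lines. This is a simplified, single-threaded illustration and not the `pybreaker` API; the class name, thresholds, and state names are assumptions.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected without running."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock  # injectable so tests can control time
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout_s:
            return "half_open"  # allow one trial call
        return "open"

    def call(self, fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if self.state == "open":
            raise CircuitOpenError("circuit open; failing fast")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            # Open on reaching the threshold, or re-open after a failed trial call.
            if self._failures >= self.failure_threshold or self._opened_at is not None:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and resets the consecutive-failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the breaker is open, callers get `CircuitOpenError` immediately and can route to a fallback instead of waiting on a failing dependency.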
Answer Strategy
Test the candidate's ability to balance resilience with complexity. Use the framework: 1) Classify errors (transient/permanent), 2) Define retry strategy with backoff, 3) Implement circuit breakers, 4) Design fallbacks that preserve core UX. Sample: 'I'd first classify errors: 429/503 get exponential backoff with jitter up to 3 retries. A circuit breaker would open after 5 consecutive failures, routing traffic to a simplified rule-based fallback that extracts key entities without full AI generation. All errors and fallbacks get logged with trace IDs for post-mortems.'
Answer Strategy
Test for real-world experience and data-driven decision-making. Sample: 'On a document processing pipeline, we triggered fallbacks on any of three conditions: latency >2s, model confidence score <0.7, or validation failures on the output schema. We monitored the fallback rate as a key metric: it started at 15%, and after tuning the validation rules we reduced it to 5% while maintaining output quality. This improved overall pipeline throughput by 30% during peak loads.'
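The trigger logic in the sample answer can be written down as a small decision function plus a fallback-rate calculation. The thresholds mirror the sample (2 seconds, 0.7); the function names and the order in which the checks run are assumptions.

```python
from typing import List, Optional


def fallback_reason(
    latency_s: float,
    confidence: float,
    schema_valid: bool,
    max_latency_s: float = 2.0,
    min_confidence: float = 0.7,
) -> Optional[str]:
    """Return why a response should trigger a fallback, or None to accept it."""
    if not schema_valid:
        return "validation"
    if latency_s > max_latency_s:
        return "latency"
    if confidence < min_confidence:
        return "confidence"
    return None


def fallback_rate(reasons: List[Optional[str]]) -> float:
    """Fraction of requests that triggered a fallback (0.0 for an empty list)."""
    if not reasons:
        return 0.0
    return sum(reason is not None for reason in reasons) / len(reasons)
```

Recording the reason alongside the rate shows which trigger dominates, which is what makes tuning the validation rules possible.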