Skill Guide

Error handling, retry logic, and graceful degradation for LLM calls

A set of software design patterns and resilience engineering practices specifically applied to handle the inherent non-determinism, latency, and failure modes of Large Language Model (LLM) API calls.

This skill is critical for building production-grade AI applications because it directly determines system reliability and user experience, preventing costly downtime and data loss. It allows organizations to deploy LLM features at scale with confidence, turning a fragile API call into a dependable component of the business logic.

Careers: 1 | Categories: 1 | Avg Demand: 8.8 | Avg AI Risk: 25%

How to Learn Error handling, retry logic, and graceful degradation for LLM calls

1. Master HTTP status codes (4xx/5xx) and common LLM API error types (rate limits, timeouts, content filters, model overload).
2. Understand the basic structure of an exponential backoff retry loop with jitter.
3. Learn to define and implement basic fallback logic (e.g., switching to a default/cached response or a smaller model).
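The backoff loop from step 2 can be sketched with the standard library alone. This is a minimal illustration, not a production client: `TransientLLMError`, `backoff_delay`, and `call_with_retry` are hypothetical names, and the base delay and cap are assumed values you would tune for your provider.

```python
import random
import time


class TransientLLMError(Exception):
    """Hypothetical stand-in for a retryable failure (429, 5xx, timeout)."""


def backoff_delay(attempt, base=0.5, cap=8.0, rng=random.random):
    """Full-jitter delay: a random value below min(cap, base * 2**attempt)."""
    return rng() * min(cap, base * (2 ** attempt))


def call_with_retry(fn, max_attempts=4, sleep=time.sleep):
    """Call fn(); on a transient error, wait with jittered backoff and retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientLLMError:
            if attempt == max_attempts - 1:
                raise  # attempts exhausted: surface the last error
            sleep(backoff_delay(attempt))
```

The random factor spreads retries from many clients over the whole delay window, so they do not all hit the provider at the same instant. Passing `sleep` as a parameter keeps the loop testable without real waiting.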

1. Implement stateful retry logic using a library like `tenacity`, incorporating circuit breaker patterns (e.g., `pybreaker`) to stop hammering a failing service.
2. Design and implement multi-layered graceful degradation strategies: tiered model fallback (e.g., GPT-4 -> GPT-3.5-Turbo -> internal rule-based system), response quality checks, and feature toggles.
3. Avoid common mistakes: retrying non-idempotent operations without safeguards, using fixed backoff without jitter (which causes thundering herds), and failing to log structured error context for debugging.
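To make the circuit breaker idea in step 1 concrete, here is a hand-rolled sketch of the state machine (closed, open, half-open). It is not the `pybreaker` API; `CircuitBreaker`, `CircuitOpenError`, `fail_max`, and `reset_timeout` are names chosen for this illustration, and the thresholds are assumptions.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the service."""


class CircuitBreaker:
    def __init__(self, fail_max=3, reset_timeout=30.0, clock=time.monotonic):
        self.fail_max = fail_max
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, so one trial call is let through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.fail_max:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers get an immediate `CircuitOpenError` instead of waiting on a provider that is known to be failing, which is the signal to move to a fallback tier.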

1. Architect systems for observability, integrating tracing (OpenTelemetry) and metrics (Prometheus) to monitor LLM error rates, latency percentiles, and retry success rates across the fleet.
2. Design and validate chaos engineering experiments (e.g., injecting latency/errors via a service mesh) to proactively test the resilience of your LLM-dependent workflows.
3. Define organizational standards and create shared, reusable resilience libraries for LLM calls to ensure consistency and reduce cognitive load on development teams.
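The three signals named in step 1 (error rate, latency percentiles, retry success rate) can be prototyped in memory before wiring up a metrics backend. The sketch below is a stand-in for Prometheus counters and histograms, not a replacement for them; `LLMCallMetrics` and its method names are hypothetical, and the percentile uses the simple nearest-rank method.

```python
import math


class LLMCallMetrics:
    def __init__(self):
        self.total = 0
        self.failed = 0
        self.retried = 0      # calls that needed more than one attempt
        self.retried_ok = 0   # retried calls that eventually succeeded
        self.latencies = []

    def record(self, latency_s, attempts, ok):
        """Record one logical LLM call (all of its attempts together)."""
        self.total += 1
        self.latencies.append(latency_s)
        if not ok:
            self.failed += 1
        if attempts > 1:
            self.retried += 1
            if ok:
                self.retried_ok += 1

    def error_rate(self):
        return self.failed / self.total if self.total else 0.0

    def retry_success_rate(self):
        return self.retried_ok / self.retried if self.retried else 0.0

    def latency_percentile(self, p):
        """Nearest-rank percentile of recorded latencies, or None if empty."""
        if not self.latencies:
            return None
        ordered = sorted(self.latencies)
        rank = max(1, math.ceil(p / 100 * len(ordered)))
        return ordered[rank - 1]
```

A low retry success rate suggests the retries are being spent on errors that will not clear on their own, which is a hint to revisit which error types are treated as retryable.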

Practice Projects

Beginner

Project

Build a Resilient LLM Wrapper Client

Scenario

You are tasked with creating a Python class that wraps the OpenAI API. It must handle common errors (rate limits, timeouts, server errors) and return a default 'service unavailable' message if all retries fail.

How to Execute

1. Create a Python class `ResilientLLMClient` with a `generate` method.
2. Inside `generate`, implement a retry loop using `tenacity` with `retry_if_exception_type`, exponential backoff, and a maximum of 3 retries.
3. Add a `fallback_response` method that returns a predefined safe string.
4. Write a `try-except` block in `generate` that calls the fallback once all retries are exhausted (by default `tenacity` raises `RetryError` at that point, or the original exception if `reraise=True` is set). Test by mocking the API to raise errors.
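One possible shape for the class described above is sketched here. To stay runnable with the standard library only, it swaps `tenacity` for a hand-written loop and takes the real API call as an injected callable; `send`, `RetryableLLMError`, `max_attempts`, and `base_delay` are assumptions of this sketch, not part of any SDK.

```python
import random
import time


class RetryableLLMError(Exception):
    """Hypothetical marker for rate limits, timeouts, and 5xx responses."""


class ResilientLLMClient:
    def __init__(self, send, max_attempts=3, base_delay=1.0, sleep=time.sleep):
        self._send = send  # callable(prompt) -> str that wraps the real API
        self._max_attempts = max_attempts
        self._base_delay = base_delay
        self._sleep = sleep

    def fallback_response(self):
        return "The service is temporarily unavailable. Please try again later."

    def generate(self, prompt):
        for attempt in range(self._max_attempts):
            try:
                return self._send(prompt)
            except RetryableLLMError:
                if attempt < self._max_attempts - 1:
                    # Exponential backoff with jitter before the next attempt.
                    self._sleep(random.uniform(0, self._base_delay * 2 ** attempt))
        # Every attempt failed: degrade to the predefined safe message.
        return self.fallback_response()
```

Injecting `send` and `sleep` is what makes the mocking in step 4 straightforward: a test passes a function that always raises and a `sleep` that records delays instead of waiting.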

Intermediate

Project

Implement a Multi-Tiered LLM Service with Circuit Breaker

Scenario

Your application uses a primary LLM (e.g., GPT-4) for high-quality outputs but needs to degrade gracefully. You must also prevent cascading failures if the provider's service becomes completely unavailable.

How to Execute

1. Create a `TieredLLMService` that holds references to a primary (e.g., OpenAI GPT-4) and secondary (e.g., Azure OpenAI GPT-3.5) client.
2. Integrate `pybreaker` to wrap calls to each tier, setting separate failure thresholds and recovery timeouts.
3. In the `call` method, attempt primary. If a `CircuitBreakerError` or persistent error occurs, attempt secondary. If both fail, execute final fallback logic (e.g., query a database for a cached answer).
4. Expose health check endpoints for each circuit breaker for monitoring.
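The control flow of these steps might look like the sketch below. It uses a compact stand-in breaker (`TierBreaker`) rather than `pybreaker`, treats each tier as a plain callable, and models the cache as a dictionary; all of these names and defaults are assumptions for illustration.

```python
import time


class TierBreaker:
    """Compact stand-in for a per-tier circuit breaker (not pybreaker)."""

    def __init__(self, fail_max=3, reset_timeout=60.0, clock=time.monotonic):
        self.fail_max = fail_max
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def is_open(self):
        return (self.opened_at is not None
                and self.clock() - self.opened_at < self.reset_timeout)

    def record(self, ok):
        if ok:
            self.failures, self.opened_at = 0, None
            return
        self.failures += 1
        if self.failures >= self.fail_max:
            self.opened_at = self.clock()


class TieredLLMService:
    def __init__(self, tiers, cache):
        self.tiers = tiers  # ordered list of (name, callable(prompt), TierBreaker)
        self.cache = cache  # dict-like store of previously good answers

    def call(self, prompt):
        for name, client, breaker in self.tiers:
            if breaker.is_open():
                continue  # fail fast: skip a tier whose circuit is open
            try:
                answer = client(prompt)
            except Exception:  # a real service would catch provider errors only
                breaker.record(ok=False)
                continue
            breaker.record(ok=True)
            return name, answer
        # All tiers failed or are open: final fallback is the cached answer.
        return "cache", self.cache.get(prompt, "Service unavailable.")

    def health(self):
        """Per-tier breaker state, suitable for a health check endpoint."""
        return {name: "open" if b.is_open() else "closed"
                for name, _, b in self.tiers}
```

Returning the tier name with the answer lets callers and dashboards see how often responses are served in degraded mode.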

Advanced

Project

Chaos-Test a Production LLM Pipeline

Scenario

You lead a team with a critical, multi-step LLM pipeline (e.g., for legal document analysis). You must prove its resilience before a major launch by systematically injecting faults.

How to Execute

1. Instrument the pipeline with OpenTelemetry for full tracing.
2. Use a service mesh (e.g., Istio) or a fault injection library (e.g., Chaos Toolkit) to define experiments: inject 500ms latency on LLM API endpoints, simulate 429 rate limits, and force model timeouts.
3. Run the experiments in a staging environment while monitoring dashboards (error budgets, retry counts, latency).
4. Analyze results to identify single points of failure, tune retry parameters, and add missing fallback paths. Document findings and update runbooks.
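Before running mesh-level experiments, the same faults can be rehearsed in process with a small wrapper around the LLM call. This is a sketch of application-level fault injection, not Istio or Chaos Toolkit configuration; `FaultInjector` and `InjectedRateLimit` are hypothetical names, and the rates and latency are assumed experiment parameters.

```python
import random
import time


class InjectedRateLimit(Exception):
    """Hypothetical stand-in for an HTTP 429 from the provider."""


class FaultInjector:
    """Wraps an LLM call and injects latency and rate-limit errors."""

    def __init__(self, target, error_rate=0.0, latency_s=0.0,
                 rng=random.random, sleep=time.sleep):
        self.target = target          # callable(prompt) -> str
        self.error_rate = error_rate  # probability of an injected 429
        self.latency_s = latency_s    # extra delay added to every call
        self.rng = rng
        self.sleep = sleep
        self.injected_errors = 0

    def __call__(self, prompt):
        if self.latency_s:
            self.sleep(self.latency_s)
        if self.rng() < self.error_rate:
            self.injected_errors += 1
            raise InjectedRateLimit("injected 429")
        return self.target(prompt)
```

For example, `FaultInjector(real_call, error_rate=0.3, latency_s=0.5)` (with `real_call` being your own client function) approximates the 500ms latency and 429 experiments from step 2 in a unit test, where the retry and fallback paths can be asserted directly.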

Tools & Frameworks

Software & Platforms

tenacity (Python), pybreaker (Python), OpenTelemetry, Prometheus & Grafana

`tenacity` is a widely used Python library for implementing robust retry logic with decorators. `pybreaker` provides the circuit breaker pattern to fail fast. `OpenTelemetry` is for distributed tracing of LLM call chains. `Prometheus` (metrics) + `Grafana` (dashboards) are used to monitor error rates and latency SLOs.

Cloud & Infrastructure

AWS API Gateway / Azure API Management, Istio Service Mesh, AWS Lambda / Azure Functions

Cloud API gateways can handle initial retries and rate limiting at the edge. A service mesh like Istio enables advanced fault injection and retries at the infrastructure level. Serverless functions are common deployment targets that require careful timeout and error handling configuration.

Interview Questions

Answer Strategy

The candidate must demonstrate a systematic, tiered approach that goes beyond simple retries. A strong answer outlines specific layers: 1) Infrastructure Layer (circuit breaker, timeout controls), 2) Retry Layer (exponential backoff with jitter, retrying only specific error codes), 3) Fallback Layer (tiered model downgrade, cached responses, human-in-the-loop escalation), and 4) User Experience Layer (clear status communication, offline mode). The strategy should be justified by error type (e.g., don't retry a 401, but do retry a 429).
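The "justify by error type" point can be shown with a small classifier. The set of retryable codes below is a common convention and an assumption of this sketch; individual providers document their own error semantics, so check them before relying on it. `classify` and its return labels are hypothetical names.

```python
# Status codes usually worth retrying: timeouts, rate limits, transient 5xx.
RETRYABLE_STATUS = {408, 429, 500, 502, 503, 504}


def classify(status_code):
    """Map an HTTP status code to a handling decision."""
    if 200 <= status_code < 300:
        return "ok"
    if status_code in RETRYABLE_STATUS:
        return "retry"      # transient: back off with jitter, then retry
    return "fail_fast"      # e.g. 400, 401, 403: retrying will not help
```

A 401 falls into `fail_fast` because the same credentials will be rejected again, while a 429 falls into `retry` because the limit is expected to clear after a wait.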

Answer Strategy

This question tests debugging skills and understanding of non-idempotent operations. The answer should identify two potential issues: 1) Lack of jitter causing a thundering herd (all clients retrying at once), and 2) Retrying requests that are not idempotent (the model actually processed the first request but the client didn't get the response). The fix involves adding jitter to backoff and implementing idempotency keys (e.g., a unique request ID sent to the API) to prevent duplicate work. Diagnosis would involve analyzing retry logs and tracing requests.
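The idempotency-key fix can be illustrated with a toy client and server. Whether a given LLM provider honors such a key is an assumption to verify against its documentation; `send_with_idempotency` and `DedupingServer` are hypothetical, and backoff between attempts is omitted for brevity.

```python
import uuid


def send_with_idempotency(send, payload, max_attempts=3):
    """Retry on timeout, reusing ONE request ID so the server can dedupe."""
    request_id = str(uuid.uuid4())  # generated once, reused on every retry
    for attempt in range(max_attempts):
        try:
            return send(payload, request_id)
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise


class DedupingServer:
    """Toy server: processes each request ID once and replays the result."""

    def __init__(self):
        self.results = {}
        self.work_done = 0
        self.drop_next_response = True  # simulate one lost response

    def handle(self, payload, request_id):
        if request_id not in self.results:
            self.work_done += 1
            self.results[request_id] = f"processed:{payload}"
        if self.drop_next_response:
            self.drop_next_response = False
            raise TimeoutError("response lost in transit")
        return self.results[request_id]
```

Here the first response is lost after the work has already been done, which is the scenario described above; because the retry carries the same request ID, the server replays the stored result instead of processing the payload a second time.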