AI Long-Context Systems Engineer
An AI Long-Context Systems Engineer designs and builds production systems that exploit large context windows (128K-10M+ tokens) in…
Skill Guide
Using Python's `asyncio` together with `httpx` or `aiohttp` to issue concurrent LLM API calls, keeping I/O non-blocking so that high-throughput, low-latency pipelines can integrate models such as GPT, Claude, or open-source LLMs.
Scenario
You have a CSV file with 1000 user questions. You need to send each question to the OpenAI API and save responses without exceeding rate limits.
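One way to approach this scenario is to cap the number of in-flight calls with an `asyncio.Semaphore`. The sketch below is a minimal illustration, not a definitive implementation: `ask_model` is a hypothetical stand-in for the real API call, the `question` column name and the `MAX_IN_FLIGHT` value are assumptions, and the CSV is read from a string so the example stays self-contained.

```python
import asyncio
import csv
import io

MAX_IN_FLIGHT = 5  # assumed cap on concurrent calls; tune to your account


async def ask_model(question: str) -> str:
    """Hypothetical stand-in for the real API call (e.g. made with httpx)."""
    await asyncio.sleep(0.01)  # simulate network latency
    return f"answer to: {question}"


async def ask_limited(sem: asyncio.Semaphore, question: str) -> tuple:
    async with sem:  # at most MAX_IN_FLIGHT calls are in flight at once
        return question, await ask_model(question)


async def answer_all(csv_text: str) -> list:
    # Assumes the CSV has a header row with a "question" column.
    questions = [row["question"] for row in csv.DictReader(io.StringIO(csv_text))]
    sem = asyncio.Semaphore(MAX_IN_FLIGHT)
    # gather returns results in input order, so answers line up with questions
    return await asyncio.gather(*(ask_limited(sem, q) for q in questions))


def to_csv(pairs) -> str:
    """Serialize (question, answer) pairs back to CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["question", "answer"])
    writer.writerows(pairs)
    return out.getvalue()
```

Note that a semaphore bounds concurrency, not requests per minute. If the provider enforces a per-minute quota, a rate limiter such as a token bucket would be needed on top of it.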
Scenario
Build a service that accepts a prompt and sends it to OpenAI. If the OpenAI call fails or takes longer than 2 s, automatically fall back to Anthropic's Claude. Log every attempt and the final response.
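The fallback logic in this scenario can be sketched with `asyncio.wait_for`, which cancels the primary call once the timeout elapses. This is a minimal sketch under stated assumptions: `primary` and `fallback` are hypothetical coroutine functions that wrap the two providers' clients, and the logger name is arbitrary.

```python
import asyncio
import logging

log = logging.getLogger("llm_fallback")  # arbitrary logger name

PRIMARY_TIMEOUT_S = 2.0  # the 2-second budget from the scenario


async def complete_with_fallback(prompt, primary, fallback,
                                 timeout=PRIMARY_TIMEOUT_S):
    """Try `primary(prompt)`; on error or timeout, use `fallback(prompt)`.

    `primary` and `fallback` are hypothetical async callables that return
    the completion text. Returns (provider_label, text).
    """
    try:
        text = await asyncio.wait_for(primary(prompt), timeout=timeout)
        log.info("primary succeeded")
        return "primary", text
    except asyncio.TimeoutError:
        log.warning("primary timed out after %.1fs, falling back", timeout)
    except Exception as exc:  # any provider or network error triggers fallback
        log.warning("primary failed (%r), falling back", exc)

    text = await fallback(prompt)  # errors here propagate to the caller
    log.info("fallback succeeded")
    return "fallback", text
```

The fallback call is deliberately left unguarded here, so a failure of both providers surfaces as an exception. A production service would likely add its own timeout and error handling around it.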
Scenario
You need to process 100k+ prompts daily from a web app. The system must dynamically batch requests to minimize token cost while respecting per-minute rate limits and handling streaming responses.
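The dynamic batching part of this scenario is often handled with a micro-batcher that flushes either when a batch is full or when a short wait expires. The following is a minimal sketch of that idea, assuming prompts arrive on an `asyncio.Queue`. The function name `collect_batch` and its parameters are illustrative, and rate limiting and streaming are left out.

```python
import asyncio


async def collect_batch(queue: asyncio.Queue, max_items: int,
                        max_wait_s: float) -> list:
    """Return up to `max_items` queued prompts, waiting at most `max_wait_s`
    after the first one arrives."""
    batch = [await queue.get()]  # block until there is at least one prompt
    loop = asyncio.get_running_loop()
    deadline = loop.time() + max_wait_s
    while len(batch) < max_items:
        remaining = deadline - loop.time()
        if remaining <= 0:
            break  # waited long enough; flush a partial batch
        try:
            item = await asyncio.wait_for(queue.get(), timeout=remaining)
        except asyncio.TimeoutError:
            break  # no more prompts arrived in time
        batch.append(item)
    return batch
```

A worker would call `collect_batch` in a loop and send each batch downstream. Larger `max_items` and `max_wait_s` values favor fuller batches, while smaller ones favor latency.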
`asyncio` is the core library. `httpx` and `aiohttp` handle async HTTP. `tenacity` manages retries with backoff. Celery backed by Redis distributes tasks across workers. `Pydantic` validates API request/response schemas.
Patterns for building resilient, high-throughput systems. Use queues to decouple ingestion from API calls. Circuit breakers prevent cascading failures. Token bucket algorithms smooth out burst traffic.
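Of the patterns named above, the token bucket is the easiest to show in a few lines. The sketch below is a single-process, in-memory version. The class name, parameters, and the choice to hold the lock while waiting are assumptions made for illustration, not a reference implementation.

```python
import asyncio
import time


class TokenBucket:
    """In-memory token bucket: refills at `rate_per_s`, holds at most `capacity`."""

    def __init__(self, rate_per_s: float, capacity: float):
        self.rate = rate_per_s
        self.capacity = capacity
        self.tokens = capacity          # start full, so an initial burst is allowed
        self.updated = time.monotonic()
        self._lock = asyncio.Lock()

    def _refill(self) -> None:
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now

    async def acquire(self, cost: float = 1.0) -> None:
        """Wait until `cost` tokens are available, then consume them."""
        if cost > self.capacity:
            raise ValueError("cost exceeds bucket capacity")
        async with self._lock:  # waiters are served one at a time
            while True:
                self._refill()
                if self.tokens >= cost:
                    self.tokens -= cost
                    return
                # sleep roughly until enough tokens have accumulated
                await asyncio.sleep((cost - self.tokens) / self.rate)
```

Each request would `await bucket.acquire()` before calling the API. Bursts up to `capacity` pass immediately, after which throughput settles at `rate_per_s`. Create the bucket inside the running event loop, since `asyncio.Lock` binds to a loop at construction on older Python versions.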
Answer Strategy
Use the STAR method (Situation, Task, Action, Result). Show that you understand rate limits are enforced by the provider per account (or API key) and per model, so they are shared by every client and worker using the same credentials. Sample answer: 'I'd first add detailed logging to track request timestamps and identify burst patterns. If the rate limits are per-minute, I'd switch from a fixed semaphore, which only caps concurrency, to a dynamic rate limiter (e.g., calling `asyncio.sleep` for the delay given in the `Retry-After` response header). For cost-critical apps, I'd implement a token bucket algorithm in a shared Redis key to synchronize limits across multiple worker processes.'
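The `Retry-After` idea from the sample answer can be sketched as follows. The header carries either a number of seconds or an HTTP date, so both forms are handled. This is an illustrative sketch: `send` is a hypothetical coroutine function returning `(status, headers, body)`, and `headers` is assumed to be a plain dict with that exact key casing (real HTTP clients usually offer case-insensitive headers).

```python
import asyncio
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(headers: dict, default: float = 1.0) -> float:
    """Translate a Retry-After header into a delay in seconds."""
    value = headers.get("Retry-After")
    if value is None:
        return default
    try:
        return max(0.0, float(value))  # delay-seconds form, e.g. "3"
    except ValueError:
        pass
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return default  # unparseable header: fall back to the default delay
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    return max(0.0, (when - datetime.now(timezone.utc)).total_seconds())


async def send_with_retry(send, max_attempts: int = 3):
    """Call the hypothetical `send()` until it stops returning HTTP 429."""
    for attempt in range(1, max_attempts + 1):
        status, headers, body = await send()
        if status != 429:
            return status, body
        if attempt < max_attempts:
            await asyncio.sleep(retry_after_seconds(headers))
    raise RuntimeError(f"still rate limited after {max_attempts} attempts")
```

Honoring the server's hint this way avoids guessing a backoff interval. A fixed default is used only when the header is missing or malformed.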
Answer Strategy
This question tests your understanding of streaming APIs and frontend-backend integration. Sample answer: 'I'd use the streaming endpoint from the LLM provider (e.g., `stream=True` in OpenAI). My async generator would yield tokens as they arrive, forwarding them via a WebSocket or Server-Sent Events (SSE) to the client. The frontend would then render tokens incrementally. The async backend ensures the event loop isn't blocked while waiting for the full response, so other requests keep being served.'
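The async generator described in the sample answer might look like the sketch below, which wraps a token stream into SSE frames. It is illustrative only: `fake_token_stream` is a hypothetical stand-in for a provider's streaming response, and the JSON payload shape and `[DONE]` sentinel are conventions assumed here.

```python
import asyncio
import json
from typing import AsyncIterator


async def fake_token_stream(text: str) -> AsyncIterator[str]:
    """Hypothetical stand-in for a provider's streaming response."""
    for token in text.split(" "):
        await asyncio.sleep(0)  # yield control, as a network read would
        yield token


async def sse_events(tokens: AsyncIterator[str]) -> AsyncIterator[str]:
    """Wrap each token in a Server-Sent Events frame as soon as it arrives."""
    async for token in tokens:
        # JSON-encode so newlines inside a token cannot break SSE framing
        yield f"data: {json.dumps({'token': token})}\n\n"
    yield "data: [DONE]\n\n"  # assumed end-of-stream sentinel


async def collect(events: AsyncIterator[str]) -> list:
    """Helper for demos and tests: drain the stream into a list."""
    return [frame async for frame in events]
```

In a web framework, the `sse_events` generator would typically be handed to a streaming response with the `text/event-stream` content type, so each frame is flushed to the client as it is produced.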