AI Identity & Access Management Specialist
An AI Identity & Access Management Specialist designs, implements, and governs the authentication, authorization, and privilege fr…
Skill Guide
The design, deployment, and tuning of an intermediary proxy layer (API Gateway) that controls, secures, and optimizes traffic to Large Language Model inference endpoints by enforcing policies like request throttling, token budgets, and traffic shaping.
Scenario
You have a simple Python FastAPI endpoint simulating an LLM service (adds artificial delay). You need to protect it from being overwhelmed by a test script that sends rapid requests.
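One common way to protect such an endpoint is a token bucket. The sketch below uses only the standard library; the class name `TokenBucket` and its parameters are illustrative assumptions, not part of any gateway product. The injectable `clock` makes the behaviour easy to test.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)  # start with a full bucket
        self.updated = clock()

    def allow(self, cost=1.0):
        now = self.clock()
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

In the FastAPI handler, one would typically call `allow()` before doing any work and return HTTP 429 when it yields `False`.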
Scenario
Your company offers an LLM API with 'Free', 'Pro', and 'Enterprise' tiers. Each tier has different request-per-minute (RPM) and monthly token quotas. You need to enforce these limits at the gateway.
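A minimal sketch of per-tier enforcement follows, assuming a fixed one-minute window for RPM and a running monthly token counter. The numeric limits in `TIERS` are hypothetical placeholders (the scenario does not specify them), and a real gateway would keep these counters in a shared store rather than in process memory.

```python
from dataclasses import dataclass

# Hypothetical limits for illustration only.
TIERS = {
    "free": {"rpm": 3, "monthly_tokens": 1_000},
    "pro": {"rpm": 60, "monthly_tokens": 100_000},
    "enterprise": {"rpm": 600, "monthly_tokens": 10_000_000},
}


@dataclass
class Usage:
    window: int = -1   # index of the current one-minute window
    requests: int = 0  # requests seen in that window
    month: str = ""    # billing month, e.g. "2024-05"
    tokens: int = 0    # tokens consumed in that month


class TierLimiter:
    def __init__(self, tiers=TIERS):
        self.tiers = tiers
        self.usage = {}

    def check(self, api_key, tier, tokens, now, month):
        """Return (allowed, reason); `now` is a timestamp in seconds."""
        limits = self.tiers[tier]
        u = self.usage.setdefault(api_key, Usage())
        window = int(now // 60)
        if window != u.window:   # a new minute resets the request count
            u.window, u.requests = window, 0
        if month != u.month:     # a new month resets the token count
            u.month, u.tokens = month, 0
        if u.requests >= limits["rpm"]:
            return False, "rpm_exceeded"
        if u.tokens + tokens > limits["monthly_tokens"]:
            return False, "quota_exceeded"
        u.requests += 1
        u.tokens += tokens
        return True, "ok"
```

Separating the two reasons lets the gateway answer differently: a short retry hint for an RPM breach, an upgrade or billing message for an exhausted quota.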
Scenario
You are the architect for a global AI platform. LLM inference is expensive and runs in multiple regions. You need to route requests to the nearest region, enforce strict per-customer cost budgets (based on token usage), and gracefully degrade service (e.g., switch to a cheaper model) during regional outages or cost overruns.
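The routing decision can be sketched as a pure function. Everything here is an assumption made for illustration: the `Region` type, the model names `large` and `small`, and the 80% soft limit at which the cheaper model takes over.

```python
from dataclasses import dataclass


@dataclass
class Region:
    name: str
    latency_ms: float  # measured latency from the caller's location
    healthy: bool = True


def choose_route(regions, spent, budget,
                 primary_model="large", fallback_model="small",
                 soft_limit=0.8):
    """Return (region_name, model), or None if the request must be rejected."""
    if spent >= budget:
        return None  # hard budget cap reached
    healthy = [r for r in regions if r.healthy]
    if not healthy:
        return None  # no region can serve the request
    nearest = min(regions, key=lambda r: r.latency_ms)
    target = min(healthy, key=lambda r: r.latency_ms)
    # Degrade when failing over from the nearest region or near the budget.
    degraded = target is not nearest or spent >= soft_limit * budget
    return target.name, (fallback_model if degraded else primary_model)
```

Keeping the decision free of I/O makes it straightforward to unit test against outage and overspend cases before wiring it into a gateway.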
Use Kong for its rich plugin ecosystem (authentication, rate limiting, logging). Use Envoy for maximum performance and flexibility in cloud-native (Kubernetes) environments. Use managed cloud gateways for rapid deployment, integrated billing, and native cloud IAM.
Prometheus is standard for collecting gateway metrics (request count, latency, 4xx/5xx rates). Grafana visualizes these for dashboards and alerting. The ELK stack analyzes gateway logs for deep traffic inspection. OpenTelemetry provides a unified standard for traces and metrics across services.
Treat gateway configuration as code. Use Terraform/Pulumi to provision and manage gateway resources. In Kubernetes, use an Ingress Controller (like Kong Ingress Controller) managed via CRDs. Use GitOps tools to automate deployment of gateway configuration changes from a version-controlled repository.
Answer Strategy
Structure the answer around a systematic diagnosis: 1) Check gateway metrics to identify the error type distribution and its correlation with traffic patterns. 2) Check backend (LLM service) health and logs. 3) Analyze the current rate-limiting algorithm and settings. 4) Propose specific actions: if the backend is healthy, adjust the gateway's upstream timeout and connection limits; if it is truly overloaded, implement a more sophisticated rate limit (e.g., token-based), queue excess requests at the gateway, and have clients retry with exponential backoff. Mention using the gateway's circuit breaker feature to prevent cascading failures.
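To make the last two ideas concrete, here is a minimal sketch of a circuit breaker and a capped exponential backoff schedule. The names, thresholds, and defaults are illustrative assumptions; production gateways expose their own configuration for both.

```python
import time


class CircuitBreaker:
    """Open after repeated failures; allow a trial request after a cool-down."""

    def __init__(self, failure_threshold=3, reset_after=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow_request(self):
        if self.opened_at is None:
            return True
        # Half-open: once the cool-down has passed, let requests probe again.
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()  # (re)open the circuit


def backoff_delay(attempt, base=0.5, cap=30.0):
    """Seconds to wait before retry `attempt` (0-based), without jitter."""
    return min(cap, base * 2 ** attempt)
```

In practice clients usually add random jitter to the backoff delay so that retries from many callers do not arrive in lockstep.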
Answer Strategy
The core competency tested is designing nuanced, business-aware technical policies. A strong answer separates traffic by type: 1) Assign the batch customer a dedicated API key. 2) Configure two separate rate limit 'buckets' on the gateway: one for 'interactive' (low latency, lower RPM, higher priority) and one for 'batch' (higher latency tolerance, higher RPM allowed, but lower priority). 3) Implement queuing for the batch key that absorbs bursts. 4) Use gateway headers to communicate the job's priority to the backend, allowing it to allocate resources accordingly. 5) Agree on SLA guarantees for each bucket.
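The prioritization between the two buckets could look roughly like the sketch below, which uses a heap so that interactive requests are always dequeued before batch ones. The class name and the two traffic-class labels are assumptions for illustration.

```python
import heapq
import itertools

# Lower number is served first.
PRIORITY = {"interactive": 0, "batch": 1}


class PriorityDispatcher:
    def __init__(self):
        self._heap = []
        # Monotonic sequence number keeps FIFO order within a traffic class.
        self._seq = itertools.count()

    def submit(self, traffic_class, request):
        entry = (PRIORITY[traffic_class], next(self._seq), request)
        heapq.heappush(self._heap, entry)

    def next_request(self):
        """Pop the highest-priority request, or None if the queue is empty."""
        if not self._heap:
            return None
        _, _, request = heapq.heappop(self._heap)
        return request
```

Strict priority like this can starve batch traffic under sustained interactive load, so an SLA for the batch bucket would likely need a weighted or aging scheme instead.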