AI Retrieval Systems Engineer
An AI Retrieval Systems Engineer designs, builds, and optimizes the search and retrieval pipelines that power Retrieval-Augmented …
Skill Guide
The architectural discipline of designing distributed data storage and access systems to serve millions of queries per second with single-digit millisecond latency across massive datasets.
Scenario
Design and implement a distributed key-value store that can handle 10,000 writes/sec and 50,000 reads/sec with 99% of requests under 10ms. The dataset is 100GB.
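One common starting point for a scenario like this is to partition keys across nodes with consistent hashing, so that adding or removing a node moves only a fraction of the keys. The following is a minimal sketch under stated assumptions, not a prescribed solution: the `HashRing` class, the virtual-node count, and the replica count of 3 are all hypothetical choices made for illustration.

```python
import bisect
import hashlib


class HashRing:
    """Consistent-hash ring with virtual nodes (illustrative sketch only)."""

    def __init__(self, nodes, vnodes=64):
        # Each physical node is placed on the ring `vnodes` times to smooth load.
        self._ring = []
        for node in nodes:
            for i in range(vnodes):
                self._ring.append((self._hash(f"{node}#{i}"), node))
        self._ring.sort()
        self._hashes = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key):
        # MD5 is used here only as a stable, well-distributed hash, not for security.
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

    def owners(self, key, replicas=3):
        """Return up to `replicas` distinct nodes, walking clockwise from the key."""
        start = bisect.bisect(self._hashes, self._hash(key))
        found = []
        for offset in range(len(self._ring)):
            node = self._ring[(start + offset) % len(self._ring)][1]
            if node not in found:
                found.append(node)
                if len(found) == replicas:
                    break
        return found
```

Calling `HashRing(["node-a", "node-b", "node-c", "node-d"]).owners("user:42")` returns the three nodes that would hold replicas of that key. A real design would still need replication, failure detection, and a storage engine on each node.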
Scenario
Build a system that, given a user ID, retrieves the top 100 most relevant items from a catalog of 1 billion items within 50ms. Relevance is a function of user features and item embeddings.
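The scoring step in this scenario can be sketched as an exact, brute-force baseline. This assumes relevance is a plain dot product between a user vector and each item embedding, which the scenario does not specify; the function names are hypothetical. A full scan like this is not feasible for 1 billion items within 50ms, so in practice it would be replaced by an approximate nearest neighbor index, but it is useful as a correctness reference when evaluating one.

```python
import heapq


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def top_k(user_vec, item_embeddings, k=100):
    """Return the ids of the k highest-scoring items (exact, O(n log k))."""
    scored = (
        (dot(user_vec, emb), item_id)
        for item_id, emb in item_embeddings.items()
    )
    # heapq.nlargest keeps only k candidates in memory while scanning.
    return [item_id for _, item_id in heapq.nlargest(k, scored)]
```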
Scenario
You are the lead architect for a new cloud database service that must support document, graph, and key-value workloads with a 99.999% availability SLA and consistent sub-10ms P99 latency for reads across three continents.
Cassandra/ScyllaDB for linearly scalable, high-write-throughput workloads with tunable consistency. Redis Cluster for low-latency caching and ephemeral data. Kafka as the backbone for decoupling ingestion from serving systems.
Elasticsearch for full-text search and complex aggregations at scale. Direct Lucene use for building custom, high-performance indexes when off-the-shelf solutions are too slow or bloated.
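FAISS for high-performance, in-memory approximate nearest neighbor (ANN) search, scaling to billions of vectors with quantized indexes. Milvus, Weaviate, or Pinecone for vector databases that add filtering, scaling, and persistence (Pinecone is a managed service; Milvus and Weaviate can also be self-hosted).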
eBPF for kernel and network stack tracing without modifying application code. perf for CPU profiling and flame graphs. Wireshark for deep packet analysis. Chaos engineering tools for validating system resilience under partial failures.
Answer Strategy
Use a hybrid of fan-out-on-write and fan-out-on-read. Explain that pure fan-out-on-write (precomputing each user's timeline) is too expensive for accounts with millions of followers. For roughly 99% of users, fan out writes to a pre-materialized timeline store (e.g., Redis sorted sets). For the top 1% of high-follower accounts (celebrities), use fan-out-on-read: fetch their latest tweets at query time and merge them into the reader's timeline. Mention sharding timelines by user ID, using a message queue (Kafka) for asynchronous fan-out, and adding a cache layer for hot timelines.
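The hybrid strategy described above can be sketched in memory as follows. This is an illustration under stated assumptions rather than a production design: `TimelineService`, `CELEBRITY_THRESHOLD`, and the dictionary-backed stores are hypothetical stand-ins for the timeline store, follower graph, and queue mentioned in the answer.

```python
import heapq
from collections import defaultdict

CELEBRITY_THRESHOLD = 3  # hypothetical cutoff; real systems would use a far larger value


class TimelineService:
    def __init__(self):
        self.followers = defaultdict(set)   # author -> users following them
        self.following = defaultdict(set)   # user -> authors they follow
        self.posts = defaultdict(list)      # author -> [(timestamp, post_id)]
        self.timelines = defaultdict(list)  # user -> materialized [(timestamp, post_id)]

    def follow(self, user, author):
        self.followers[author].add(user)
        self.following[user].add(author)

    def is_celebrity(self, author):
        return len(self.followers[author]) >= CELEBRITY_THRESHOLD

    def publish(self, author, timestamp, post_id):
        self.posts[author].append((timestamp, post_id))
        if not self.is_celebrity(author):
            # Fan-out on write: push the post into every follower's timeline.
            for follower in self.followers[author]:
                self.timelines[follower].append((timestamp, post_id))

    def read(self, user, limit=10):
        # A set removes duplicates if an author crossed the threshold after posting.
        candidates = set(self.timelines[user])
        for author in self.following[user]:
            if self.is_celebrity(author):
                # Fan-out on read: pull celebrity posts at query time.
                candidates.update(self.posts[author])
        return [post_id for _, post_id in heapq.nlargest(limit, candidates)]
```

The trade-off is visible in `publish` and `read`: ordinary authors pay a write cost proportional to their follower count, while celebrity posts add a small merge cost to each read instead.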
Answer Strategy
This tests the candidate's methodical, layered diagnostic process. A strong answer identifies that the problem is in the tail (P99), points to specific tools (eBPF for lock contention, perf for CPU, network traces), and considers non-obvious causes such as GC pauses, noisy neighbors, or a single slow disk. Sample: 'First, I'd check if the spike correlates with a specific time, tenant, or query pattern. I'd use bpftrace to check for lock contention in the storage engine. Simultaneously, I'd inspect disk I/O latency metrics per device to isolate a hardware issue. Finally, I'd analyze query plans for the affected requests to see if an index is missing or a full table scan is occurring on a hot partition.'
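The first step in the sample answer, checking whether the tail correlates with a tenant, device, or query pattern, amounts to computing P99 per group rather than globally. Below is a minimal sketch assuming latency samples are available as dictionaries with a `latency_ms` field; that record shape and the function names are hypothetical.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct in 1..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def p99_by(records, key):
    """Group latency samples by `key` and return the P99 of each group."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec["latency_ms"])
    return {group: percentile(vals, 99) for group, vals in groups.items()}
```

If one device or tenant shows a P99 far above the rest while medians look similar, that points toward a localized cause such as a slow disk or hot partition rather than a system-wide regression.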