AI Semantic Search Engineer
An AI Semantic Search Engineer designs and builds search systems that understand intent and meaning rather than mere keywords, lev…
Skill Guide
The systematic process of segmenting documents into optimal units (chunks) and normalizing text content to maximize the precision and recall of information retrieval systems, particularly in RAG and search pipelines.
Scenario
You are given 50 technical FAQ documents in .txt format. The goal is to prepare them for a simple vector search Q&A system.
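For a corpus this small and clean, a reasonable first pass is to normalize each file and cut it into fixed-size, overlapping windows. The sketch below uses only the standard library; the function names and the default `size` and `overlap` values are illustrative assumptions, not settings from this guide, and word counts stand in for real token counts.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Unicode-normalize, unify line endings, and collapse extra whitespace."""
    text = unicodedata.normalize("NFKC", text)           # e.g. non-breaking space -> space
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    text = re.sub(r"[ \t]+", " ", text)                  # collapse runs of spaces/tabs
    text = re.sub(r" ?\n ?", "\n", text)                 # trim spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)               # keep at most one blank line
    return text.strip()


def chunk_words(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into windows of `size` words, each sharing `overlap` words with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):                   # last window reached the end
            break
    return chunks
```

Typical use would be `chunk_words(normalize(raw_text))` per file. The overlap reduces the chance that an answer is cut in half at a window boundary, at the cost of some duplicated text in the index.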
Scenario
A 100-page PDF technical manual with sections, subsections, bullet points, tables, and diagrams needs to be ingested for a high-precision support chatbot.
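One way to respect the manual's hierarchy is to split on headings and carry the full heading path with each section. The sketch below assumes the PDF has already been extracted to text with Markdown-style `#` headings (an assumption; the extraction step itself is not shown), and the `Section` type and `split_by_headings` name are hypothetical.

```python
import re
from dataclasses import dataclass, field

# Assumed heading syntax: one to six '#' characters followed by a title.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


@dataclass
class Section:
    path: tuple[str, ...]                      # heading trail, e.g. ("Install", "Linux")
    lines: list[str] = field(default_factory=list)

    @property
    def text(self) -> str:
        return "\n".join(self.lines).strip()


def split_by_headings(doc: str) -> list[Section]:
    """Return one Section per heading, each tagged with its ancestor headings."""
    sections: list[Section] = []
    stack: list[tuple[int, str]] = []          # open headings as (level, title)
    current = Section(path=())
    for line in doc.splitlines():
        match = HEADING.match(line)
        if match is None:
            current.lines.append(line)
            continue
        if current.text:                       # flush the section we were building
            sections.append(current)
        level = len(match.group(1))
        while stack and stack[-1][0] >= level: # close siblings and deeper levels
            stack.pop()
        stack.append((level, match.group(2)))
        current = Section(path=tuple(title for _, title in stack))
    if current.text:
        sections.append(current)
    return sections
```

Storing the heading path alongside each chunk lets the retriever filter by section and lets the prompt show the model where a passage sits in the manual.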
Scenario
Your organization has a knowledge base of PDFs, Confluence pages, Slack exports, and Jira tickets. The retrieval system must serve multiple use cases: precise technical Q&A and broad thematic research.
Use `unstructured` or `docling` for complex, real-world documents (PDFs, HTML) to get structured elements. Use `langchain` or `tiktoken`-based splitters for precise, token-aware chunking of clean text.
Use SBERT or commercial APIs to compute embeddings for semantic chunking and retrieval. Use frameworks like RAGAS or DeepEval to evaluate the pipeline programmatically, with metrics such as context precision for retrieval quality and faithfulness and answer relevance for the generated answer.
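The idea behind semantic chunking can be sketched as: embed consecutive sentences and start a new chunk wherever the similarity between neighbours drops. In the sketch below, `toy_embed` is a bag-of-words stand-in used only so the example runs without dependencies; in practice it would be replaced by an SBERT model or an embeddings API. The `threshold` value and all function names are assumptions.

```python
import math
import re
from collections import Counter


def toy_embed(sentence: str) -> Counter:
    """Stand-in embedding: word counts. Replace with a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(sentences: list[str], threshold: float = 0.2, embed=toy_embed) -> list[str]:
    """Group adjacent sentences; break where neighbour similarity falls below threshold."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    groups = [[sentences[0]]]
    for prev, cur, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev, cur) < threshold:
            groups.append([sentence])          # topic shift: open a new chunk
        else:
            groups[-1].append(sentence)
    return [" ".join(g) for g in groups]
```

With a real embedding model the threshold usually needs tuning per corpus, since similarity scores between dense vectors are distributed differently from the word-overlap scores used here.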
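Recursive splitting tries coarse separators first (paragraphs, then lines, then sentences) and falls back to finer ones only when a piece is still too large, so chunks follow document structure while staying under a size limit. Hybrid chunking uses rules for clean structures (headings) and semantics for dense paragraphs. Metadata enrichment (tags, hierarchy) is critical for filtering and for giving the LLM context about where a chunk came from.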
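A minimal sketch of recursive splitting plus metadata enrichment, using only the standard library. It is a simplified illustration rather than any particular library's splitter: the separator list, the character-based size limit, and the metadata field names are assumptions, and the separator is dropped wherever a split occurs.

```python
def recursive_split(text: str, max_chars: int = 500,
                    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " ")) -> list[str]:
    """Split on the coarsest separator that works; recurse with finer ones for oversized parts."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:                          # nothing left to split on: hard cut
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    sep, finer = separators[0], separators[1:]
    parts = text.split(sep)
    if len(parts) == 1:                         # separator absent: try the next one
        return recursive_split(text, max_chars, finer)
    chunks, buf = [], ""
    for part in parts:
        candidate = part if not buf else buf + sep + part
        if len(candidate) <= max_chars:
            buf = candidate                     # keep merging small neighbours
            continue
        if buf:
            chunks.append(buf)
            buf = ""
        if len(part) <= max_chars:
            buf = part
        else:
            chunks.extend(recursive_split(part, max_chars, finer))
    if buf:
        chunks.append(buf)
    return [c for c in chunks if c.strip()]


def enrich(chunks: list[str], source: str, section: str) -> list[dict]:
    """Attach filterable metadata to each chunk (field names are illustrative)."""
    return [
        {"text": c, "metadata": {"source": source, "section": section, "chunk_index": i}}
        for i, c in enumerate(chunks)
    ]
```

Merging small neighbours back together keeps chunks close to the size limit instead of producing many tiny fragments, and the metadata makes it possible to filter by source or section at query time.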
Answer Strategy
The candidate must demonstrate a systematic debugging framework. They should describe inspecting the retrieved context directly for a failing query: checking whether each chunk is relevant, whether the information is complete, and whether the necessary data was chunked together or split across boundaries. A strong answer mentions evaluation metrics (e.g., context recall) and tracing tools like LangSmith.
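The two checks described above can be sketched as follows. This is a simplified stand-in: frameworks such as RAGAS typically judge whether a fact is supported with an LLM, whereas this sketch uses case-insensitive substring matching, and the function names and sample strings are hypothetical.

```python
def _norm(s: str) -> str:
    """Lowercase and collapse whitespace so matching ignores formatting."""
    return " ".join(s.lower().split())


def context_recall(expected_facts: list[str], retrieved_chunks: list[str]) -> float:
    """Fraction of expected facts that appear intact inside at least one retrieved chunk."""
    if not expected_facts:
        raise ValueError("expected_facts must not be empty")
    chunks = [_norm(c) for c in retrieved_chunks]
    hits = sum(any(_norm(f) in c for c in chunks) for f in expected_facts)
    return hits / len(expected_facts)


def split_across_boundary(fact: str, ordered_chunks: list[str]) -> bool:
    """True if no single chunk holds the fact but two adjacent chunks together do."""
    chunks = [_norm(c) for c in ordered_chunks]
    f = _norm(fact)
    if any(f in c for c in chunks):
        return False
    return any(f in chunks[i] + " " + chunks[i + 1] for i in range(len(chunks) - 1))


# Hypothetical example: the reset instruction was cut in half by the chunker.
chunks = ["To reset the device, hold the power",
          "button for ten seconds. The LED blinks twice."]
facts = ["hold the power button for ten seconds", "LED blinks twice"]
print(context_recall(facts, chunks))            # 0.5
print(split_across_boundary(facts[0], chunks))  # True
```

A low recall score combined with a boundary hit points to the chunking step rather than the embedding model or the retriever's ranking.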
Answer Strategy
This tests the ability to handle heterogeneous data. The strategy should involve treating tables as distinct, structured chunks with clear row/column headers preserved as metadata. For narrative text, use semantic or recursive chunking. The key is ensuring that queries about a specific table metric can retrieve the exact table chunk and that questions about trends can retrieve the relevant analytical paragraph.
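Treating a table as its own chunk might look like the sketch below: each row is serialized with its column headers so the text is self-describing, and the headers and row labels are also stored as metadata. The function name, metadata keys, and the benchmark figures are all hypothetical.

```python
def table_to_chunk(title: str, header: list[str], rows: list[list[str]], source: str) -> dict:
    """Serialize a table into one chunk, repeating column names next to every value."""
    lines = [f"Table: {title}"]
    for row in rows:
        label, values = row[0], row[1:]
        cells = "; ".join(f"{col}: {val}" for col, val in zip(header[1:], values))
        lines.append(f"{label} | {cells}")
    return {
        "text": "\n".join(lines),
        "metadata": {
            "type": "table",                    # lets the retriever filter tables vs. narrative
            "title": title,
            "columns": header,
            "row_labels": [row[0] for row in rows],
            "source": source,
        },
    }


# Hypothetical data for illustration only.
chunk = table_to_chunk(
    title="Benchmarks",
    header=["Model", "Latency (ms)", "Throughput (req/s)"],
    rows=[["A", "12", "800"], ["B", "20", "500"]],
    source="report.pdf",
)
print(chunk["text"])
```

Repeating the column name beside every value means a query such as "latency of model B" can match the row text directly, while narrative paragraphs are chunked separately with semantic or recursive splitting.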