RAG Engineer
A RAG Engineer designs and builds Retrieval-Augmented Generation pipelines that ground large language model outputs in authoritati…
Skill Guide
The systematic process of parsing diverse document formats into machine-readable text and partitioning them into contextually meaningful segments using linguistic, structural, or AI-driven rules to optimize retrieval and comprehension by Large Language Models (LLMs).
Scenario
You have a set of 10 internal policy PDFs that need to be searchable via a basic vector search script.
Scenario
You need to ingest a mix of complex HTML technical documentation and Word documents containing tables, ensuring the tables remain coherent in the chunks.
Scenario
Build a production-grade pipeline for a legal firm where an AI agent reviews a new contract, decides the best chunking strategy (e.g., by clause), generates a high-level summary chunk, and indexes it for both vector and keyword search.
Document parsing tools are essential for converting heterogeneous file types (.pdf, .docx, .pptx, images) into clean text, handling OCR for scanned documents, and extracting structured data such as tables.
Frameworks that provide pre-built logic for recursive splitting, character-based splitting, and semantic chunking, allowing developers to focus on strategy rather than boilerplate text manipulation.
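To make the idea of recursive splitting concrete, here is a minimal pure-Python sketch of what such frameworks do internally. It is not any particular library's implementation: the function name `recursive_split`, the character-based `chunk_size`, and the whitespace-only separator list are all assumptions chosen for illustration.

```python
from typing import List, Tuple


def recursive_split(
    text: str,
    chunk_size: int = 200,
    separators: Tuple[str, ...] = ("\n\n", "\n", " "),
) -> List[str]:
    """Split text into chunks of at most chunk_size characters.

    Coarse separators (paragraph breaks) are tried first; finer ones
    (line breaks, spaces) are used only for pieces that are still too long.
    """
    text = text.strip()
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separator left to try: fall back to a hard character cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    pieces = [p for p in text.split(sep) if p.strip()]
    if len(pieces) == 1:
        # This separator does not occur; try the next, finer one.
        return recursive_split(text, chunk_size, finer)

    chunks: List[str] = []
    current = ""
    for piece in pieces:
        candidate = current + sep + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces together
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # A single piece is too long: recurse with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks
```

Production splitters typically measure size in tokens rather than characters and add overlap between neighbouring chunks; both are left out here to keep the core recursion visible.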
Vector stores are where the processed chunks and their embeddings are kept. The choice depends on scale (FAISS for local use, Pinecone for managed cloud), filtering needs, and hybrid search capabilities.
Answer Strategy
Use a hybrid strategy. Implement a two-tiered approach: 1) 'Macro-chunks' (by chapter or major section) for thematic questions, using a large chunk size (1000+ tokens). 2) 'Micro-chunks' (by paragraph or sentence) for factual questions, using a small, overlapping chunk size (200-300 tokens). Index both with different metadata tags (e.g., 'chunk_type: thematic' vs 'chunk_type: factual'). The retriever can then filter or combine results based on query classification.
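The two-tier idea can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not a reference implementation: the `Chunk` class and the helper names are hypothetical, whitespace-separated words stand in for tokens, and each section is stored whole as its macro-chunk without enforcing a size.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Chunk:
    text: str
    metadata: Dict[str, str]


def sliding_windows(words: List[str], size: int, overlap: int) -> List[List[str]]:
    """Return overlapping windows of at most `size` words."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    windows: List[List[str]] = []
    for start in range(0, len(words), step):
        windows.append(words[start:start + size])
        if start + size >= len(words):
            break  # the last window already reaches the end
    return windows


def build_two_tier_index(
    sections: Dict[str, str],
    micro_size: int = 250,     # assumed word count standing in for tokens
    micro_overlap: int = 50,
) -> List[Chunk]:
    """Create one thematic macro-chunk per section plus overlapping micro-chunks."""
    chunks: List[Chunk] = []
    for title, body in sections.items():
        chunks.append(Chunk(body, {"section": title, "chunk_type": "thematic"}))
        for window in sliding_windows(body.split(), micro_size, micro_overlap):
            chunks.append(
                Chunk(" ".join(window), {"section": title, "chunk_type": "factual"})
            )
    return chunks


def filter_by_type(chunks: List[Chunk], chunk_type: str) -> List[Chunk]:
    """Metadata filter a retriever could apply after classifying the query."""
    return [c for c in chunks if c.metadata["chunk_type"] == chunk_type]
```

In a real pipeline the `chunk_type` filter would usually be pushed down to the vector store's metadata filtering rather than applied in Python, and the query classifier that chooses the tier is a separate component not shown here.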
Answer Strategy
Test retrieval quality first. I'd create a 'golden test set' of queries and expected source paragraphs. Then I'd analyze the retrieval step in isolation: do the top-K chunks returned actually contain the correct information? If not, I'd examine the chunks: 1) Check whether the relevant information is split across two chunks (fix with more overlap). 2) Check whether chunks are too large, diluting the key information (fix with a smaller chunk size). 3) Check whether metadata or context is lost (fix by prepending section headers to each chunk). I'd iterate using precision/recall metrics on my test set before touching the LLM.
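A golden-set evaluation of the retrieval step alone might look like the sketch below. The function name `evaluate_retrieval`, the chunk IDs, and the stub retriever are hypothetical; in practice `retrieve` would wrap whatever vector or keyword search the pipeline uses and return chunk IDs in ranked order.

```python
from typing import Callable, Dict, List, Set, Tuple


def evaluate_retrieval(
    golden: Dict[str, Set[str]],
    retrieve: Callable[[str], List[str]],
    k: int = 5,
) -> Tuple[Dict[str, float], List[str]]:
    """Average precision@k and recall@k over a golden test set.

    golden maps each query to the set of chunk IDs that should be retrieved.
    Also returns the queries with at least one missing relevant chunk,
    so their chunks can be inspected by hand.
    """
    if not golden:
        raise ValueError("golden test set is empty")
    precisions, recalls, missed = [], [], []
    for query, relevant in golden.items():
        if not relevant:
            raise ValueError(f"no expected chunks for query: {query!r}")
        top_k = retrieve(query)[:k]
        hits = len(set(top_k) & relevant)
        precisions.append(hits / k)
        recalls.append(hits / len(relevant))
        if hits < len(relevant):
            missed.append(query)
    n = len(golden)
    metrics = {"precision@k": sum(precisions) / n, "recall@k": sum(recalls) / n}
    return metrics, missed


# Stub retriever with canned results, standing in for a real search call.
_FAKE_RESULTS = {
    "What is the notice period?": ["c1", "c7", "c2"],
    "Who approves travel expenses?": ["c3", "c4", "c5"],
}


def fake_retrieve(query: str) -> List[str]:
    return _FAKE_RESULTS.get(query, [])


golden_set = {
    "What is the notice period?": {"c1", "c2"},
    "Who approves travel expenses?": {"c9"},
}
metrics, missed_queries = evaluate_retrieval(golden_set, fake_retrieve, k=3)
```

The `missed_queries` list is the starting point for the manual checks above: for each missed query, look at whether the expected passage was split, diluted in an oversized chunk, or stripped of its header context.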