Skill Guide

Security and access control for retrieval layers (metadata filtering, document-level ACLs)

The implementation of security policies directly within the retrieval pipeline of a RAG or search system, using metadata attributes and document-level access control lists (ACLs) to filter, gate, or rank results before they are presented to the user or LLM.

This skill is critical for building enterprise-grade AI applications that handle sensitive data, as it prevents unauthorized information disclosure and ensures compliance. It directly impacts business outcomes by enabling the safe adoption of AI on proprietary or regulated data, reducing legal risk, and building trust in the system.

How to Learn Security and access control for retrieval layers (metadata filtering, document-level ACLs)

1. Understand the core components: User identity (OIDC tokens, API keys), document metadata schema, and ACL data structures (e.g., group-based, role-based).
2. Learn the fundamental query pattern: pre-filtering vs. post-filtering.
3. Practice writing basic filter expressions (e.g., SQL WHERE clauses, JSON filter objects) against a vector database.
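The difference between pre-filtering and post-filtering can be sketched with a tiny in-memory store. The records, the `dot` scoring helper, and the function names below are illustrative assumptions, not any vector database's API.

```python
from typing import Dict, List

# Hypothetical toy corpus: each record carries an embedding and metadata.
DOCS: List[Dict] = [
    {"id": "a1", "tenant_id": "acme", "vec": [1.0, 0.0]},
    {"id": "a2", "tenant_id": "acme", "vec": [0.2, 0.9]},
    {"id": "b1", "tenant_id": "beta", "vec": [0.9, 0.1]},
    {"id": "b2", "tenant_id": "beta", "vec": [0.8, 0.2]},
]


def dot(u: List[float], v: List[float]) -> float:
    """Dot product used as a stand-in similarity score."""
    return sum(x * y for x, y in zip(u, v))


def pre_filter_search(query_vec: List[float], tenant_id: str, k: int) -> List[str]:
    """Restrict candidates by metadata first, then rank only the allowed ones."""
    allowed = [d for d in DOCS if d["tenant_id"] == tenant_id]
    ranked = sorted(allowed, key=lambda d: dot(query_vec, d["vec"]), reverse=True)
    return [d["id"] for d in ranked[:k]]


def post_filter_search(query_vec: List[float], tenant_id: str, k: int) -> List[str]:
    """Rank everything, keep the top k, then drop disallowed hits (may return < k)."""
    ranked = sorted(DOCS, key=lambda d: dot(query_vec, d["vec"]), reverse=True)
    return [d["id"] for d in ranked[:k] if d["tenant_id"] == tenant_id]
```

With the query vector `[1.0, 0.0]`, tenant `acme`, and `k=2`, the pre-filter version returns both acme documents, while the post-filter version returns only one, because a document from another tenant occupied a top-k slot before being discarded.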

1. Implement a full metadata filter chain in a vector DB like Weaviate or Pinecone, combining tenant_id, department, and classification_level tags.
2. Design an ACL inheritance model for a document hierarchy (e.g., all files in a 'Project X' folder inherit project-level permissions).
3. Avoid the common mistake of performing expensive similarity search before applying a cheap metadata filter.
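The ACL inheritance idea in item 2 can be sketched as a walk up the folder tree. The paths, group names, and the rule that an explicit ACL overrides (rather than merges with) its ancestors are assumptions made for this example.

```python
from typing import Dict, Optional, Set

# Hypothetical hierarchy: node -> parent (None marks the root).
PARENT: Dict[str, Optional[str]] = {
    "project_x": None,
    "project_x/specs": "project_x",
    "project_x/specs/design.pdf": "project_x/specs",
    "project_x/budget.xlsx": "project_x",
}

# Explicit ACLs; nodes without an entry inherit from the nearest ancestor.
ACL: Dict[str, Set[str]] = {
    "project_x": {"group:project-x"},
    "project_x/budget.xlsx": {"group:finance"},  # override, does not merge
}


def effective_acl(node: str) -> Set[str]:
    """Walk up the tree until an explicit ACL is found; deny if none exists."""
    current: Optional[str] = node
    while current is not None:
        if current in ACL:
            return ACL[current]
        current = PARENT.get(current)
    return set()  # fail closed: no ACL anywhere means nobody may read


def can_read(user_groups: Set[str], node: str) -> bool:
    """A user may read a node if they share at least one group with its ACL."""
    return bool(user_groups & effective_acl(node))
```

In a retrieval setting, one plausible design is to resolve the effective ACL at ingestion time and store it on each chunk's metadata, so the check becomes an ordinary metadata filter at query time. The cost is that permission changes on a folder then require re-tagging its descendants.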

1. Architect a hybrid retrieval system that uses vector search for semantic relevance and metadata filtering for strict access control, optimizing for latency.
2. Design a system where the access control policy itself is context-aware (e.g., 'allow access to document D only if user U is in role R AND query Q is about topic T').
3. Develop a strategy for auditing, logging, and explaining why a specific document was included or excluded from a result set for a given user.
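Items 2 and 3 above can be sketched together: a policy check that depends on both the user's role and the query topic, and that returns a reason string suitable for an audit log. The `Policy` class, the keyword-based `classify_topic` stand-in, and the role and topic names are all hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class Policy:
    doc_id: str
    required_role: str
    allowed_topic: str


def classify_topic(query: str) -> str:
    """Hypothetical stand-in for a real topic classifier."""
    return "litigation" if "lawsuit" in query.lower() else "general"


def decide(policy: Policy, roles: FrozenSet[str], query: str) -> Tuple[bool, str]:
    """Return (allowed, reason) so every decision can be logged and explained."""
    if policy.required_role not in roles:
        return False, f"missing role {policy.required_role}"
    topic = classify_topic(query)
    if topic != policy.allowed_topic:
        return False, f"query topic {topic} != {policy.allowed_topic}"
    return True, "role and topic matched"
```

Returning the reason alongside the decision keeps the explanation tied to the exact rule that fired, which is usually easier to audit than reconstructing it afterwards.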

Practice Projects

Beginner

Project

Build a Tenant-Aware Search API

Scenario

You are building a SaaS search feature for a multi-tenant application. Each tenant's documents must be completely isolated from others. Users should only see documents belonging to their own company (tenant).

How to Execute

1. Provision a vector database (e.g., Qdrant). Create a collection with a field 'tenant_id'.
2. Ingest sample documents, storing 'tenant_id' as a metadata field on each document alongside its vector embedding.
3. Write a Python function that takes a user's 'tenant_id' (from their auth token) and a search query, constructs a pre-filter for the database query (`filter: {'tenant_id': user_tenant_id}`), and executes the search.
4. Test by querying with different tenant IDs and verifying isolation.
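Step 3 might look roughly like the sketch below. The database call is abstracted behind a `search` callable so the example stays runnable without a server; `fake_search`, the sample documents, and the substring matching are stand-ins, not Qdrant's client API.

```python
from typing import Any, Callable, Dict, List

Filter = Dict[str, Any]
SearchFn = Callable[[str, Filter, int], List[Dict[str, Any]]]


def tenant_search(search: SearchFn, user_tenant_id: str, query: str, k: int = 5) -> List[Dict[str, Any]]:
    """Always attach the caller's tenant_id as a mandatory pre-filter."""
    if not user_tenant_id:
        raise PermissionError("tenant_id missing from auth context")
    tenant_filter: Filter = {"tenant_id": user_tenant_id}
    hits = search(query, tenant_filter, k)
    # Defense in depth: re-check isolation on whatever the backend returned.
    return [h for h in hits if h.get("tenant_id") == user_tenant_id]


# Hypothetical in-memory backend, used only to exercise the function.
_DOCS = [
    {"id": "a1", "tenant_id": "acme", "text": "Acme pricing sheet"},
    {"id": "b1", "tenant_id": "beta", "text": "Beta pricing sheet"},
]


def fake_search(query: str, flt: Filter, k: int) -> List[Dict[str, Any]]:
    """Apply the equality filter first, then a naive substring match."""
    matches = [d for d in _DOCS if all(d.get(f) == v for f, v in flt.items())]
    return [d for d in matches if query.lower() in d["text"].lower()][:k]
```

The `tenant_id` is taken from the server-side auth context rather than from the request body, so a client cannot ask for another tenant's data by changing a parameter.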

Intermediate

Project

Implement Role-Based Document Retrieval

Scenario

A knowledge base for a corporation contains documents tagged with 'department' (HR, Engineering, Finance) and 'access_level' (public, internal, confidential). An employee in the 'Engineering' department with 'internal' access should see public and internal engineering docs, but not HR confidential docs.

How to Execute

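1. Design the metadata schema and user role model. Store user roles (e.g., `{'department': 'Engineering', 'clearance': 'internal'}`) in an identity provider.
2. In the retrieval service, after user authentication, fetch the user's roles.
3. Construct a compound metadata filter. Access levels are ordered labels, so map them to numeric ranks (e.g., public = 0, internal = 1, confidential = 2) before comparing; a string comparison such as `'confidential' <= 'internal'` is alphabetical and would wrongly pass. For example: `filter: {'must': [{'department': user.department}, {'access_level_rank': {'$lte': user.clearance_rank}}]}` (operator names vary by database).
4. Implement and test edge cases, such as whether a user with 'confidential' clearance should see 'public' docs from other departments; the filter above excludes them unless an explicit OR branch for public documents is added.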
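Because access levels such as 'public', 'internal', and 'confidential' are ordered labels rather than numbers, a filter has to encode that ordering explicitly. One option, sketched here, is to expand the user's clearance into the list of labels they may see and filter with set membership. The rank table, the `$in` operator name, and the small `matches` evaluator are assumptions for illustration, not a specific database's syntax.

```python
from typing import Any, Dict, List

# Assumed ordering of access levels, lowest to highest.
ACCESS_RANK: Dict[str, int] = {"public": 0, "internal": 1, "confidential": 2}


def allowed_levels(clearance: str) -> List[str]:
    """Labels at or below the user's clearance; unknown labels raise KeyError."""
    rank = ACCESS_RANK[clearance]
    ordered = sorted(ACCESS_RANK.items(), key=lambda kv: kv[1])
    return [level for level, r in ordered if r <= rank]


def build_role_filter(user: Dict[str, str]) -> Dict[str, Any]:
    """Require a department match AND an access level within clearance."""
    return {
        "must": [
            {"department": user["department"]},
            {"access_level": {"$in": allowed_levels(user["clearance"])}},
        ]
    }


def matches(doc: Dict[str, str], flt: Dict[str, Any]) -> bool:
    """Toy evaluator for the filter shape above, for testing only."""
    for clause in flt["must"]:
        for field, cond in clause.items():
            if isinstance(cond, dict):
                if doc.get(field) not in cond["$in"]:
                    return False
            elif doc.get(field) != cond:
                return False
    return True
```

An unknown clearance label raises instead of silently producing a permissive filter, which keeps the function failing closed.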

Advanced

Project

Design a Secure, Auditable RAG Pipeline

Scenario

Design a RAG system for a legal firm where the answer to a query must only be synthesized from documents the querying lawyer is explicitly authorized to view, and every retrieval step must be logged for audit trails.

How to Execute

1. Implement a retrieval service that first enriches the user query with their full permission set (e.g., via a user graph lookup).
2. Use a vector DB with advanced filtering to apply this full ACL set at query time.
3. Implement a two-stage retrieval: first a fast, broad filter (e.g., by case file), then a fine-grained ACL filter on the shortlist.
4. Build an audit log that records the user ID, query, the applied filter, the list of candidate documents, and the final documents passed to the LLM for generation.
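Steps 3 and 4 can be sketched as a single function that runs both filter stages and writes one audit record per request. Vector ranking is left out to keep the focus on access control; the document fields, the `allowed_users` ACL shape, and the in-memory `AUDIT_LOG` list are assumptions for this example.

```python
import json
import time
from typing import Any, Dict, List

# Hypothetical corpus: each document lists the users allowed to read it.
DOCS: List[Dict[str, Any]] = [
    {"id": "d1", "case_file": "C-100", "allowed_users": {"alice", "bob"}},
    {"id": "d2", "case_file": "C-100", "allowed_users": {"bob"}},
    {"id": "d3", "case_file": "C-200", "allowed_users": {"alice"}},
]

AUDIT_LOG: List[str] = []  # stand-in for an append-only audit store


def retrieve_with_audit(user_id: str, query: str, case_file: str) -> List[str]:
    """Two-stage retrieval that records candidates and final results."""
    # Stage 1: cheap, broad filter by case file.
    candidates = [d for d in DOCS if d["case_file"] == case_file]
    # Stage 2: fine-grained ACL check on the shortlist.
    final = [d for d in candidates if user_id in d["allowed_users"]]
    record = {
        "ts": time.time(),
        "user_id": user_id,
        "query": query,
        "filter": {"case_file": case_file},
        "candidates": [d["id"] for d in candidates],
        "final": [d["id"] for d in final],
    }
    AUDIT_LOG.append(json.dumps(record, sort_keys=True))
    return record["final"]
```

Logging both the candidate list and the final list makes it possible to answer "why was this document excluded?" after the fact: a document that appears in `candidates` but not in `final` was removed by the ACL stage.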

Tools & Frameworks

Vector Databases & Search Engines

Pinecone, Weaviate, Qdrant, Elasticsearch with Vector Search

These are the primary data stores for implementing retrieval. Their native support for metadata filtering, pre-filtering, and hybrid search (keyword + vector) is foundational for building secure retrieval layers. Choose based on needed filter complexity, scalability, and operational model.

Identity & Access Management (IAM)

Auth0, Okta, AWS IAM, Keycloak

Used to authenticate users and provide the foundational claims (user ID, roles, groups, tenant) that are translated into retrieval-layer filters. The retrieval system must integrate with these services to obtain and validate user context.
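A minimal sketch of translating claims into a retrieval-layer filter follows. It assumes the token has already been verified upstream and decoded into a dictionary; `sub` is the standard OIDC subject claim, while `tenant_id`, `groups`, the `allowed_groups` metadata field, and the `$in` operator are assumptions that depend on your identity provider and database.

```python
from typing import Any, Dict, List, Mapping

# Claims the retrieval layer refuses to operate without.
REQUIRED_CLAIMS = ("sub", "tenant_id", "groups")


def claims_to_filter(claims: Mapping[str, Any]) -> Dict[str, Any]:
    """Translate already-verified token claims into a retrieval filter."""
    missing: List[str] = [c for c in REQUIRED_CLAIMS if not claims.get(c)]
    if missing:
        # Fail closed: an absent or empty claim never widens access.
        raise PermissionError(f"missing claims: {missing}")
    return {
        "tenant_id": claims["tenant_id"],
        "allowed_groups": {"$in": sorted(claims["groups"])},
    }
```

Raising on a missing or empty claim, instead of building a filter with fewer conditions, avoids the case where an incomplete token quietly produces an unrestricted query.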

Orchestration & Frameworks

LangChain (Retrieval modules), LlamaIndex (Node Post-Processors), Haystack

These frameworks provide abstractions to insert custom filtering logic into the retrieval pipeline (e.g., LlamaIndex's `MetadataFilters` passed to a retriever, or a custom node post-processor). They simplify integrating IAM context with retrieval queries, but require careful configuration to avoid security bypasses.

Interview Questions

Answer Strategy

Use a structured system design approach: 1. Data Model (ACLs in metadata), 2. Auth Integration (how to get user roles), 3. Query Pipeline (where filtering happens), 4. Fallbacks (what if filter is empty). Sample: 'I would store documents with metadata fields for department and a numeric sensitivity level. Upon user query, the service would fetch the user's department and clearance level from the IAM system. The retrieval query to the vector database would include a pre-filter requiring the document's department match the user's and its sensitivity level be <= the user's clearance. This is applied before the similarity search, ensuring unauthorized docs never enter the context window. For edge cases like cross-department projects, I would implement tag-based ACLs as an override.'

Answer Strategy

Tests negotiation skills and understanding of non-functional requirements. The core competency is balancing security/compliance with performance. Sample: 'I would acknowledge the latency concern and present data on the current performance. However, I would explain that metadata filtering is a security-critical control, not just a feature. Removing it would violate our compliance policies for data isolation. Instead, I would propose optimizing the filter chain, for example by indexing the most restrictive metadata fields first, or by using a faster, albeit less precise, pre-filter before a more expensive vector search. We can also explore caching user permission sets. I'd request a joint session with security/compliance to align on acceptable trade-offs.'
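The permission-set caching mentioned in the sample answer could be sketched as a small TTL cache in front of the IAM lookup. The `PermissionCache` class, the `loader` callback, and the 60-second default are hypothetical; the right TTL depends on how quickly revocations must take effect.

```python
import time
from typing import Callable, Dict, FrozenSet, Iterable, Tuple


class PermissionCache:
    """Cache each user's permission set for a bounded time."""

    def __init__(
        self,
        loader: Callable[[str], Iterable[str]],
        ttl_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader  # e.g. a call to the IAM system
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._entries: Dict[str, Tuple[float, FrozenSet[str]]] = {}

    def get(self, user_id: str) -> FrozenSet[str]:
        """Return cached permissions if still fresh, otherwise reload them."""
        now = self._clock()
        hit = self._entries.get(user_id)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]
        perms = frozenset(self._loader(user_id))
        self._entries[user_id] = (now, perms)
        return perms

    def invalidate(self, user_id: str) -> None:
        """Drop a user's entry, e.g. when a revocation event arrives."""
        self._entries.pop(user_id, None)
```

The trade-off is that a revoked permission can remain effective for up to one TTL unless `invalidate` is called, so the TTL is a security parameter as much as a performance one.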