Interview Prep
AI eDiscovery Specialist Interview Questions
50 expert questions ranging from beginner fundamentals to advanced AI workflow scenarios. Each question comes with a hint for structuring a strong answer.
Beginner
5 questions
A strong answer covers the EDRM stages (information governance, identification, preservation, collection, processing, review, analysis, production, presentation) and explains where most of the cost and effort concentrate (review).
The answer should distinguish relevance (material to the case issues) from privilege (protected from disclosure, e.g., attorney-client privilege or work product doctrine).
A good answer covers the obligation to preserve potentially relevant ESI when litigation is reasonably anticipated, and the consequences of spoliation if a hold fails.
The answer should mention emails, instant messages (Slack, Teams), documents, social media, cloud storage, mobile data, databases, and metadata.
A strong answer explains hash-based deduplication (MD5/SHA-1), custodian-level vs. global deduplication, and how it reduces review volume and cost.
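The difference between global and custodian-level deduplication can be sketched in a few lines. This is a minimal illustration, not how any particular processing tool works: the `(doc_id, custodian, content)` tuple layout and the `deduplicate` helper are assumptions made for the example, and real platforms typically hash a normalized set of fields (for email: sender, recipients, subject, body, attachments) rather than raw bytes.

```python
import hashlib

def sha1_of(content: bytes) -> str:
    """Return the SHA-1 hex digest used as the document fingerprint."""
    return hashlib.sha1(content).hexdigest()

def deduplicate(docs, scope="global"):
    """Keep the first copy of each document.

    docs  : list of (doc_id, custodian, content) tuples (hypothetical layout)
    scope : "global" keeps one copy across all custodians,
            "custodian" keeps one copy per custodian.
    """
    seen = set()
    kept = []
    for doc_id, custodian, content in docs:
        digest = sha1_of(content)
        # The dedup key decides the scope: hash alone, or hash per custodian.
        key = digest if scope == "global" else (custodian, digest)
        if key not in seen:
            seen.add(key)
            kept.append(doc_id)
    return kept

docs = [
    ("D1", "alice", b"Q3 forecast attached."),
    ("D2", "bob",   b"Q3 forecast attached."),
    ("D3", "alice", b"Q3 forecast attached."),
    ("D4", "bob",   b"Lunch on Friday?"),
]

global_kept = deduplicate(docs, scope="global")
custodian_kept = deduplicate(docs, scope="custodian")
print(global_kept)      # ['D1', 'D4']
print(custodian_kept)   # ['D1', 'D2', 'D4']
```

Global deduplication removes more documents, while custodian-level deduplication preserves the fact that each custodian held a copy, which can matter when possession itself is relevant.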
Intermediate
10 questions
The answer should cover TAR 1.0 (train-then-predict with a seed set and cutoff) vs. TAR 2.0 (continuous active learning, where the model keeps retraining as reviewers code documents and review continues until recall targets are met).
A strong answer discusses stratified random sampling, richness-based sampling, risks of cherry-picking obvious documents, and the impact of seed set quality on model performance.
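As an illustration of stratified random sampling for a seed or control set, the sketch below draws the same number of documents from each stratum so that small custodians are not drowned out by large ones. The `stratified_sample` helper, the choice of custodian as the stratum, and the fixed seed are all assumptions for the example; a real protocol would define its own strata and sample sizes.

```python
import random
from collections import defaultdict

def stratified_sample(docs, per_stratum, seed=42):
    """Draw up to `per_stratum` random documents from each stratum.

    docs : list of (doc_id, stratum) pairs; the stratum here is a custodian,
           but it could be a date range or file type.
    seed : fixed so the draw is reproducible and can be documented.
    """
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for doc_id, stratum in docs:
        by_stratum[stratum].append(doc_id)

    sample = {}
    for stratum, ids in sorted(by_stratum.items()):
        k = min(per_stratum, len(ids))   # small strata contribute all they have
        sample[stratum] = rng.sample(ids, k)
    return sample

docs = (
    [(f"A{i}", "alice") for i in range(50)]
    + [(f"B{i}", "bob") for i in range(5)]
    + [("C0", "carol")]
)
sample = stratified_sample(docs, per_stratum=3)
```

Recording the seed alongside the sample is one simple way to make the selection reproducible if the methodology is later questioned.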
The answer should cover shingling/Simhash for near-duplicates, email threading algorithms that collapse conversation chains, and how both reduce redundant review.
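A minimal sketch of shingle-based near-duplicate detection follows. It compares word 3-gram sets with Jaccard similarity; the `shingles` and `jaccard` helpers and the shingle size are assumptions for the example. Production systems usually avoid pairwise comparison by using a compact fingerprint such as SimHash or MinHash, which this sketch does not implement.

```python
def shingles(text, k=3):
    """Return the set of overlapping k-word shingles in the text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b, k=3):
    """Jaccard similarity of two texts' shingle sets (0.0 to 1.0)."""
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb)

doc_a = "please review the attached draft contract before friday"
doc_b = "please review the attached draft contract before monday"
doc_c = "lunch is at noon in the cafeteria today"

sim_ab = jaccard(doc_a, doc_b)   # 5 shared shingles out of 7 distinct
sim_ac = jaccard(doc_a, doc_c)   # no shared shingles
print(round(sim_ab, 3), sim_ac)  # 0.714 0.0
```

A review workflow would group documents whose similarity exceeds a chosen threshold and route each group to one reviewer.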
A strong answer covers elusion testing (testing the unreviewed set for responsive documents), recall estimation, confidence intervals, and the acceptance threshold.
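The arithmetic behind an elusion test can be sketched as follows. The function names and the example counts are hypothetical, and the Wilson score interval is one reasonable choice of confidence interval assumed here, not something the hint prescribes; a validation protocol may specify a different method.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives about 95%)."""
    p = successes / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)

def estimate_recall(found_responsive, null_set_size, sample_size, sample_responsive):
    """Estimate recall from a random sample of the unreviewed (null) set."""
    elusion = sample_responsive / sample_size
    missed = elusion * null_set_size                 # projected responsive docs left behind
    recall = found_responsive / (found_responsive + missed)

    lo, hi = wilson_interval(sample_responsive, sample_size)
    # A higher elusion rate means more missed documents, hence lower recall.
    recall_low = found_responsive / (found_responsive + hi * null_set_size)
    recall_high = found_responsive / (found_responsive + lo * null_set_size)
    return {
        "elusion_rate": elusion,
        "recall": recall,
        "recall_low": recall_low,
        "recall_high": recall_high,
    }

# Hypothetical matter: 9,000 responsive documents found, 100,000 left unreviewed,
# and 15 responsive documents in a 1,500-document random sample of the null set.
result = estimate_recall(9000, 100000, 1500, 15)
print(result["elusion_rate"], round(result["recall"], 3))  # 0.01 0.9
```

The point estimate is compared against the agreed acceptance threshold, and the interval shows how much that estimate could move given the sample size.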
The answer should address GDPR restrictions on personal data transfer, standard contractual clauses, data minimization, redaction/anonymization strategies, and the Hague Evidence Convention.
A good answer references Federal Rule of Civil Procedure 26(b)(1), cost-benefit balancing, and how AI-driven reductions in review cost can shift the proportionality calculation.
The answer should cover Rule 26(b)(5) requirements (description of withheld documents without revealing privileged content), and how LLMs can draft log entries from document metadata and content.
A strong answer walks through ingestion, metadata extraction, text extraction, deduplication, date filtering, domain filtering, threading, and loading into a review platform.
The answer should discuss precision/recall trade-offs, the Da Silva Moore v. Publicis Groupe decision (S.D.N.Y. 2012) that first gave judicial approval to TAR, and hybrid approaches combining both methods.
A strong answer covers LDA, BERTopic, or similar approaches, how clustering reveals themes, and how this supports issue coding and deposition preparation.
Advanced
10 questions
The answer should cover iterative training loops, active learning sampling strategies (uncertainty sampling, margin sampling), reviewer batch design, stop criteria, elusion testing, and defensible documentation.
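Uncertainty sampling can be illustrated without any ML library: given the model's current responsiveness scores, the next review batch is the set of documents the model is least sure about. The `uncertainty_sample` helper and the score dictionary are assumptions for the example. For a binary classifier, margin sampling (smallest gap between the two class probabilities) selects the same documents.

```python
def uncertainty_sample(scores, batch_size):
    """Pick the documents whose predicted probability is closest to 0.5.

    scores : dict mapping doc_id -> predicted probability of responsiveness
    """
    ranked = sorted(scores, key=lambda doc_id: abs(scores[doc_id] - 0.5))
    return ranked[:batch_size]

# Hypothetical scores from the current model iteration.
scores = {"D1": 0.97, "D2": 0.52, "D3": 0.08, "D4": 0.45, "D5": 0.71}

next_batch = uncertainty_sample(scores, batch_size=2)
print(next_batch)  # ['D2', 'D4']
```

In an iterative loop, reviewers code this batch, the model is retrained on the enlarged labeled set, and the scores are recomputed before the next draw.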
A strong answer covers domain-specific fine-tuning on legal corpora, handling class imbalance (privileged docs are rare), distinguishing attorney-client privilege from work product, and the higher stakes of false negatives.
The answer should cover vector embeddings with OpenAI or HuggingFace, chunking strategies for legal documents, metadata filtering, retrieval ranking, and LLM-based classification with citations back to source documents.
A strong answer covers documenting the full TAR protocol, presenting precision/recall metrics, elusion test results, seed set methodology, QC sampling results, and citing case law supporting TAR defensibility.
The answer should cover stratified performance evaluation by language/custodian, balanced sampling, multilingual models, bias auditing, and recalibration strategies.
A strong answer addresses S3 storage tiers, spot instances for batch processing, Lambda for lightweight extraction, Textract pricing for OCR, model inference optimization, and lifecycle policies for archival.
The answer should discuss threshold calibration, human review of borderline documents, buffer zone strategies, and how to document the decision rationale for defensibility.
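A buffer zone strategy can be sketched as two thresholds instead of one: confident scores are auto-coded and everything in between goes to a human. The `route` helper and the 0.35 and 0.65 cutoffs are placeholders assumed for the example; in practice the thresholds would come from calibration against a validation sample.

```python
def route(score, lower=0.35, upper=0.65):
    """Route a document by classifier score, with a human-review buffer zone."""
    if score >= upper:
        return "responsive"
    if score < lower:
        return "not_responsive"
    return "human_review"   # borderline scores are never auto-coded

# Hypothetical scores for four documents.
decisions = {doc_id: route(s) for doc_id, s in
             {"D1": 0.91, "D2": 0.50, "D3": 0.12, "D4": 0.65}.items()}
print(decisions)
# {'D1': 'responsive', 'D2': 'human_review', 'D3': 'not_responsive', 'D4': 'responsive'}
```

Logging the thresholds in force and the routing decision for each document gives a record that supports the documented rationale.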
A strong answer addresses data privacy risks, API data retention policies, the need for enterprise agreements, zero-retention endpoints, and bar association opinions on using AI with client confidences.
The answer should cover named entity recognition with spaCy or AWS Comprehend, regex patterns for SSNs/financial data, confidence thresholds, human QC sampling, and redaction permanence verification.
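For the regex portion, a minimal sketch of SSN detection and replacement is shown below. The pattern covers only the dashed `NNN-NN-NNNN` format and excludes number ranges that are never issued (area 000, 666, or 900 to 999; group 00; serial 0000). The `redact_ssns` helper and the placeholder text are assumptions for the example, and text replacement alone does not make a redaction permanent in an image or PDF.

```python
import re

# Dashed SSNs only; lookaheads reject area/group/serial values that are never issued.
SSN_PATTERN = re.compile(r"\b(?!000|666|9\d\d)\d{3}-(?!00)\d{2}-(?!0000)\d{4}\b")

def redact_ssns(text, placeholder="[REDACTED-SSN]"):
    """Return (redacted_text, number_of_redactions)."""
    return SSN_PATTERN.subn(placeholder, text)

text = "Employee SSN 123-45-6789 on file; ticket 000-12-3456 is not an SSN."
redacted, count = redact_ssns(text)
print(redacted)
# Employee SSN [REDACTED-SSN] on file; ticket 000-12-3456 is not an SSN.
print(count)  # 1
```

Regex hits like these are typically combined with NER output and then sampled for human QC, since patterns alone miss unformatted numbers and flag look-alikes.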
A strong answer provides a specific scenario (e.g., coded language in financial fraud), explains the semantic search or embedding-based approach, and quantifies the improvement with metrics.
Scenario-Based
10 questions
A strong answer covers prioritization by custodian relevance, parallel processing, TAR for early prioritization, defensible sampling, and phased production strategy.
The answer should cover metadata analysis (creation vs. modification dates, author field anomalies), batch comparison tools, forensic preservation of findings, and reporting to counsel.
A strong answer discusses analyzing error patterns, expanding seed set diversity, adjusting sampling strategy, reviewing false negatives for pattern recognition, and setting a recall target aligned with proportionality.
The answer should cover multilingual models (XLM-R, mBERT), language-specific preprocessing, per-language performance benchmarking, translation for QC, and handling code-switching in documents.
A strong answer covers SOC 2 compliance, encryption at rest and in transit, data residency requirements, access controls, and the option for on-premises or hybrid deployment architectures.
The answer should address short-message challenges for NLP, conversation threading in channels vs. DMs, emoji/reaction analysis, temporal clustering, and the higher volume-to-relevance ratio.
A strong answer covers production audit logs, privilege QA workflow improvements, second-pass privilege screening with AI, and updating the TAR model to flag privilege-risk documents.
The answer should discuss domain shift analysis, feature importance examination, fine-tuning on new domain data, active learning to quickly adapt, and evaluating whether a fresh model outperforms transfer.
A strong answer clarifies the boundary between eDiscovery and litigation analytics, discusses what's feasible (sentiment analysis, strength indicators) vs. what requires legal expertise, and manages expectations.
The answer should cover Concordance/Relativity load file formats (DAT/OPT/LFP), field mapping, metadata validation, image numbering, OCR text file alignment, and hash verification of the production set.
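The hash verification step can be sketched as a comparison between the files actually delivered and a manifest of expected digests. This example works on in-memory bytes so it is self-contained; the `verify_production` helper, the file names, and the use of SHA-256 are assumptions for illustration, and a real check would read files from the production volume and use whatever hash the production specification names.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 hex digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()

def verify_production(files, manifest):
    """Compare delivered files against a manifest of expected hashes.

    files    : dict of file name -> file bytes as delivered
    manifest : dict of file name -> expected hex digest
    Returns a dict of file name -> problem description (empty if all match).
    """
    problems = {}
    for name, expected in manifest.items():
        if name not in files:
            problems[name] = "missing"
        elif sha256_hex(files[name]) != expected.lower():
            problems[name] = "hash mismatch"
    for name in files:
        if name not in manifest:
            problems[name] = "not in manifest"
    return problems

# Hypothetical production set and the manifest generated when it was built.
original = {"ABC0001.tif": b"page one", "ABC0002.tif": b"page two"}
manifest = {name: sha256_hex(data) for name, data in original.items()}

delivered = dict(original)
delivered["ABC0002.tif"] = b"page two (altered)"   # simulate corruption in transit

clean_result = verify_production(original, manifest)
problems = verify_production(delivered, manifest)
print(clean_result)  # {}
print(problems)      # {'ABC0002.tif': 'hash mismatch'}
```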
AI Workflow & Tools
10 questions
A strong answer covers data preprocessing, tokenization, fine-tuning a BERT-based model on coded documents, evaluation with legal-domain metrics, model versioning, and integration with the review platform.
The answer should cover chain-of-thought prompting, sequential chains or LCEL pipelines, output parsing with Pydantic models, retry logic for robustness, and cost/latency considerations.
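The output-parsing and retry ideas can be shown without depending on any framework. The sketch below uses a standard-library dataclass as a stand-in for a Pydantic model and a stubbed `fake_model` function in place of a real LLM call; the `Classification` schema, its field names, and the allowed labels are all assumptions made for the example.

```python
import json
from dataclasses import dataclass

ALLOWED_LABELS = {"responsive", "not_responsive"}

@dataclass
class Classification:
    doc_id: str
    label: str
    rationale: str

def parse_classification(raw: str) -> Classification:
    """Parse and validate the model's raw text output; raise on anything malformed."""
    data = json.loads(raw)
    result = Classification(
        doc_id=str(data["doc_id"]),
        label=str(data["label"]),
        rationale=str(data["rationale"]),
    )
    if result.label not in ALLOWED_LABELS:
        raise ValueError(f"unexpected label: {result.label}")
    return result

def classify_with_retry(call_model, prompt, max_attempts=3):
    """Call the model until its output parses, up to max_attempts times."""
    last_error = None
    for _ in range(max_attempts):
        raw = call_model(prompt)
        try:
            return parse_classification(raw)
        except (KeyError, TypeError, ValueError) as exc:  # JSONDecodeError is a ValueError
            last_error = exc
    raise RuntimeError(f"no valid output after {max_attempts} attempts") from last_error

# Stub standing in for an LLM: first reply is chatty prose, second is valid JSON.
_replies = iter([
    "Sure! Here is the answer.",
    '{"doc_id": "D7", "label": "responsive", "rationale": "Discusses the disputed contract."}',
])

def fake_model(prompt):
    return next(_replies)

result = classify_with_retry(fake_model, "Classify document D7")
print(result.label)  # responsive
```

Each retry is another paid model call, so the attempt limit is also a cost and latency control.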
A strong answer covers chunking strategies for long documents, embedding model selection, vector store options (Pinecone, Weaviate, FAISS), metadata filtering for date/custodian constraints, and relevance ranking.
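Metadata filtering followed by relevance ranking can be sketched with plain lists in place of a vector store. The two-dimensional vectors, the chunk dictionary layout, and the `search` helper are toy assumptions for illustration; a real system would obtain embeddings from a model and delegate filtering and nearest-neighbor search to the vector store.

```python
import math
from datetime import date

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def search(query_vec, chunks, custodian=None, start=None, end=None, top_k=3):
    """Filter chunks by metadata first, then rank the survivors by similarity."""
    candidates = [
        c for c in chunks
        if (custodian is None or c["custodian"] == custodian)
        and (start is None or c["sent"] >= start)
        and (end is None or c["sent"] <= end)
    ]
    ranked = sorted(candidates, key=lambda c: cosine(query_vec, c["vector"]), reverse=True)
    return [c["id"] for c in ranked[:top_k]]

# Toy index with hypothetical chunk ids, custodians, dates, and vectors.
chunks = [
    {"id": "C1", "custodian": "alice", "sent": date(2021, 3, 1), "vector": [0.9, 0.1]},
    {"id": "C2", "custodian": "bob",   "sent": date(2021, 3, 5), "vector": [1.0, 0.0]},
    {"id": "C3", "custodian": "alice", "sent": date(2020, 1, 1), "vector": [1.0, 0.0]},
    {"id": "C4", "custodian": "alice", "sent": date(2021, 4, 1), "vector": [0.0, 1.0]},
]

hits = search([1.0, 0.0], chunks, custodian="alice", start=date(2021, 1, 1))
print(hits)  # ['C1', 'C4']
```

Filtering before ranking keeps out-of-scope documents from ever reaching the ranking step, which matters when date and custodian limits are part of the agreed discovery scope.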
The answer should cover Textract's asynchronous API for batch OCR, Comprehend custom classifiers, handling multi-page documents, table extraction, and cost optimization with S3-based workflows.
A strong answer covers Git for code and configs, DVC or MLflow for model versioning, Docker containers for environment reproducibility, and matter-specific configuration files.
The answer should cover structured output prompts (JSON mode), confidence scoring, human QC sampling, hallucination mitigation with RAG from the actual document, and formatting compliance with Rule 26(b)(5).
A strong answer compares Relativity's built-in Active Learning (priority queue, project settings) with custom scikit-learn implementations, discussing trade-offs in flexibility, defensibility documentation, and ease of use.
The answer should cover pre-trained NER models, fine-tuning on legal entity types (judge names, case numbers, financial account numbers), handling false positives, and integrating NER results into a redaction workflow.
A strong answer covers Apache Airflow or similar orchestration, format-specific extractors (PST, MBOX, SharePoint), metadata normalization, text extraction, deduplication, and loading into Elasticsearch or a review platform.
The answer should cover tracking prediction distributions over time, comparing against baseline distributions, sampling for human validation, automated retraining triggers, and dashboard visualization with Grafana or similar tools.
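One way to compare a current prediction distribution against a baseline is the Population Stability Index (PSI), sketched below. PSI is an assumed choice of metric here, not one the hint names, and the `histogram` and `psi` helpers, bin count, and example scores are hypothetical. A commonly cited rule of thumb treats a PSI above roughly 0.25 as a significant shift, but any alerting threshold should be validated for the matter at hand.

```python
import math

def histogram(scores, bins=10, eps=1e-6):
    """Share of scores in each equal-width bin on [0, 1], floored at eps."""
    counts = [0] * bins
    for s in scores:
        idx = min(int(s * bins), bins - 1)   # a score of 1.0 falls in the last bin
        counts[idx] += 1
    total = len(scores)
    return [max(c / total, eps) for c in counts]   # eps avoids log(0)

def psi(baseline, current, bins=10):
    """Population Stability Index between two sets of prediction scores."""
    expected = histogram(baseline, bins)
    actual = histogram(current, bins)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))

# Hypothetical classifier scores at deployment and in a later batch.
baseline_scores = [0.1, 0.2, 0.3, 0.8, 0.9]
later_scores = [0.6, 0.7, 0.8, 0.9, 0.95]

no_drift = psi(baseline_scores, baseline_scores)
drift = psi(baseline_scores, later_scores)
print(no_drift)        # 0.0
print(drift > 0.25)    # True
```

A monitoring job could compute this per batch, push the value to a dashboard, and trigger human validation sampling when it crosses the chosen threshold.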
Behavioral
5 questions
A strong answer demonstrates attention to detail, proactive problem identification, clear communication to stakeholders, and a systematic approach to remediation.
The answer should demonstrate the ability to translate technical concepts into practical legal implications, using analogies and visualizations rather than jargon.
A strong answer shows pragmatic prioritization, creative use of sampling to validate quickly, transparent communication about trade-offs, and adherence to defensibility standards.
The answer should mention specific sources (Sedona Conference, ILTA, Legaltech News, Relativity Fest), professional communities, and a habit of continuous learning.
A strong answer demonstrates stakeholder management, clear communication of technical constraints, prioritization frameworks, and the ability to find solutions that satisfy multiple requirements.