Interview Prep
AI Dark Data Analyst Interview Questions
50 expert questions covering beginner fundamentals to advanced AI workflow scenarios. Each answer includes a hint for structured responses.
Beginner
5 questions
A great answer defines dark data as collected but unanalyzed data, cites the often-quoted estimate that 55-90% of an organization's data is dark, and explains the cost/risk/opportunity triad.
Cover emails, log files, sensor/IoT streams, scanned documents, images/video, chat logs, call recordings, and legacy database exports.
Use concrete examples: relational tables (structured), JSON/XML logs (semi-structured), free-text emails and images (unstructured).
Mention PyPDF2/pdfplumber for native PDFs, Tesseract/PaddleOCR for scanned images, and a batching approach for scale.
A strong answer covers listing objects, sampling file types/sizes/extensions, reading metadata headers, and building a summary distribution.
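As a rough illustration of the summary distribution mentioned in the hint above, the sketch below groups an object listing by file extension. It assumes the listing has already been fetched as (key, size in bytes) pairs; the function name and the sample keys are hypothetical, and no real storage API is called.

```python
from collections import Counter
from pathlib import PurePosixPath


def profile_objects(objects):
    """Summarize (key, size_bytes) pairs by lowercased file extension."""
    counts = Counter()
    bytes_by_ext = Counter()
    for key, size in objects:
        # Keys without an extension are grouped under "(none)".
        ext = PurePosixPath(key).suffix.lower() or "(none)"
        counts[ext] += 1
        bytes_by_ext[ext] += size
    total = sum(counts.values())
    return {
        ext: {"files": n, "share": n / total, "bytes": bytes_by_ext[ext]}
        for ext, n in counts.most_common()
    }


# Hypothetical listing, standing in for the output of a bucket-listing call.
listing = [
    ("logs/2021/app.log", 1200),
    ("scans/invoice_001.PDF", 48000),
    ("scans/invoice_002.pdf", 52000),
    ("exports/README", 300),
]
summary = profile_objects(listing)
```

On a large bucket the same function can be run over a random sample of the listing rather than every object.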
Intermediate
10 questions
Discuss scoring axes: data volume, estimated business value, compliance risk, processing cost, and stakeholder demand. Show a weighted matrix approach.
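One way to show the weighted matrix in an interview is a few lines of code. The sketch below assumes each axis is scored 1 to 5, that the weights are illustrative and sum to 1.0, and that a higher compliance risk raises priority while a higher processing cost lowers it. The source names and scores are made up.

```python
# Hypothetical weights; a real team would agree these with stakeholders.
WEIGHTS = {
    "volume": 0.10,
    "business_value": 0.35,
    "compliance_risk": 0.25,
    "processing_cost": 0.15,
    "stakeholder_demand": 0.15,
}
# Axes where a high raw score should reduce priority.
LOWER_IS_BETTER = {"processing_cost"}


def priority_score(scores):
    """Weighted sum of 1-5 axis scores, inverting lower-is-better axes."""
    total = 0.0
    for axis, weight in WEIGHTS.items():
        value = scores[axis]
        if axis in LOWER_IS_BETTER:
            value = 6 - value  # map 1..5 onto 5..1
        total += weight * value
    return total


# Made-up sources and scores, for illustration only.
sources = {
    "support_emails": {"volume": 4, "business_value": 5, "compliance_risk": 4,
                       "processing_cost": 2, "stakeholder_demand": 5},
    "legacy_exports": {"volume": 2, "business_value": 2, "compliance_risk": 1,
                       "processing_cost": 4, "stakeholder_demand": 1},
}
ranked = sorted(sources, key=lambda name: priority_score(sources[name]), reverse=True)
```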
Explain the index-retrieve-generate pattern, why it beats fine-tuning for heterogeneous corpora, and how it provides grounded, citable answers.
Discuss semantic chunking over fixed-size splits, overlap windows, section-aware splitting, and the trade-off between retrieval granularity and context.
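A minimal sketch of section-aware splitting with overlap windows, assuming sections are separated by blank lines and that chunk size is counted in whitespace-separated words (a real pipeline would more likely count tokens). The function names are illustrative.

```python
def window(words, size, overlap):
    """Slide a window of `size` words, repeating `overlap` words between chunks."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the section
    return chunks


def chunk_sections(text, size=200, overlap=40):
    """Chunk each blank-line-separated section on its own, so no chunk spans two sections."""
    chunks = []
    for section in text.split("\n\n"):
        words = section.split()
        if words:
            chunks.extend(window(words, size, overlap))
    return chunks


example = chunk_sections("a b c d e f g\n\nh i", size=4, overlap=1)
```

Smaller windows give finer retrieval granularity but less context per chunk, which is the trade-off the hint refers to.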
Cover NER basics (person, org, location, product), choosing pre-trained vs. fine-tuned models, handling noisy text, and output schema design.
Discuss automated profiling with Great Expectations, schema inference, encoding detection (chardet), normalization rules, and quarantine pipelines for unparseable records.
Cover provenance, schema, freshness date, file format, owner, PII flags, quality score, business domain tags, and linkage to upstream systems.
Discuss human-in-the-loop spot checks, confidence scoring, cross-validation against known ground truth samples, and citation tracing.
Cover model choice rationale, preprocessing steps, hyperparameter tuning (number of topics), and how you'd present topics to non-technical stakeholders.
Discuss accuracy on complex layouts, cost at scale, latency, language support, and the hybrid approach of routing by document complexity.
Cover stratified sampling by data type and source system, confidence intervals, progressive expansion, and how you'd report coverage to stakeholders.
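The stratified sampling step can be sketched as follows. It assumes records are dictionaries with hypothetical "type" and "source" fields, draws the same fraction from every stratum, and keeps at least one record per stratum so rare sources are not lost.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, fraction, min_per_stratum=1, seed=0):
    """Sample `fraction` of each stratum, with a floor of `min_per_stratum` records."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)
    sample = []
    for name in sorted(strata):
        group = strata[name]
        n = max(min_per_stratum, round(len(group) * fraction))
        sample.extend(rng.sample(group, min(n, len(group))))
    return sample


# Made-up inventory: 100 PDFs from a CRM, 10 IoT logs, 1 ERP export.
records = (
    [{"type": "pdf", "source": "crm", "id": i} for i in range(100)]
    + [{"type": "log", "source": "iot", "id": i} for i in range(10)]
    + [{"type": "json", "source": "erp", "id": 0}]
)
sample = stratified_sample(records, key=lambda r: (r["type"], r["source"]), fraction=0.1)
```

Progressive expansion then amounts to re-running with a larger fraction once the first sample has been reviewed.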
Advanced
10 questions
Discuss data asset valuation, anonymization pipelines, API-based data product delivery, marketplace integration, usage analytics, and compliance gating.
Cover cost, data volume requirements, latency, hallucination risk, freshness of knowledge, and the hybrid fine-tune-plus-RAG approach for maximum accuracy.
Discuss modality-specific extractors, unified embedding space (CLIP-style), cross-modal retrieval, and the orchestration challenges of heterogeneous processing.
Cover time-series feature engineering, unsupervised models (isolation forest, autoencoders), alerting thresholds, and how LLMs can help interpret anomalies in natural language.
Discuss PII detection models (Microsoft Presidio, spaCy NER), redaction-before-inference pipelines, on-premises LLM options, and audit logging.
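To make the redaction-before-inference idea concrete, here is a deliberately simplified sketch. It uses two regular expressions as a stand-in for a real detector such as Presidio or a spaCy NER model, so it only catches email addresses and phone-like digit runs. The patterns and labels are assumptions, not a production rule set.

```python
import re

# Simplified stand-in patterns; a real pipeline would use a trained PII detector.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\+?\d[\d ()-]{7,}\d"),
}


def redact(text):
    """Replace matches with placeholders and return counts for the audit log."""
    audit = {}
    for label, pattern in PATTERNS.items():
        text, count = pattern.subn(f"[{label}]", text)
        audit[label] = count
    return text, audit


clean, audit = redact("Contact jane.doe@example.com or +1 (555) 010-9999 today.")
# Only `clean` would be sent to the LLM; `audit` goes to the audit log.
```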
Cover indexing algorithm (HNSW vs IVF), recall/latency benchmarks, filtering capabilities, cost, managed vs. self-hosted, and incremental indexing support.
Discuss graph-based lineage tracking, integration with dbt/Airflow metadata, provenance chains for LLM-generated summaries, and audit-readiness for regulated industries.
Discuss schema inference libraries, LLM-assisted column mapping, canonical data models, and the role of a semantic layer in unifying heterogeneous fields.
Cover calibrated probability outputs, ensemble agreement across multiple models, retrieval score as a proxy for grounding, and human calibration loops.
Discuss scheduled ingestion DAGs, incremental processing, drift detection in source data, alerting on pipeline failures, and continuous retraining of extraction models.
Scenario-Based
10 questions
Cover data extraction and de-identification, NER for symptoms/conditions, temporal pattern analysis, clinical validation partnerships, and IRB/ethical considerations.
Discuss PII/PHI scanning, sentiment and intent classification, communication network analysis, regulatory keyword detection, and a phased risk-scoring report.
Cover telemetry profiling, feature engineering on raw sensor streams, anomaly detection model training, validation with known failure events, and deployment to a monitoring dashboard.
Discuss retrieval quality audit (are the right chunks being retrieved?), chunk size tuning, stricter citation requirements, retrieval score thresholds, and fallback-to-human workflows.
Cover language detection, multilingual NER models (XLM-R, mBERT), cross-language topic modeling, translation-as-a-preprocessing option, and unified taxonomy design.
Discuss JSON schema inference at scale, sampling and profiling, identifying key fields vs. noise, building a canonical schema, and creating a data dictionary.
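A small sketch of the sampling-and-profiling step, assuming the sample is newline-delimited JSON in which every record is a flat object. It counts the observed types for each field and how often the field is present, which helps separate key fields from noise. The field names in the example are made up.

```python
import json
from collections import Counter, defaultdict


def infer_schema(json_lines):
    """Count observed value types and presence rate for each top-level field."""
    seen = defaultdict(Counter)
    total = 0
    for line in json_lines:
        record = json.loads(line)
        total += 1
        for field, value in record.items():
            seen[field][type(value).__name__] += 1
    return {
        field: {"types": dict(types), "presence": sum(types.values()) / total}
        for field, types in seen.items()
    }


# Made-up sample records.
lines = [
    '{"id": 1, "amount": 9.5}',
    '{"id": 2, "amount": null}',
    '{"id": "3", "debug": true}',
    '{"id": 4, "amount": 1.0}',
]
schema = infer_schema(lines)
```

Fields with mixed types (here `id`) are candidates for normalization in the canonical schema, and rarely present fields (here `debug`) are candidates for exclusion.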
Cover OCR/HTR (handwriting recognition) pipeline, quality scoring, human-in-the-loop correction, Elasticsearch indexing, accessibility requirements, and cost estimation.
Emphasize immediate escalation to legal/compliance, documentation of findings, no further analysis until policy review, and recommending a transparent remediation plan.
Discuss volume limits, data privacy concerns with public APIs, need for structured pipelines rather than ad-hoc queries, phased approach, and proof-of-concept framing.
Discuss automated quality scoring (WER estimation), confidence-weighted analysis, prioritizing high-quality transcripts first, and investing in a custom ASR model for the poor-quality audio domain.
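WER can only be measured exactly against a reference transcript, so one common way to calibrate an automated estimate is to hand-transcribe a small sample and compute true WER on it. The sketch below does that with a word-level edit distance; the function name is illustrative and no text normalization (casing, punctuation) is applied.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance (substitutions, insertions, deletions) over reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] holds the distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

Transcripts can then be weighted or filtered by their estimated WER before sentiment or topic analysis.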
AI Workflow & Tools
10 questions
Cover document loaders, text splitters, embedding model choice, vector store, retriever configuration (similarity, MMR), prompt template, chain type, and output parsers.
Discuss dataset creation and annotation, model selection (BERT-base vs. domain-specific), training with Trainer API, evaluation metrics (precision/recall/F1), and deployment via Inference API.
Cover task definitions (sensor for S3, extraction operator, LLM API call task, Snowflake load operator), dependencies, retry logic, and alerting on failures.
Discuss defining a JSON schema for contract fields, crafting system prompts, handling partial extractions, retrying on malformed outputs, and validating against the schema programmatically.
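The validate-and-retry loop can be sketched without committing to any particular LLM client. In the sketch below, `call_model` is a hypothetical callable that takes the document and a list of error messages from the previous attempt, and the contract fields are made up. A hand-written type check stands in for a full JSON Schema validator.

```python
import json

# Hypothetical contract fields and their expected Python types.
REQUIRED_FIELDS = {
    "party_a": str,
    "party_b": str,
    "effective_date": str,
    "term_months": int,
}


def validate(raw):
    """Return (data, errors); data is None when the output is unusable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return None, ["top-level value is not an object"]
    errors = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in data:
            errors.append(f"missing field: {field}")
        elif not isinstance(data[field], expected):
            errors.append(f"wrong type for {field}")
    return (data if not errors else None), errors


def extract_with_retry(call_model, document, max_attempts=3):
    """Call the model until its output validates, feeding errors back each time."""
    feedback = []
    for _ in range(max_attempts):
        raw = call_model(document, feedback)
        data, errors = validate(raw)
        if data is not None:
            return data
        feedback = errors
    raise ValueError(f"no valid extraction after {max_attempts} attempts: {feedback}")
```

Partial extractions could be supported by marking some fields optional instead of failing the whole record.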
Discuss reciprocal rank fusion, querying both systems in parallel, merging and re-ranking results, and tuning the keyword-vs-semantic weight balance.
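Reciprocal rank fusion itself fits in a few lines. The sketch below assumes each system returns an ordered list of document IDs; `k=60` is a commonly used smoothing constant, and the optional `weights` argument is one simple way to tune the keyword-vs-semantic balance. The document IDs are made up.

```python
def reciprocal_rank_fusion(rankings, k=60, weights=None):
    """Fuse ranked ID lists: each list adds weight / (k + rank) to a document's score."""
    weights = weights or [1.0] * len(rankings)
    scores = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    # Highest score first; ties are broken by ID for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


keyword_hits = ["d1", "d2", "d3"]   # e.g. from a keyword index
semantic_hits = ["d3", "d1", "d4"]  # e.g. from a vector store
fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
```

Because the method uses ranks rather than raw scores, the two systems' scores do not need to be on the same scale.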
Cover defining expectation suites (column types, value ranges, null thresholds, uniqueness), checkpoint configuration, and integrating validation into the Airflow DAG as a quality gate.
Discuss Prefect flows and tasks, dynamic task generation, conditional branching (if PDF → OCR path, if JSON → parse path), and results caching for idempotency.
Cover building a golden test set of (query, expected document) pairs, measuring recall@k, MRR, and nDCG, and using tools like RAGAS or custom evaluation scripts.
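A custom evaluation script for the three metrics can be quite short. The sketch below assumes binary relevance, a non-empty set of relevant document IDs per query, and a ranked list of retrieved IDs; MRR is then the mean of `reciprocal_rank` across all golden queries. The IDs are made up.

```python
import math


def recall_at_k(ranked, relevant, k):
    """Fraction of relevant documents that appear in the top k results."""
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance nDCG: discounted gain divided by the best achievable gain."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal


# One golden pair: the expected document "d1" was retrieved at rank 2.
ranked = ["d2", "d1", "d3"]
relevant = {"d1"}
metrics = {
    "recall@1": recall_at_k(ranked, relevant, 1),
    "recall@3": recall_at_k(ranked, relevant, 3),
    "rr": reciprocal_rank(ranked, relevant),
    "ndcg@3": ndcg_at_k(ranked, relevant, 3),
}
```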
Discuss how the engine decomposes a complex question into sub-questions, routes each to the appropriate tool/index, and synthesizes the final answer from sub-answers.
Discuss correction UI, storing corrections as training data, periodic fine-tuning or few-shot example updates, and tracking extraction accuracy metrics over time.
Behavioral
5 questions
Look for investigative curiosity, persistence through messy data, stakeholder persuasion skills, and measurable business impact.
Assess ability to translate technical findings into business language, use of visuals, framing in terms of ROI or risk, and handling of follow-up questions.
Look for adaptability, creative problem-solving, willingness to change scope, and transparent communication about limitations.
Assess learning habits (communities, papers, courses), practical application of new knowledge, and ability to distinguish hype from genuine utility.
Look for respectful disagreement backed by evidence, willingness to compromise, and focus on shared goals rather than being right.