Interview Prep

AI Dark Data Analyst Interview Questions

50 expert questions covering beginner fundamentals to advanced AI workflow scenarios. Each answer includes a hint for structured responses.

Beginner: 5 | Intermediate: 10 | Advanced: 10 | Scenario-Based: 10 | AI Workflow & Tools: 10 | Behavioral: 5

Beginner

5 questions
What a great answer covers:

A great answer defines dark data as data that is collected and stored but never analyzed, cites the commonly quoted estimate that 55-90% of organizational data falls into this category, and explains the cost/risk/opportunity triad.

What a great answer covers:

Cover emails, log files, sensor/IoT streams, scanned documents, images/video, chat logs, call recordings, and legacy database exports.

What a great answer covers:

Use concrete examples: relational tables (structured), JSON/XML logs (semi-structured), free-text emails and images (unstructured).

What a great answer covers:

Mention pypdf (the maintained successor to PyPDF2) or pdfplumber for native PDFs, Tesseract/PaddleOCR for scanned images, and a batching approach for scale.

What a great answer covers:

A strong answer covers listing objects, sampling file types/sizes/extensions, reading metadata headers, and building a summary distribution.
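
A candidate might illustrate the summary-distribution step with a short sketch like the one below. It assumes the object listing has already been fetched as (key, size in bytes) pairs; the function name `summarize_inventory` and the sample keys are hypothetical.

```python
from collections import Counter, defaultdict
from pathlib import PurePosixPath


def summarize_inventory(objects):
    """Group (key, size_bytes) pairs by file extension.

    Returns {extension: {"count": int, "total_bytes": int}}.
    """
    counts = Counter()
    sizes = defaultdict(int)
    for key, size in objects:
        # Lowercase the suffix so ".PDF" and ".pdf" land in one bucket.
        ext = PurePosixPath(key).suffix.lower() or "(none)"
        counts[ext] += 1
        sizes[ext] += size
    return {ext: {"count": counts[ext], "total_bytes": sizes[ext]} for ext in counts}


# Hypothetical listing standing in for a real bucket or file-share scan.
listing = [
    ("archive/2019/report.PDF", 120_000),
    ("archive/2019/scan_001.tiff", 4_500_000),
    ("logs/app.log", 30_000),
    ("logs/old/app.log", 45_000),
    ("exports/README", 1_200),
]
summary = summarize_inventory(listing)
for ext, stats in sorted(summary.items()):
    print(ext, stats["count"], stats["total_bytes"])
```

On a real store the same function would be fed from a paginated listing call, and a sample of each extension would then be opened to read metadata headers.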

Intermediate

10 questions
What a great answer covers:

Discuss scoring axes: data volume, estimated business value, compliance risk, processing cost, and stakeholder demand. Show a weighted matrix approach.
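
The weighted matrix can be shown in a few lines. This is a minimal sketch: the weights, the 1-5 rating scale, the decision to invert processing cost (so cheaper sources score higher), and the two candidate sources are all illustrative assumptions rather than a recommended calibration.

```python
# Hypothetical weights over the five scoring axes; they sum to 1.0.
WEIGHTS = {
    "volume": 0.15,
    "business_value": 0.30,
    "compliance_risk": 0.25,
    "processing_cost": 0.15,
    "stakeholder_demand": 0.15,
}


def priority_score(ratings):
    """Weighted sum of 1-5 ratings; processing cost is inverted (assumption)."""
    adjusted = dict(ratings)
    adjusted["processing_cost"] = 6 - ratings["processing_cost"]
    return sum(WEIGHTS[axis] * adjusted[axis] for axis in WEIGHTS)


# Hypothetical dark data sources rated by the analyst and stakeholders.
candidates = {
    "email_archive": {
        "volume": 4, "business_value": 3, "compliance_risk": 5,
        "processing_cost": 2, "stakeholder_demand": 3,
    },
    "sensor_logs": {
        "volume": 5, "business_value": 4, "compliance_risk": 1,
        "processing_cost": 4, "stakeholder_demand": 2,
    },
}
ranked = sorted(candidates, key=lambda name: priority_score(candidates[name]), reverse=True)
for name in ranked:
    print(name, round(priority_score(candidates[name]), 2))
```

Whether high compliance risk should raise or lower priority depends on the organization; here it raises priority on the assumption that risky data should be examined first.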

What a great answer covers:

Explain the index-retrieve-generate pattern, why it beats fine-tuning for heterogeneous corpora, and how it provides grounded, citable answers.

What a great answer covers:

Discuss semantic chunking over fixed-size splits, overlap windows, section-aware splitting, and the trade-off between retrieval granularity and context.
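
To make the overlap idea concrete, a candidate could sketch paragraph-aware chunking as below. This is a simplified stand-in for true semantic chunking: it splits on blank lines, packs whole paragraphs up to a character budget, and repeats the last paragraph of each chunk at the start of the next. The function name and default sizes are assumptions.

```python
def chunk_paragraphs(text, max_chars=200, overlap_paragraphs=1):
    """Pack whole paragraphs into chunks, carrying trailing paragraphs forward.

    A single paragraph is never split, so a chunk can exceed max_chars
    when one paragraph is longer than the budget.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], []
    for para in paragraphs:
        candidate = current + [para]
        if current and len("\n\n".join(candidate)) > max_chars:
            chunks.append("\n\n".join(current))
            # Keep the tail of the finished chunk as the overlap window.
            current = current[-overlap_paragraphs:] if overlap_paragraphs else []
        current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks


sample_text = "A" * 50 + "\n\n" + "B" * 50 + "\n\n" + "C" * 50
chunks = chunk_paragraphs(sample_text, max_chars=110, overlap_paragraphs=1)
print(len(chunks))
```

Smaller budgets give finer retrieval granularity but less context per chunk, which is the trade-off the answer should name.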

What a great answer covers:

Cover NER basics (person, org, location, product), choosing pre-trained vs. fine-tuned models, handling noisy text, and output schema design.

What a great answer covers:

Discuss automated profiling with Great Expectations, schema inference, encoding detection (chardet), normalization rules, and quarantine pipelines for unparseable records.

What a great answer covers:

Cover provenance, schema, freshness date, file format, owner, PII flags, quality score, business domain tags, and linkage to upstream systems.

What a great answer covers:

Discuss human-in-the-loop spot checks, confidence scoring, cross-validation against known ground truth samples, and citation tracing.

What a great answer covers:

Cover model choice rationale, preprocessing steps, hyperparameter tuning (number of topics), and how you'd present topics to non-technical stakeholders.

What a great answer covers:

Discuss accuracy on complex layouts, cost at scale, latency, language support, and the hybrid approach of routing by document complexity.

What a great answer covers:

Cover stratified sampling by data type and source system, confidence intervals, progressive expansion, and how you'd report coverage to stakeholders.
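
Stratified sampling by data type and source system can be sketched with the standard library. The record layout, the 10% fraction, and the minimum of one record per stratum are illustrative assumptions.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(records, key, fraction, seed=0, min_per_stratum=1):
    """Sample the same fraction from every stratum, with a per-stratum floor."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    strata = defaultdict(list)
    for rec in records:
        strata[key(rec)].append(rec)
    sample = []
    for name in sorted(strata):
        members = strata[name]
        k = min(len(members), max(min_per_stratum, round(len(members) * fraction)))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical inventory: file type and source system define the strata.
records = (
    [{"id": f"pdf-{i}", "type": "pdf", "source": "crm"} for i in range(100)]
    + [{"id": f"csv-{i}", "type": "csv", "source": "erp"} for i in range(20)]
    + [{"id": f"tif-{i}", "type": "tiff", "source": "scanner"} for i in range(3)]
)
sample = stratified_sample(records, key=lambda r: (r["type"], r["source"]), fraction=0.1)
per_type = Counter(r["type"] for r in sample)
print(dict(per_type))
```

The floor guarantees that small strata, such as the three scanner files here, still appear in the sample; progressive expansion would rerun this with a larger fraction.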

Advanced

10 questions
What a great answer covers:

Discuss data asset valuation, anonymization pipelines, API-based data product delivery, marketplace integration, usage analytics, and compliance gating.

What a great answer covers:

Cover cost, data volume requirements, latency, hallucination risk, freshness of knowledge, and the hybrid fine-tune-plus-RAG approach for maximum accuracy.

What a great answer covers:

Discuss modality-specific extractors, unified embedding space (CLIP-style), cross-modal retrieval, and the orchestration challenges of heterogeneous processing.

What a great answer covers:

Cover time-series feature engineering, unsupervised models (isolation forest, autoencoders), alerting thresholds, and how LLMs can help interpret anomalies in natural language.

What a great answer covers:

Discuss PII detection models (Microsoft Presidio, spaCy NER), redaction-before-inference pipelines, on-premises LLM options, and audit logging.
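
The redaction-before-inference idea can be illustrated with a deliberately small sketch. It uses two regular expressions as a stand-in for a real detector such as Presidio, so it only catches simple email addresses and one phone number format; the patterns, placeholder labels, and function name are assumptions.

```python
import re

# Illustrative patterns only; a production pipeline would use a dedicated
# PII detection model rather than two regexes.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def redact(text):
    """Replace matches with placeholders and return (text, counts per label)."""
    counts = {}
    for label, pattern in PATTERNS.items():
        text, n = pattern.subn(f"<{label}>", text)
        counts[label] = n
    return text, counts


clean, audit = redact("Contact jane.doe@example.com or 555-123-4567.")
print(clean)
print(audit)
```

Only the redacted text would be sent to the LLM, while the counts (never the raw values) go to the audit log.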

What a great answer covers:

Cover indexing algorithm (HNSW vs IVF), recall/latency benchmarks, filtering capabilities, cost, managed vs. self-hosted, and incremental indexing support.

What a great answer covers:

Discuss graph-based lineage tracking, integration with dbt/Airflow metadata, provenance chains for LLM-generated summaries, and audit-readiness for regulated industries.

What a great answer covers:

Discuss schema inference libraries, LLM-assisted column mapping, canonical data models, and the role of a semantic layer in unifying heterogeneous fields.

What a great answer covers:

Cover calibrated probability outputs, ensemble agreement across multiple models, retrieval score as a proxy for grounding, and human calibration loops.
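
Ensemble agreement is the easiest of these signals to demonstrate. In the sketch below, the same field is extracted by several models and the share of votes for the majority value serves as a rough confidence; the 0.75 review threshold and the sample values are assumptions.

```python
from collections import Counter


def ensemble_confidence(predictions):
    """Return (majority_value, fraction of models that agree with it)."""
    counts = Counter(predictions)
    value, votes = counts.most_common(1)[0]
    return value, votes / len(predictions)


# Hypothetical outputs from three extraction models for one contract date.
value, agreement = ensemble_confidence(["2024-01-01", "2024-01-01", "2024-01-10"])
needs_review = agreement < 0.75  # assumed threshold for human review
print(value, round(agreement, 2), needs_review)
```

Agreement is not calibrated probability: models can agree and all be wrong, which is why the answer should also mention human calibration loops.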

What a great answer covers:

Discuss scheduled ingestion DAGs, incremental processing, drift detection in source data, alerting on pipeline failures, and continuous retraining of extraction models.

Scenario-Based

10 questions
What a great answer covers:

Cover data extraction and de-identification, NER for symptoms/conditions, temporal pattern analysis, clinical validation partnerships, and IRB/ethical considerations.

What a great answer covers:

Discuss PII/PHI scanning, sentiment and intent classification, communication network analysis, regulatory keyword detection, and a phased risk-scoring report.

What a great answer covers:

Cover telemetry profiling, feature engineering on raw sensor streams, anomaly detection model training, validation with known failure events, and deployment to a monitoring dashboard.

What a great answer covers:

Discuss retrieval quality audit (are the right chunks being retrieved?), chunk size tuning, stricter citation requirements, retrieval score thresholds, and fallback-to-human workflows.

What a great answer covers:

Cover language detection, multilingual NER models (XLM-R, mBERT), cross-language topic modeling, translation-as-a-preprocessing option, and unified taxonomy design.

What a great answer covers:

Discuss JSON schema inference at scale, sampling and profiling, identifying key fields vs. noise, building a canonical schema, and creating a data dictionary.
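
Profiling a sample of records to separate key fields from noise might look like the sketch below. It assumes newline-delimited JSON where every record is a flat object; nested structures would need a recursive walk. The function name and the sample lines are hypothetical.

```python
import json
from collections import Counter, defaultdict


def infer_field_profile(json_lines):
    """Report how often each field appears and which Python types it holds."""
    total = 0
    types = defaultdict(Counter)
    for line in json_lines:
        record = json.loads(line)
        total += 1
        for field, value in record.items():
            types[field][type(value).__name__] += 1
    return {
        field: {"presence": sum(c.values()) / total, "types": dict(c)}
        for field, c in types.items()
    }


# Hypothetical sample drawn from a much larger log.
sample_lines = [
    '{"id": 1, "msg": "a"}',
    '{"id": 2, "msg": "b", "debug": true}',
    '{"id": "3", "msg": null}',
    '{"id": 4}',
]
profile = infer_field_profile(sample_lines)
for field, stats in profile.items():
    print(field, stats)
```

Fields with high presence and one dominant type are candidates for the canonical schema; rare or mixed-type fields are flagged in the data dictionary for review.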

What a great answer covers:

Cover OCR/HTR (handwriting recognition) pipeline, quality scoring, human-in-the-loop correction, Elasticsearch indexing, accessibility requirements, and cost estimation.

What a great answer covers:

Emphasize immediate escalation to legal/compliance, documentation of findings, no further analysis until policy review, and recommending a transparent remediation plan.

What a great answer covers:

Discuss volume limits, data privacy concerns with public APIs, need for structured pipelines rather than ad-hoc queries, phased approach, and proof-of-concept framing.

What a great answer covers:

Discuss automated quality scoring (WER estimation), confidence-weighted analysis, prioritizing high-quality transcripts first, and investing in a custom ASR model for the poor-quality audio domain.
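
Word error rate (WER) is the word-level edit distance between a reference and a hypothesis transcript, divided by the reference length. Computing it requires reference text, so the sketch below assumes a small hand-transcribed sample is available; scoring transcripts with no reference at all would need a separate estimation model.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate("the quick brown fox", "the quack brown")
print(wer)
```

Transcripts can then be ordered by this score so that analysis starts with the highest-quality calls.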

AI Workflow & Tools

10 questions
What a great answer covers:

Cover document loaders, text splitters, embedding model choice, vector store, retriever configuration (similarity, MMR), prompt template, chain type, and output parsers.

What a great answer covers:

Discuss dataset creation and annotation, model selection (BERT-base vs. domain-specific), training with Trainer API, evaluation metrics (precision/recall/F1), and deployment via Inference API.

What a great answer covers:

Cover task definitions (sensor for S3, extraction operator, LLM API call task, Snowflake load operator), dependencies, retry logic, and alerting on failures.

What a great answer covers:

Discuss defining a JSON schema for contract fields, crafting system prompts, handling partial extractions, retrying on malformed outputs, and validating against the schema programmatically.
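
The validate-and-retry loop can be sketched without any LLM SDK. In the example below, `call_model` is a hypothetical callable that takes the document plus the previous validation errors and returns raw text; the four contract fields are illustrative, and the hand-written type check stands in for a full JSON Schema validator.

```python
import json

# Hypothetical contract fields and their expected JSON types.
REQUIRED_FIELDS = {
    "party_a": str,
    "party_b": str,
    "effective_date": str,
    "term_months": int,
}


def validate_contract(raw):
    """Parse raw model output; return (record or None, list of errors)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return None, ["top-level value must be an object"]
    errors = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in data:
            errors.append(f"missing field: {field}")
        elif not isinstance(data[field], expected):
            errors.append(f"wrong type for {field}")
    return (data if not errors else None), errors


def extract_with_retry(call_model, document, max_attempts=3):
    """Call the model until its output validates, feeding errors back each time."""
    feedback = []
    for _ in range(max_attempts):
        record, errors = validate_contract(call_model(document, feedback))
        if record is not None:
            return record
        feedback = errors
    return None  # caller decides how to handle a persistent failure


# Fake model for demonstration: truncated JSON first, then a valid object.
responses = iter([
    '{"party_a": "Acme"',
    '{"party_a": "Acme", "party_b": "Globex", '
    '"effective_date": "2024-01-01", "term_months": 12}',
])
result = extract_with_retry(lambda doc, feedback: next(responses), "contract text")
print(result)
```

Returning `None` after the last attempt leaves room for a partial-extraction path, for example routing the document to manual review.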

What a great answer covers:

Discuss reciprocal rank fusion, querying both systems in parallel, merging and re-ranking results, and tuning the keyword-vs-semantic weight balance.
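
Reciprocal rank fusion scores each document as the sum of 1 / (k + rank) over every ranking in which it appears. The sketch below uses the conventional constant k = 60; the optional per-ranking weights are an added assumption to show how the keyword-vs-semantic balance could be tuned, and the document IDs are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, weights=None):
    """Merge ranked lists of document IDs into one list, best first."""
    weights = weights or [1.0] * len(rankings)
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (k + rank)
    # Break ties on the ID so the output order is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


keyword_hits = ["d1", "d2", "d3"]   # e.g. from a keyword search engine
semantic_hits = ["d3", "d1", "d4"]  # e.g. from a vector store
fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
print(fused)
```

Because the method uses only ranks, it sidesteps the problem that keyword scores and vector similarities live on different scales.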

What a great answer covers:

Cover defining expectation suites (column types, value ranges, null thresholds, uniqueness), checkpoint configuration, and integrating validation into the Airflow DAG as a quality gate.

What a great answer covers:

Discuss Prefect flows and tasks, dynamic task generation, conditional branching (if PDF → OCR path, if JSON → parse path), and results caching for idempotency.

What a great answer covers:

Cover building a golden test set of (query, expected document) pairs, measuring recall@k, MRR, and nDCG, and using tools like RAGAS or custom evaluation scripts.
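
A custom evaluation script for these metrics is short enough to write in an interview. The sketch below assumes binary relevance (a document is either expected or not); MRR is the mean of `reciprocal_rank` across all queries in the golden set, and the sample IDs are hypothetical.

```python
import math


def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved, relevant, k):
    """Binary-relevance nDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(retrieved[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


# One golden-set entry: what the retriever returned vs. what was expected.
retrieved = ["a", "b", "c", "d"]
relevant = {"b", "d"}
print(recall_at_k(retrieved, relevant, 3))
print(reciprocal_rank(retrieved, relevant))
print(round(ndcg_at_k(retrieved, relevant, 4), 3))
```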

What a great answer covers:

Discuss how the engine decomposes a complex question into sub-questions, routes each to the appropriate tool/index, and synthesizes the final answer from sub-answers.

What a great answer covers:

Discuss correction UI, storing corrections as training data, periodic fine-tuning or few-shot example updates, and tracking extraction accuracy metrics over time.

Behavioral

5 questions
What a great answer covers:

Look for investigative curiosity, persistence through messy data, stakeholder persuasion skills, and measurable business impact.

What a great answer covers:

Assess ability to translate technical findings into business language, use of visuals, framing in terms of ROI or risk, and handling of follow-up questions.

What a great answer covers:

Look for adaptability, creative problem-solving, willingness to change scope, and transparent communication about limitations.

What a great answer covers:

Assess learning habits (communities, papers, courses), practical application of new knowledge, and ability to distinguish hype from genuine utility.

What a great answer covers:

Look for respectful disagreement backed by evidence, willingness to compromise, and focus on shared goals rather than being right.