Interview Prep
AI Master Data Management Specialist Interview Questions
50 expert questions covering beginner fundamentals to advanced AI workflow scenarios. Each answer includes a hint for structured responses.
Beginner
5 questions
Explain that master data represents core business entities (customer, product, supplier, location) that are shared across systems and change infrequently, whereas transactional data records business events and reference data provides classification codes.
Describe it as the single, authoritative, best version of a master data entity created by merging or selecting from multiple source records using survivorship rules.
Convey that without MDM, the same entity exists in conflicting forms across systems, leading to inconsistent reporting, poor customer experience, compliance risk, and duplicated effort.
Explain that data stewards are business-side owners responsible for data quality within their domain, approving match/merge decisions, and enforcing governance policies.
Cover accuracy, completeness, consistency, timeliness, uniqueness, and validity - pick three and explain why each matters for golden records.
Intermediate
10 questions
Discuss m-probabilities (agreement when records match), u-probabilities (agreement by chance), comparison vectors, and how a composite weight determines match/non-match/maybe classification.
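A minimal sketch of how m- and u-probabilities combine into a composite weight may help when answering this. The field names, probability values, and the two thresholds are illustrative assumptions, not values from any particular MDM product.

```python
import math

# Hypothetical per-field parameters (illustrative values only).
# m = P(field agrees | the two records are a true match)
# u = P(field agrees | the two records are NOT a match)
FIELD_PARAMS = {
    "surname": {"m": 0.95, "u": 0.01},
    "zip": {"m": 0.90, "u": 0.05},
    "birth_year": {"m": 0.98, "u": 0.02},
}


def field_weight(m, u, agrees):
    """Log2 likelihood ratio contributed by a single field comparison."""
    if agrees:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))


def composite_weight(comparison_vector):
    """Sum field weights over a comparison vector {field: agrees?}."""
    return sum(
        field_weight(FIELD_PARAMS[f]["m"], FIELD_PARAMS[f]["u"], agrees)
        for f, agrees in comparison_vector.items()
    )


def classify(weight, lower=0.0, upper=10.0):
    """Two thresholds split the weight axis into three zones (assumed values)."""
    if weight >= upper:
        return "match"
    if weight <= lower:
        return "non-match"
    return "maybe"


all_agree = {"surname": True, "zip": True, "birth_year": True}
partial = {"surname": True, "zip": True, "birth_year": False}
none_agree = {"surname": False, "zip": False, "birth_year": False}

labels = [classify(composite_weight(v)) for v in (all_agree, partial, none_agree)]
```

Agreement on a field with a low u-probability (a rare coincidence) adds a large positive weight, while disagreement on a field with a high m-probability subtracts heavily.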
Explain that blocking reduces the comparison space by grouping records into candidate pairs using keys like first-letter-of-surname + zip code, Soundex codes, or n-grams, making matching computationally feasible.
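The first-letter-of-surname + zip code key mentioned above can be sketched as follows. The record layout and the `blocking_key` and `candidate_pairs` helpers are hypothetical names used for illustration.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    """First letter of surname (upper-cased) plus zip code."""
    return (record["surname"][:1].upper(), record["zip"])


def candidate_pairs(records):
    """Only records sharing a blocking key are compared with each other."""
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record["id"])
    pairs = []
    for ids in blocks.values():
        pairs.extend(combinations(sorted(ids), 2))
    return sorted(pairs)


records = [
    {"id": 1, "surname": "Smith", "zip": "10001"},
    {"id": 2, "surname": "Smyth", "zip": "10001"},
    {"id": 3, "surname": "Jones", "zip": "10001"},
    {"id": 4, "surname": "smith", "zip": "94105"},
]

pairs = candidate_pairs(records)
all_pairs = len(records) * (len(records) - 1) // 2
```

Here four records would need six comparisons without blocking, and the key leaves a single candidate pair. The trade-off is recall: a true duplicate whose zip code was mistyped lands in a different block and is never compared.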
Discuss strategies like most-trusted-source, most-recent, completeness-based selection, and manual override, and explain how they are configured in MDM hubs to produce the golden record.
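As a rough illustration, field-level survivorship can be expressed as a rule per attribute. The source names, trust ranking, and rule table below are assumptions made up for the example; real MDM hubs configure this through their own tooling.

```python
from datetime import date

# Hypothetical trust ranking: higher number = more trusted source.
SOURCE_TRUST = {"crm": 3, "erp": 2, "web_form": 1}

# Hypothetical per-field survivorship rules.
RULES = {
    "legal_name": "most_trusted",
    "email": "most_recent",
    "phone": "most_recent",
}


def survive(records, rules=RULES):
    """Build a golden record by picking one winning value per field."""
    golden = {}
    for field, rule in rules.items():
        # Completeness check: empty values never win.
        candidates = [r for r in records if r.get(field)]
        if not candidates:
            golden[field] = None
            continue
        if rule == "most_trusted":
            winner = max(candidates, key=lambda r: SOURCE_TRUST[r["source"]])
        elif rule == "most_recent":
            winner = max(candidates, key=lambda r: r["updated"])
        else:
            raise ValueError(f"unknown survivorship rule: {rule}")
        golden[field] = winner[field]
    return golden


source_records = [
    {"source": "crm", "updated": date(2023, 1, 5),
     "legal_name": "Acme Corp.", "email": "old@acme.example", "phone": None},
    {"source": "web_form", "updated": date(2024, 3, 1),
     "legal_name": "ACME", "email": "new@acme.example", "phone": "555-0100"},
]

golden = survive(source_records)
```

The golden record takes the legal name from the trusted CRM but the email and phone from the more recent web form, which is the kind of per-field reasoning interviewers tend to probe.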
Explain that registry stores minimal cross-references, coexistence centralizes golden records but syncs back to sources, and transactional hub is the system of entry; choice depends on latency requirements and source system autonomy.
Describe modeling customers, households, products, and suppliers as nodes with typed edges (purchases, belongs_to, supplies) to traverse relationships that are awkward in relational schemas, such as corporate hierarchies.
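To make the hierarchy traversal concrete, here is a small sketch that uses a plain adjacency dictionary as a stand-in for a graph database. The company names and the `OWNS` edge map are invented for illustration.

```python
from collections import deque

# Hypothetical ownership edges: parent -> directly owned entities.
OWNS = {
    "GlobalCo": ["EuroCo", "AsiaCo"],
    "EuroCo": ["EuroCo GmbH"],
    "AsiaCo": [],
    "EuroCo GmbH": [],
}


def descendants(root, edges):
    """Breadth-first walk returning every entity below root, at any depth."""
    seen = {root}  # guards against cycles in dirty hierarchy data
    order = []
    queue = deque(edges.get(root, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        queue.extend(edges.get(node, []))
    return order


subsidiaries = descendants("GlobalCo", OWNS)
```

In a relational schema the same question usually needs a recursive query or repeated self-joins, which is why hierarchies are a common argument for a graph model.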
Explain lineage documents where golden record fields originate, what transformations they undergo, and which downstream systems consume them - tools like Collibra, Alation, or OpenLineage.
Describe using dbt as the transformation layer to standardize, deduplicate, and model master data tables with version-controlled SQL, built-in testing (freshness, uniqueness), and documentation generation.
Cover precision (the share of predicted matches that are true matches), recall (the share of true matches the model actually finds), F1 score (their harmonic mean), and practical metrics like false merge rate from data steward audits and match completeness across domains.
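A short sketch of pairwise evaluation, assuming predicted and true matches are both available as sets of record-ID pairs (the pairs below are made up). Precision is computed as true positives over all predicted matches, and recall as true positives over all true matches.

```python
def pair_metrics(predicted, actual):
    """Precision, recall, and F1 over unordered record-ID pairs."""
    predicted = {tuple(sorted(p)) for p in predicted}
    actual = {tuple(sorted(p)) for p in actual}
    tp = len(predicted & actual)   # correctly matched pairs
    fp = len(predicted - actual)   # false merges
    fn = len(actual - predicted)   # missed duplicates
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}


predicted_pairs = [(1, 2), (4, 3), (5, 6), (7, 8)]
true_pairs = [(1, 2), (3, 4), (9, 10)]

metrics = pair_metrics(predicted_pairs, true_pairs)
```

With two of four predictions correct and two of three true matches found, precision is 0.5 and recall is about 0.67. In MDM, low precision means false merges, which are usually costlier to undo than missed duplicates.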
Discuss transliteration, phonetic algorithms for non-Latin scripts, tokenization differences in CJK languages, and the use of multilingual sentence embeddings from models like LaBSE.
Include completeness (required fields populated), validity (values conform to allowed ranges), uniqueness (duplicate rate), timeliness (latency of updates from source), consistency (cross-field logic), and accuracy (compared to ground truth).
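Three of these dimensions can be sketched as a small scorecard. The required fields, the deliberately loose email pattern, and the sample records are assumptions for illustration; timeliness, consistency, and accuracy need timestamps, cross-field rules, and ground truth that this toy example does not have.

```python
import re

# Deliberately loose pattern: something@something.something
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def scorecard(records, required=("id", "email")):
    """Completeness, validity, and duplicate rate for a batch of records."""
    if not records:
        raise ValueError("cannot score an empty batch")
    complete = sum(
        all(r.get(f) not in (None, "") for f in required) for r in records
    )
    emails = [r["email"] for r in records if r.get("email")]
    valid = sum(bool(EMAIL_RE.match(e)) for e in emails)
    distinct = len({e.lower() for e in emails})
    return {
        "completeness": complete / len(records),
        "validity": valid / len(emails) if emails else 1.0,
        "duplicate_rate": 1 - distinct / len(emails) if emails else 0.0,
    }


batch = [
    {"id": 1, "email": "a@x.example"},
    {"id": 2, "email": "A@x.example"},   # duplicate once case is ignored
    {"id": 3, "email": "not-an-email"},  # populated but invalid
    {"id": 4, "email": None},            # incomplete
]

scores = scorecard(batch)
```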
Advanced
10 questions
Describe a lambda or kappa architecture where batch processes do full re-matching nightly while a real-time layer uses pre-computed blocking indexes and lightweight ML models for low-latency lookups during data entry.
Explain: initial model trained on labeled pairs → model scores new candidate pairs → uncertain pairs (confidence in 'maybe' zone) are routed to data stewards for labeling → labels feed back into retraining → blocking strategies also evolve based on miss analysis.
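The routing step of this loop is often implemented as uncertainty sampling. The sketch below assumes the model emits a match probability per pair; the zone boundaries, review budget, and scores are hypothetical.

```python
def select_for_review(scored_pairs, low=0.3, high=0.7, budget=2):
    """Pick the pairs the model is least sure about, up to a steward budget."""
    uncertain = [
        (pair, score)
        for pair, score in scored_pairs.items()
        if low <= score <= high
    ]
    # Closest to 0.5 first; the pair itself breaks ties deterministically.
    uncertain.sort(key=lambda item: (abs(item[1] - 0.5), item[0]))
    return [pair for pair, _ in uncertain[:budget]]


scores = {
    ("a", "b"): 0.95,  # confident match, no review needed
    ("c", "d"): 0.55,
    ("e", "f"): 0.35,
    ("g", "h"): 0.48,
    ("i", "j"): 0.05,  # confident non-match
}

review_queue = select_for_review(scores)
```

Steward labels on the queued pairs would then be appended to the training set before the next retraining run, so labeling effort goes where the model learns most.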
Discuss propagating erasure requests across all connected source systems, handling referential integrity (anonymize vs. delete), audit logging, and designing consent-aware golden records where PII fields can be selectively masked without breaking downstream analytics.
Traditional algorithms are fast, interpretable, and work well for structured fields like names and addresses; embeddings capture semantic similarity for unstructured text and multilingual matching but add latency and require GPU infrastructure; often you combine both in an ensemble.
Describe a hub with domain-specific matching microservices, per-domain stewardship workflows, shared infrastructure (catalog, quality monitoring, API gateway), and configurable survivorship per domain - all governed by a central data governance council.
Cover tiered matching (high-confidence auto-merge, medium-confidence steward review, low-confidence reject), hard-match constraints (e.g., SSN exact match required), temporal awareness, and post-merge monitoring with rollback capability.
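One possible reading of tiered matching with a hard constraint is sketched below. The thresholds, the `national_id` field name, and the choice to require an exact ID match only for auto-merge are assumptions, not a prescribed design.

```python
def merge_decision(score, rec_a, rec_b, auto=0.95, review=0.70):
    """Return 'auto_merge', 'steward_review', or 'reject' for a record pair."""
    id_a, id_b = rec_a.get("national_id"), rec_b.get("national_id")

    # Hard constraint: conflicting identifiers block the merge outright.
    if id_a and id_b and id_a != id_b:
        return "reject"
    if score < review:
        return "reject"

    # Auto-merge needs both a high score and an exact identifier match;
    # a high score with a missing identifier is downgraded to review.
    ids_match = bool(id_a) and id_a == id_b
    if score >= auto and ids_match:
        return "auto_merge"
    return "steward_review"


decisions = [
    merge_decision(0.97, {"national_id": "123"}, {"national_id": "123"}),
    merge_decision(0.97, {"national_id": "123"}, {"national_id": None}),
    merge_decision(0.99, {"national_id": "123"}, {"national_id": "456"}),
    merge_decision(0.80, {"national_id": "123"}, {"national_id": "123"}),
    merge_decision(0.50, {"national_id": "123"}, {"national_id": "123"}),
]
```

Checking the hard constraint before the score means a very confident model can never override a conflicting identifier, which keeps the riskiest false merges out of the automatic path.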
Explain shadow-matching (run new model in parallel without affecting production), canary releases on a subset of domains, golden record snapshots for rollback, and comparison dashboards tracking precision/recall of v1 vs. v2.
Discuss using LLMs to draft glossary entries from schema metadata and data samples, human-in-the-loop review, embedding-based deduplication across departments, and continuous drift detection where schema changes trigger LLM re-generation proposals.
Address onboarding new source systems rapidly, cross-company entity resolution (no shared keys), cultural differences in data standards, phased consolidation strategies, and maintaining a transitional 'bridge' layer while full MDM integration is completed.
Describe federated governance with central standards, domain-owned MDM services publishing golden records as data products, a central catalog for discovery, and automated contract testing that validates cross-domain consistency.
Scenario-Based
10 questions
Walk through data profiling, defining match keys (email, phone, loyalty ID), implementing probabilistic matching with blocking (email domain + first 3 chars of name), training a classifier on steward-labeled pairs, setting survivorship rules per field, and deploying golden records to a CDP.
Emphasize the critical safety implications, describe adding hard constraints (insurance ID, MRN), implementing a 'review required' threshold for healthcare matches, involving clinical data stewards, and designing reversible merges with full audit trails.
Describe checking schema changes, data format shifts (new encoding, null handling), profiling the new source feed, comparing blocking key distributions before and after, retraining the matching model if needed, and establishing pre-integration data contract validation.
Discuss a global customer master with jurisdiction-aware attributes, configurable matching per region (e.g., Aadhaar in India, SSN in US), regulatory rule engines that flag incomplete records per jurisdiction, and integration with sanctions/PEP screening APIs.
Explain raising the auto-merge confidence threshold, implementing active learning so the model learns from steward decisions, using LLMs to generate natural-language explanations for match suggestions (increasing steward confidence), and adding batch approval workflows for high-confidence sets.
Cover parallel-run phase, data mapping between legacy and new models, re-validation of matching rules (don't assume old rules transfer directly), stakeholder UAT, phased domain migration, and rollback planning.
Describe cataloging all product hierarchies, using NLP to extract and align attributes from free-text descriptions, building a canonical product model, implementing classification-based matching (not just dedup), and establishing a product data governance council with regional representatives.
Examine match/merge accuracy trends, check if new false merges are introducing noise (mixing behavior of two different customers), validate field-level completeness and freshness, and set up MDM quality metrics as features or filters in the ML pipeline.
Cover deduplication rate (e.g., reducing 15M records to 10.5M is a 30% reduction, with a roughly proportional saving in mailing costs), improved match accuracy driving revenue uplift in personalized marketing, compliance fine avoidance, reduced data steward hours, and before/after data quality scores.
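The arithmetic behind the deduplication example is worth being able to reproduce on a whiteboard. The record counts come from the hint above; the cost per mailing and the mailing frequency are hypothetical figures added only to show how a saving would be derived.

```python
def dedup_rate(records_before, records_after):
    """Fraction of records removed by deduplication."""
    if records_before <= 0:
        raise ValueError("records_before must be positive")
    return (records_before - records_after) / records_before


RECORDS_BEFORE = 15_000_000
RECORDS_AFTER = 10_500_000

# Hypothetical cost assumptions, not figures from the source.
COST_PER_MAILING = 0.60   # currency units per mailed piece
MAILINGS_PER_YEAR = 4

rate = dedup_rate(RECORDS_BEFORE, RECORDS_AFTER)
records_removed = RECORDS_BEFORE - RECORDS_AFTER
annual_savings = records_removed * COST_PER_MAILING * MAILINGS_PER_YEAR
```

The 30% figure is a reduction in record count; it equals the mailing cost saving only if every record was being mailed at the same unit cost.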
Describe a batch layer for full re-matching and golden record publication, a real-time API layer backed by an indexed golden record store (Elasticsearch or Redis), event-driven updates that propagate changes to the real-time layer, and consistency guarantees between the two.
AI Workflow & Tools
10 questions
Describe generating embeddings for all product descriptions using a model like all-MiniLM-L6-v2, computing cosine similarity, applying a threshold, evaluating with a manually labeled sample, and potentially fine-tuning on domain-specific product pairs.
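The similarity-and-threshold step can be sketched without any model dependency. The three-dimensional vectors below are toy stand-ins for real embeddings (which a sentence-embedding model would produce with hundreds of dimensions), and the 0.9 threshold is an assumption that would need tuning against a labeled sample.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similar_pairs(embeddings, threshold=0.9):
    """Return ID pairs whose embedding similarity meets the threshold."""
    return [
        (id_a, id_b)
        for id_a, id_b in combinations(sorted(embeddings), 2)
        if cosine(embeddings[id_a], embeddings[id_b]) >= threshold
    ]


# Toy vectors standing in for model output (hypothetical values).
toy_embeddings = {
    "SKU-1": [0.9, 0.1, 0.0],
    "SKU-2": [0.8, 0.2, 0.1],
    "SKU-3": [0.0, 0.1, 0.95],
}

duplicates = similar_pairs(toy_embeddings)
```

Comparing every pair is quadratic in the number of products, so at catalog scale this step is normally combined with blocking or an approximate nearest-neighbor index.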
Explain a LangChain agent that takes a record pair, retrieves the matching model's comparison vector, uses an LLM to translate feature weights into a natural-language explanation ('These records matched primarily because the email addresses are identical and the names have 92% Jaro-Winkler similarity'), and suggests actions.
Explain few-shot prompting with taxonomy examples, structured output (JSON with category, subcategory, confidence), batch processing with rate limiting, human-in-the-loop review for low-confidence classifications, and fine-tuning on domain-specific labeled data if accuracy is insufficient.
Describe defining comparison columns, training the model with labeled matches, reviewing the match weights and u-probabilities, adjusting blocking rules to improve recall, using Splink's comparison viewer dashboard for quality assessment, and exporting deterministic match rules for production deployment.
Describe defining expectations (not_null on critical fields, unique on golden record keys, value_set on status fields, regex_match on email/phone), scheduling checkpoint runs after each pipeline execution, and routing failures to alerting via Slack or PagerDuty.
Describe modeling suppliers, parent companies, and subsidiaries as nodes with OWNS, SUBSIDIARY_OF edges; using Cypher queries for hierarchical traversal (e.g., 'find all Tier 2 suppliers under a Tier 1'); and using graph algorithms like PageRank to identify critical suppliers.
Explain labeling training data with product attributes (brand, material, dimensions, weight), fine-tuning a BERT-based NER model, evaluating with precision/recall per entity type, deploying as a microservice, and integrating the output into the MDM standardization pipeline.
Describe building dbt models for staging (source profiling), intermediate (standardization, blocking key generation), and mart (golden record) layers; using dbt tests for uniqueness, referential integrity, and freshness; and documenting models for data steward consumption.
Describe using LangChain with a SQL agent connected to the MDM monitoring database, a prompt template that translates natural language questions into quality metric queries, and a response format that includes visualizations (charts generated by matplotlib) alongside narrative explanations.
Explain profiling each source feed's expected distributions (field completeness, value ranges, record volumes), training an anomaly detection model (isolation forest or simple statistical thresholds), running checks at ingestion time, and quarantining suspect batches for steward review before they enter the MDM hub.
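The "simple statistical thresholds" option can be sketched as a z-score check on record volume. The history values, the threshold of 3 standard deviations, and the `route_batch` helper are illustrative assumptions; an isolation forest would replace `is_anomalous` when several features are monitored together.

```python
from statistics import mean, stdev


def is_anomalous(history, value, z_threshold=3.0):
    """Flag a value that sits too many standard deviations from the mean."""
    mu = mean(history)
    sigma = stdev(history)  # sample standard deviation; needs >= 2 points
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > z_threshold


def route_batch(history, batch_volume):
    """Quarantine suspect batches for steward review, pass the rest through."""
    return "quarantine" if is_anomalous(history, batch_volume) else "ingest"


# Hypothetical daily record counts from one source feed.
daily_volumes = [10_000, 10_200, 9_900, 10_100, 9_800]

normal_day = route_batch(daily_volumes, 10_150)
broken_feed = route_batch(daily_volumes, 4_000)
```

The same check applies to field completeness or the share of values outside an expected range, with one history series per profiled metric.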
Behavioral
5 questions
Look for the candidate tying MDM to business outcomes (revenue, compliance, customer experience), using concrete metrics, telling a compelling story, and demonstrating empathy for stakeholder priorities rather than just technical merits.
Expect discussion of urgency assessment, stakeholder communication, immediate containment, root cause analysis, remediation, and post-incident preventive measures - showing both technical skill and professional responsibility.
Look for respect for domain expertise, data-driven dialogue (showing them precision/recall metrics), willingness to incorporate their knowledge into model improvements, and collaborative approaches to resolving edge cases.
Assess the candidate's ability to scope MVP vs. ideal solution, communicate trade-offs, manage stakeholder expectations, and still deliver compliant results without cutting critical corners on data quality.
Expect mention of structured experimentation (POCs with metrics), reading papers/blogs, community participation, and a pragmatic approach - not chasing every new tool but having a systematic evaluation framework.