AI Master Data Management Specialist
An AI Master Data Management (MDM) Specialist ensures organizations maintain a single, authoritative, and AI-enhanced source of tr…
Skill Guide
Entity Resolution (ER) is the process of identifying, matching, and linking records that refer to the same real-world entity across disparate data sources, using probabilistic models to handle ambiguity and uncertainty in the matching criteria.
Scenario
You have two CSV files containing customer records from an acquired company and your own CRM. They have overlapping but inconsistent entries (typos, nicknames, different address formats).
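A first pass at this scenario can be sketched with the standard library alone. The nickname table, the sample names, and the 0.85 cutoff below are illustrative assumptions, not values from a real dataset; a production pipeline would read the CSV files and compare more fields than the name.

```python
from difflib import SequenceMatcher

# Hypothetical nickname table; a real project would use a much larger list.
NICKNAMES = {"bob": "robert", "bill": "william", "liz": "elizabeth"}


def normalize_name(name):
    """Lowercase, replace punctuation with spaces, and expand known nicknames."""
    cleaned = "".join(c if c.isalnum() or c.isspace() else " " for c in name.lower())
    return " ".join(NICKNAMES.get(token, token) for token in cleaned.split())


def similarity(a, b):
    """Character-level similarity in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize_name(a), normalize_name(b)).ratio()


# Stand-ins for the name columns of the two CSV files.
crm_names = ["Robert Smith", "Elizabeth Jones"]
acquired_names = ["Bob Smith", "Liz Jonse", "Maria Garcia"]

THRESHOLD = 0.85  # assumed cutoff; tune it against labeled pairs

candidate_matches = [
    (crm, acq)
    for crm in crm_names
    for acq in acquired_names
    if similarity(crm, acq) >= THRESHOLD
]
print(candidate_matches)
```

Comparing every record against every other record only works for small files; larger inputs need a blocking step to limit the number of comparisons.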
Scenario
Integrate product catalogs from three suppliers into a single master catalog. Records have varying attributes (SKU, name, description, weight) and no common unique identifier.
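One way to approach this scenario without a shared identifier is to score cross-supplier pairs on several attributes at once. The sketch below combines token overlap on the product name with weight closeness; the sample records, the 0.7/0.3 weighting, and the 0.8 cutoff are all assumptions chosen for illustration.

```python
import re
from itertools import combinations

# Hypothetical records from three suppliers.
products = [
    {"src": "A", "sku": "A-100", "name": "Stainless Steel Water Bottle 500ml", "weight_g": 300},
    {"src": "B", "sku": "WB500", "name": "Water Bottle, Stainless Steel, 500 ml", "weight_g": 305},
    {"src": "C", "sku": "9912", "name": "Steel Lunch Box", "weight_g": 450},
]


def tokens(text):
    """Split into lowercase word and number tokens, so '500ml' equals '500 ml'."""
    return set(re.findall(r"[a-z]+|\d+", text.lower()))


def jaccard(a, b):
    """Share of tokens the two names have in common."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def weight_closeness(w1, w2):
    """1.0 for identical weights, falling toward 0 as they diverge."""
    return 1.0 - abs(w1 - w2) / max(w1, w2)


def score(p, q):
    # Assumed weighting: the name carries most of the signal.
    return 0.7 * jaccard(p["name"], q["name"]) + 0.3 * weight_closeness(p["weight_g"], q["weight_g"])


THRESHOLD = 0.8  # assumed cutoff

matches = [
    (p["sku"], q["sku"])
    for p, q in combinations(products, 2)
    if p["src"] != q["src"] and score(p, q) >= THRESHOLD
]
print(matches)
```

Token-based comparison ignores word order, which helps when suppliers arrange the same attributes differently in the product name.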
Scenario
A financial institution needs to detect fraudsters using synthetic identities by linking application, transaction, and device data in near-real-time, where fraudsters deliberately manipulate names, addresses, and SSNs.
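A small building block for this scenario is an index of attribute values shared across applications, since a reused SSN or device is a stronger signal than a manipulated name. The sketch below uses made-up application records and hypothetical field names (`ssn`, `device`); a near-real-time system would maintain such an index incrementally rather than rebuild it.

```python
from collections import defaultdict

# Made-up application records for illustration only.
applications = [
    {"id": "app1", "name": "John Carter", "ssn": "123-45-6789", "device": "dev-a"},
    {"id": "app2", "name": "Jon Cartar", "ssn": "123-45-6789", "device": "dev-b"},
    {"id": "app3", "name": "Mia Wong", "ssn": "987-65-4321", "device": "dev-b"},
    {"id": "app4", "name": "Sam Lee", "ssn": "555-11-2222", "device": "dev-c"},
]


def shared_attribute_links(apps, keys=("ssn", "device")):
    """Return attribute values that appear on more than one application."""
    index = defaultdict(list)
    for app in apps:
        for key in keys:
            index[(key, app[key])].append(app["id"])
    return {value: ids for value, ids in index.items() if len(ids) > 1}


links = shared_attribute_links(applications)
for (field, value), ids in links.items():
    print(f"{field}={value} shared by {ids}")
```

Here app1 and app2 share an SSN while app2 and app3 share a device, so all three end up connected even though their names differ.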
Use `recordlinkage` for research and prototyping. `Splink` is production-grade for large-scale probabilistic matching and runs on several SQL backends, including DuckDB and Spark. `dedupe` excels at active learning with minimal labeled data. Spark MLlib suits custom, scalable ER pipelines in big data ecosystems.
Fellegi-Sunter is the foundational probabilistic framework. TF-IDF/BM25 are for unstructured text matching. LSH is critical for efficient blocking on high-dimensional or text data. Graph embeddings capture complex relational signals beyond pairwise matching.
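To make the Fellegi-Sunter idea concrete: each field has an m-probability (chance the field agrees when two records truly match) and a u-probability (chance it agrees when they do not). Agreement adds log2(m/u) to the pair's score and disagreement adds log2((1-m)/(1-u)). The m and u values below are invented for illustration; in practice they are estimated from data, for example with expectation-maximization.

```python
import math

# Hypothetical (m, u) probabilities per field.
PARAMS = {
    "surname": (0.95, 0.01),
    "birth_year": (0.90, 0.05),
    "postcode": (0.80, 0.02),
}


def match_weight(agreement):
    """Sum Fellegi-Sunter log2 weights over fields.

    `agreement` maps a field name to True (values agree) or False (they differ).
    """
    total = 0.0
    for field, agrees in agreement.items():
        m, u = PARAMS[field]
        if agrees:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


all_agree = match_weight({"surname": True, "birth_year": True, "postcode": True})
one_disagrees = match_weight({"surname": True, "birth_year": True, "postcode": False})
all_disagree = match_weight({"surname": False, "birth_year": False, "postcode": False})
print(all_agree, one_disagrees, all_disagree)
```

Pairs with a high total weight are treated as matches, pairs with a low weight as non-matches, and the band in between is a natural candidate for manual review.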
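The golden record defines the output goal. The precision-recall (P-R) trade-off is the core tuning lever. Transitive closure (if A=B and B=C, then A=C) turns pairwise matches into clusters, so understanding it helps contain error propagation: a single false match can chain unrelated records into one entity. A well-designed review queue is essential for continuous model improvement.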
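Transitive closure over pairwise matches is commonly computed with a union-find structure. The sketch below is a minimal version with made-up record IDs; it also shows how one extra false link pulls a previously separate record into an existing cluster.

```python
def cluster(records, matched_pairs):
    """Group records into entities by taking the transitive closure of matches."""
    parent = {r: r for r in records}

    def find(x):
        # Walk to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for r in records:
        groups.setdefault(find(r), set()).add(r)
    return sorted(sorted(group) for group in groups.values())


records = ["A", "B", "C", "D", "E"]

clean = cluster(records, [("A", "B"), ("B", "C")])
# Same matches plus one false link between C and D.
with_false_link = cluster(records, [("A", "B"), ("B", "C"), ("C", "D")])

print(clean)
print(with_false_link)
```

Because every link is trusted equally during clustering, low-confidence pairs are often routed to a review queue before they are allowed to join clusters.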
Answer Strategy
The core competency is translating technical constraints into business impact. Frame 'precision vs. recall' as a business risk conversation. Sample answer: 'I was setting a threshold for merging customer records for a marketing campaign. I explained it as a trade-off between two kinds of customer-facing mistakes. A lower threshold merges more records (higher recall), so fewer people receive duplicate offers, but it risks merging distinct customers into one record (lower precision), so some people are never contacted. A higher threshold does the reverse: fewer wrong merges, but more duplicates left in the list, which wastes budget and hurts brand perception. I presented data: at this threshold, we capture 90% of true matches, and about 5% of the merges would be wrong. We agreed on a slightly higher threshold, accepting a few more duplicate offers to avoid merging distinct customers.'
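The numbers behind such a conversation usually come from sweeping the match threshold over a labeled sample of scored pairs. The scores and labels below are invented for illustration; the function simply counts true positives, false positives, and false negatives at a given cutoff.

```python
# Hypothetical labeled sample: (match score, is this pair truly the same customer?)
scored_pairs = [
    (0.98, True), (0.95, True), (0.91, True), (0.88, False), (0.84, True),
    (0.79, True), (0.72, False), (0.66, True), (0.55, False), (0.40, False),
]


def precision_recall(pairs, threshold):
    """Precision and recall when pairs scoring at or above `threshold` are merged."""
    tp = sum(1 for score, is_match in pairs if score >= threshold and is_match)
    fp = sum(1 for score, is_match in pairs if score >= threshold and not is_match)
    fn = sum(1 for score, is_match in pairs if score < threshold and is_match)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


for threshold in (0.9, 0.8, 0.7, 0.6):
    p, r = precision_recall(scored_pairs, threshold)
    print(f"threshold={threshold:.1f} precision={p:.2f} recall={r:.2f}")
```

A table like this lets stakeholders pick a threshold by the error they would rather live with, instead of by a raw score.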