Skill Guide

Data quality assurance - deduplication, entity resolution, consistency checking

The systematic process of identifying and rectifying duplicate records, linking disparate data entries to real-world entities, and ensuring uniform adherence to business rules across datasets.

This skill directly underpins data trustworthiness, which is the foundation for accurate analytics, reliable machine learning models, and regulatory compliance. Clean, deduplicated data lowers the operational cost of correcting errors downstream and supports better-informed decisions.

1 Careers

1 Categories

9.0 Avg Demand

18% Avg AI Risk

How to Learn Data quality assurance - deduplication, entity resolution, consistency checking

1. **Master Core Data Profiling:** Learn to use basic SQL (COUNT, DISTINCT, GROUP BY, HAVING) and tools like pandas to identify duplicates and basic inconsistencies.
2. **Understand Key Concepts:** Study the definitions of 'exact match' vs. 'fuzzy match' deduplication, 'golden record,' and 'source system of truth.'
3. **Practice on Small Datasets:** Use public datasets (e.g., U.S. Census names) to manually identify and merge duplicate entries.
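The profiling queries in step 1 can be sketched with Python's built-in `sqlite3` module. The `customers` table, its columns, and the sample rows are hypothetical, chosen only to show COUNT, DISTINCT, GROUP BY, and HAVING working together.

```python
import sqlite3

# Hypothetical sample rows: (id, name, email)
rows = [
    (1, "Ann Lee", "ann@example.com"),
    (2, "Ann Lee", "ann@example.com"),
    (3, "Bob Roy", "bob@example.com"),
    (4, "Bob Roy", None),
    (5, "Cy Diaz", "cy@example.com"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", rows)

# Row count vs. distinct count hints at how much repetition exists.
total, distinct_names = conn.execute(
    "SELECT COUNT(*), COUNT(DISTINCT name) FROM customers"
).fetchone()

# GROUP BY + HAVING lists the exact-duplicate (name, email) combinations.
exact_duplicates = conn.execute(
    """
    SELECT name, email, COUNT(*) AS n
    FROM customers
    GROUP BY name, email
    HAVING COUNT(*) > 1
    ORDER BY name
    """
).fetchall()

# A basic completeness check: rows with a missing email.
missing_email = conn.execute(
    "SELECT COUNT(*) FROM customers WHERE email IS NULL"
).fetchone()[0]

print(total, distinct_names, exact_duplicates, missing_email)
```

Exact-match grouping like this only catches identical values, so near-duplicates such as a typo in the name pass through unnoticed.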

1. **Implement Probabilistic Matching:** Move beyond exact matches to string-similarity and phonetic algorithms such as Levenshtein distance, Jaro-Winkler, and Soundex for fuzzy matching. Use libraries like Python's `fuzzywuzzy` (since renamed `thefuzz`) or `recordlinkage`.
2. **Design a Deduplication Pipeline:** Build a workflow that includes blocking (to reduce comparison pairs), scoring, and threshold-based decisioning.
3. **Define and Enforce Business Rules:** Translate business logic (e.g., 'A customer can have only one active address') into programmatic consistency checks.
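As a minimal sketch of two of these ideas, the block below implements Levenshtein distance from scratch (so no third-party library is assumed) and turns the 'only one active address' rule into a check. The normalization in `similarity`, the record layout, and the function names are illustrative assumptions, not any particular library's API.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Distance rescaled to 0..1 by the longer string's length (one common convention)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


# Hypothetical address rows for the consistency rule.
addresses = [
    {"customer_id": 1, "address": "1 Main St", "active": True},
    {"customer_id": 1, "address": "9 Oak Ave", "active": True},
    {"customer_id": 2, "address": "5 Elm Rd", "active": True},
    {"customer_id": 2, "address": "7 Pine Ct", "active": False},
]


def customers_with_multiple_active(rows):
    """Return customer ids that violate 'only one active address'."""
    counts = Counter(r["customer_id"] for r in rows if r["active"])
    return sorted(cid for cid, n in counts.items() if n > 1)


print(levenshtein("kitten", "sitting"))
print(similarity("jon smith", "john smith"))
print(customers_with_multiple_active(addresses))
```

In practice a maintained library is usually preferable to a hand-written distance function; the point here is only to make the mechanics visible.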

1. **Architect Scalable Solutions:** Design entity resolution systems that handle billions of records using distributed computing frameworks (Spark, Databricks) and specialized platforms (Tamr, Senzing).
2. **Develop Governance Frameworks:** Create and enforce data quality SLAs, stewardship roles, and MDM (Master Data Management) strategies.
3. **Integrate with ML:** Use machine learning models to improve matching accuracy over time and to predict and prevent data quality issues at the point of entry.

Practice Projects

Beginner

Project

Customer List Deduplication

Scenario

You have a CSV file from a sales team with 10,000 customer records containing typos, nicknames (Bill/William), and slight address variations. The goal is to create a clean, deduplicated list.

How to Execute

1. Load data into a pandas DataFrame.
2. Standardize fields (lowercase, trim whitespace).
3. Implement blocking on a field like 'email_domain' or 'zip_code' to create comparison blocks.
4. Within blocks, compute similarity scores for name and address columns using Levenshtein ratio.
5. Set a threshold (e.g., 85%) to flag potential duplicates for manual review.
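The steps above can be sketched end to end as follows. To stay dependency-free, this version uses plain dictionaries instead of a pandas DataFrame and `difflib.SequenceMatcher.ratio()` as a stand-in for a Levenshtein ratio; the sample records, the equal weighting of name and address, and the function names are assumptions made for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical records; a real run would load them from the sales team's CSV.
records = [
    {"id": 1, "name": "William Turner ", "address": "12 Harbor Road", "zip_code": "02110"},
    {"id": 2, "name": "william turner", "address": "12 Harbor Rd", "zip_code": "02110"},
    {"id": 3, "name": "Maria Gomez", "address": "88 Lake Street", "zip_code": "02110"},
    {"id": 4, "name": "William Turner", "address": "12 Harbor Road", "zip_code": "94105"},
]


def standardize(value: str) -> str:
    """Lowercase, trim, and collapse internal whitespace."""
    return " ".join(value.lower().split())


def score(a: dict, b: dict) -> float:
    """Average of name and address similarity (equal weights are an assumption)."""
    name = SequenceMatcher(None, a["name"], b["name"]).ratio()
    address = SequenceMatcher(None, a["address"], b["address"]).ratio()
    return (name + address) / 2


def find_candidates(rows, threshold=0.85):
    clean = [
        {**r, "name": standardize(r["name"]), "address": standardize(r["address"])}
        for r in rows
    ]
    # Blocking: only records sharing a zip_code are ever compared.
    blocks = defaultdict(list)
    for r in clean:
        blocks[r["zip_code"]].append(r)
    flagged = []
    for block in blocks.values():
        for a, b in combinations(block, 2):
            s = score(a, b)
            if s >= threshold:
                flagged.append((a["id"], b["id"], round(s, 2)))
    return flagged  # pairs to send to manual review


print(find_candidates(records))
```

Record 4 is never compared with record 1 because the two sit in different blocks, which shows the trade-off blocking makes between speed and recall. Character-level similarity also does not catch nickname pairs such as Bill/William; those typically need a lookup table of name variants.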

Intermediate

Project

Product Information Management (PIM) Consistency Check

Scenario

An e-commerce company has product data flowing from multiple suppliers into a central database. Specifications like 'weight' and 'dimensions' are in different units and formats, and SKUs are sometimes duplicated.

How to Execute

1. Profile all source feeds to catalog inconsistencies (e.g., 'kg' vs. 'lbs', 'cm' vs. 'in').
2. Write a validation script (Python or SQL) that flags records violating predefined rules (e.g., 'weight must be in kg').
3. Implement a deduplication engine using a unique SKU as a primary key and fuzzy matching on product title and description for non-exact duplicates.
4. Create a dashboard showing data quality scores per source supplier.
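A validation script along these lines might look like the sketch below. The feed layout, the `"<number> <unit>"` weight format, the rule names, and the per-supplier score (share of rows with no flagged issue) are all assumptions for illustration; a real feed would need more tolerant parsing.

```python
from collections import Counter

LB_TO_KG = 0.45359237  # international avoirdupois pound, exact by definition
WEIGHT_TO_KG = {"kg": 1.0, "g": 0.001, "lb": LB_TO_KG, "lbs": LB_TO_KG}

# Hypothetical supplier feed rows.
products = [
    {"sku": "A-100", "supplier": "s1", "weight": "2.5 kg"},
    {"sku": "A-101", "supplier": "s2", "weight": "10 lbs"},
    {"sku": "A-100", "supplier": "s2", "weight": "2500 g"},
    {"sku": "A-102", "supplier": "s2", "weight": "heavy"},
]


def parse_weight_kg(raw: str):
    """Return the weight converted to kg, or None if the value cannot be parsed."""
    parts = raw.strip().lower().split()
    if len(parts) != 2 or parts[1] not in WEIGHT_TO_KG:
        return None
    try:
        return float(parts[0]) * WEIGHT_TO_KG[parts[1]]
    except ValueError:
        return None


def check(rows):
    """Flag rule violations as (row_index, rule_name) pairs."""
    issues = []
    first_seen = {}
    for i, row in enumerate(rows):
        if parse_weight_kg(row["weight"]) is None:
            issues.append((i, "weight_unparseable"))
        elif row["weight"].strip().lower().split()[1] != "kg":
            issues.append((i, "weight_not_kg"))  # convertible, but violates the rule
        if row["sku"] in first_seen:
            issues.append((i, "duplicate_sku"))
        else:
            first_seen[row["sku"]] = i
    return issues


def supplier_scores(rows, issues):
    """Share of each supplier's rows that raised no issue (a simple quality score)."""
    flagged = {i for i, _ in issues}
    totals, clean = Counter(), Counter()
    for i, row in enumerate(rows):
        totals[row["supplier"]] += 1
        clean[row["supplier"]] += i not in flagged
    return {s: clean[s] / totals[s] for s in totals}


issues = check(products)
print(issues)
print(supplier_scores(products, issues))
```

Separating "unparseable" from "wrong unit but convertible" keeps fixable records out of the manual-review queue.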

Advanced

Project

Cross-System Patient Identity Resolution for a Healthcare Provider

Scenario

A hospital network is merging data from three legacy EHR systems. The same patient may have different MRNs, slight variations in name/date of birth, and conflicting records for insurance and contact information. The goal is to create a single, unified patient index.

How to Execute

1. Design a probabilistic entity resolution model with weighted attributes (e.g., name=0.4, SSN=0.9, DOB=0.8).
2. Implement a scalable blocking strategy on highly discriminating fields such as the Soundex code of the last name and the zip code.
3. Use a network graph to cluster records that are transitively linked through shared attributes.
4. Develop a golden record survivorship strategy (e.g., 'most recent' for contact info, 'source of truth' flag for demographics).
5. Establish a stewardship workflow for human review of ambiguous clusters.
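Steps 1, 3, and 4 can be illustrated with a small sketch. It reuses the example weights from step 1, but everything else is assumed: the 0.7 link threshold, the exact-equality field comparison (a real model would use fuzzy comparators), the sample records, and the rule that missing values are skipped rather than penalized. Clustering uses a union-find structure, which yields the same groups as connected components on the link graph.

```python
from collections import defaultdict
from itertools import combinations

WEIGHTS = {"name": 0.4, "ssn": 0.9, "dob": 0.8}  # example weights from step 1
THRESHOLD = 0.7                                   # hypothetical link threshold

# Hypothetical records from three legacy systems.
patients = [
    {"mrn": "A1", "name": "jane doe",   "ssn": "111", "dob": "1980-01-02", "phone": "555-0001", "updated": "2021-03-01"},
    {"mrn": "B7", "name": "jane m doe", "ssn": "111", "dob": "1980-01-02", "phone": "555-0002", "updated": "2023-06-15"},
    {"mrn": "C3", "name": "jane m doe", "ssn": None,  "dob": "1980-01-02", "phone": "555-0003", "updated": "2022-01-10"},
    {"mrn": "C9", "name": "john roe",   "ssn": "222", "dob": "1975-05-05", "phone": "555-0004", "updated": "2020-09-09"},
]


def score(a, b):
    """Weighted share of agreeing fields, skipping fields missing on either side."""
    agree = total = 0.0
    for field, weight in WEIGHTS.items():
        if a[field] is None or b[field] is None:
            continue
        total += weight
        if a[field] == b[field]:
            agree += weight
    return agree / total if total else 0.0


def cluster(n, links):
    """Union-find: records joined by any chain of links end up in one group."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in links:
        parent[find(a)] = find(b)
    groups = defaultdict(list)
    for i in range(n):
        groups[find(i)].append(i)
    return sorted(groups.values())


links = [
    (i, j)
    for i, j in combinations(range(len(patients)), 2)
    if score(patients[i], patients[j]) >= THRESHOLD
]
clusters = cluster(len(patients), links)


def golden_phone(members):
    """'Most recent' survivorship: ISO dates sort correctly as strings."""
    return max((patients[i] for i in members), key=lambda r: r["updated"])["phone"]


print(links, clusters, golden_phone(clusters[0]))
```

Records A1 and C3 fall below the threshold when compared directly, yet land in the same cluster through B7. Transitive linking like this is also how unrelated records can chain together, which is why step 5 routes ambiguous clusters to human review.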

Tools & Frameworks

Software & Platforms

Apache Spark / Databricks (Scala/Python)

SQL (Advanced)

Python Libraries: recordlinkage, fuzzywuzzy, splink

Specialized MDM/ER Platforms: Tamr, Informatica MDM, Senzing

Spark is for large-scale data processing and implementing custom ER logic at scale. Advanced SQL is for in-database profiling and rule enforcement. Python libraries are for prototyping and mid-scale matching. Specialized platforms offer pre-built, scalable entity resolution and stewardship workflows.

Mental Models & Methodologies

Probabilistic Record Linkage (Fellegi-Sunter Model)

Data Quality Dimensions Framework (Completeness, Consistency, Accuracy, Timeliness)

Master Data Management (MDM) Architecture Patterns (Registry, Coexistence, Transactional)

The Fellegi-Sunter model is the statistical foundation for probabilistic matching. The Data Quality Dimensions framework provides a vocabulary for defining and measuring quality. MDM patterns guide the architectural choice for how and where to consolidate golden records.
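To make the Fellegi-Sunter idea concrete: each field has an m-probability (chance the field agrees when two records truly match) and a u-probability (chance it agrees for a non-match). Agreement contributes log2(m/u) to the pair's total weight and disagreement contributes log2((1-m)/(1-u)). The sketch below assumes the fields are conditionally independent, and the m/u values are made-up placeholders rather than estimates from real data.

```python
import math

# Hypothetical (m, u) probabilities per field.
PARAMS = {
    "last_name": (0.95, 0.01),
    "dob": (0.97, 0.003),
    "zip_code": (0.90, 0.05),
}


def field_weight(m: float, u: float, agrees: bool) -> float:
    """Log-likelihood-ratio contribution of one field comparison."""
    if agrees:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))


def match_weight(agreement: dict) -> float:
    """Sum of field weights for a candidate pair, given per-field agreement flags."""
    return sum(
        field_weight(*PARAMS[field], agrees)
        for field, agrees in agreement.items()
    )


all_agree = match_weight({"last_name": True, "dob": True, "zip_code": True})
dob_differs = match_weight({"last_name": True, "dob": False, "zip_code": True})
none_agree = match_weight({"last_name": False, "dob": False, "zip_code": False})
print(all_agree, dob_differs, none_agree)
```

Pairs are then classified by comparing the total weight against an upper threshold (match) and a lower threshold (non-match), with the band in between sent to clerical review.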

Interview Questions

Answer Strategy

The interviewer is testing system design and practical algorithm knowledge. Structure the answer as a pipeline:

1. **Pre-processing & Profiling:** Standardize and analyze attribute distributions.
2. **Blocking:** Choose blocking keys (e.g., soundex(last_name) + zip_code, email domain) to reduce the O(n^2) comparison space.
3. **Pairwise Comparison:** Apply weighted similarity functions (Jaro-Winkler for names, Levenshtein for addresses) to candidate pairs.
4. **Clustering & Decisioning:** Use a graph-based clustering algorithm (e.g., connected components) to group records, applying a match threshold.
5. **Survivorship & Golden Record Creation:** Define business rules (e.g., 'most recent, highest quality source') to create a master record.

Mention scalability (Spark) and monitoring (data quality dashboards).
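The blocking step is the part of this answer that most benefits from a concrete illustration. The sketch below implements American Soundex by hand and counts candidate pairs with and without a `soundex(last_name) + zip_code` key; the sample names and the `|` key separator are assumptions.

```python
from collections import Counter
from math import comb

# Letter-to-digit table for American Soundex.
CODES = {
    letter: digit
    for digit, letters in {
        "1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r",
    }.items()
    for letter in letters
}


def soundex(name: str) -> str:
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(c, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code after it counts again
    return (result + "000")[:4]


def blocking_key(last_name: str, zip_code: str) -> str:
    return f"{soundex(last_name)}|{zip_code}"


# Hypothetical (last_name, zip_code) pairs.
people = [
    ("Smith", "10001"), ("Smyth", "10001"), ("Smith", "94105"),
    ("Robert", "10001"), ("Rupert", "10001"), ("Garcia", "60601"),
]


def candidate_pair_count(rows) -> int:
    """Pairs compared when only records sharing a blocking key are paired."""
    sizes = Counter(blocking_key(*row) for row in rows)
    return sum(comb(n, 2) for n in sizes.values())


all_pairs = comb(len(people), 2)
blocked_pairs = candidate_pair_count(people)
print(all_pairs, blocked_pairs)
```

The saving grows quickly with dataset size, but any true match whose records land in different blocks (for example, a patient who moved to a new zip code) is never compared. Running several blocking passes with different keys is a common way to recover those pairs.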

Answer Strategy

The core competencies being tested are problem-solving, communication, and systemic thinking. Use the STAR method (Situation, Task, Action, Result). Focus on the technical discovery (e.g., a failed consistency check, a spike in duplicates), the immediate fix (data patch, process halt), and the long-term prevention (rule implementation, pipeline validation, monitoring).