Skill Guide

Data quality profiling, anomaly detection, and completeness scoring

Data quality profiling is the systematic analysis of datasets to assess structure, content, and relationships; anomaly detection is the identification of data points that deviate significantly from expected patterns; completeness scoring is the quantification of missing or null values against a defined schema to measure data reliability.

This skill is foundational to data-driven decision-making, because poor data quality erodes trust in analytics and leads to flawed business insights. Mastering it reduces financial risk, supports regulatory compliance, and improves operational efficiency by protecting the integrity of the data supply chain.

Careers: 1 | Categories: 1 | Avg Demand: 8.7 | Avg AI Risk: 25%

How to Learn Data quality profiling, anomaly detection, and completeness scoring

Focus on: 1) Understanding data profiling metrics (cardinality, uniqueness, null ratios, data type distributions). 2) Learning basic descriptive statistics (mean, median, standard deviation, min/max) for initial anomaly flagging. 3) Grasping schema validation principles (e.g., JSON Schema, SQL CHECK constraints) to define completeness rules.
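The profiling metrics listed above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the output of any particular profiling tool. The helper name profile_column and the sample rows are hypothetical, and "uniqueness" is assumed here to mean distinct non-null values divided by non-null values.

```python
from collections import Counter


def profile_column(values):
    """Return basic profiling metrics for one column given as a list."""
    total = len(values)
    non_null = [v for v in values if v is not None]
    distinct = set(non_null)
    return {
        # Share of entries that are missing (None).
        "null_ratio": (total - len(non_null)) / total if total else 0.0,
        # Number of distinct non-null values.
        "cardinality": len(distinct),
        # Share of non-null values that are distinct; 1.0 means no duplicates.
        "uniqueness": len(distinct) / len(non_null) if non_null else 0.0,
        # How many non-null values fall under each Python type.
        "type_counts": dict(Counter(type(v).__name__ for v in non_null)),
    }


# Hypothetical sample rows: a duplicated order_id and a mixed-type quantity.
rows = [
    {"order_id": 1, "quantity": 2, "shipping_status": "shipped"},
    {"order_id": 2, "quantity": None, "shipping_status": "pending"},
    {"order_id": 3, "quantity": "4", "shipping_status": None},
    {"order_id": 3, "quantity": 1, "shipping_status": "shipped"},
]

profile = {col: profile_column([r[col] for r in rows]) for col in rows[0]}
for col, metrics in profile.items():
    print(col, metrics)
```

A type distribution with more than one entry (here, quantity holds both int and str values) is often the first hint that a column needs cleaning before any statistics are computed on it.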

Move to practice by implementing automated profiling jobs with tools like Great Expectations or ydata-profiling (formerly Pandas Profiling) on sample datasets (e.g., customer transaction logs). Common mistakes include over-relying on automated reports without business context and failing to establish a baseline of 'normal' before detecting anomalies. Focus on creating data quality scorecards for key data products.

Mastery involves: Designing and governing enterprise-wide data quality monitoring frameworks that integrate with data pipelines (e.g., using dbt tests, Dagster sensors). This includes establishing data SLAs, leading root cause analysis for systemic data issues, and aligning quality metrics with business KPIs (e.g., correlating data freshness with marketing campaign performance).

Practice Projects

Beginner

Project

Retail E-commerce Dataset Profiling & Quality Report

Scenario

You are given a CSV file of 100,000 product sales records containing columns: order_id, product_id, customer_id, order_date, quantity, unit_price, and shipping_status.

How to Execute

1) Load data using Pandas and run df.describe(), df.info(), and check null counts per column. 2) Calculate uniqueness ratios for key IDs (order_id should be 100% unique). 3) Identify numeric outliers in quantity and unit_price using IQR or Z-score. 4) Compile a markdown report summarizing completeness scores (e.g., shipping_status 85% complete) and key anomalies.
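Steps 3 and 4 above can be sketched as follows. The project itself uses Pandas; this version uses only the standard library so it runs anywhere, and it assumes the common 1.5 x IQR fence and the "inclusive" quartile method. The function names and the ten-row sample are hypothetical.

```python
import statistics


def completeness(values):
    """Percentage of entries that are neither None nor an empty string."""
    if not values:
        return 0.0
    filled = sum(1 for v in values if v is not None and v != "")
    return 100.0 * filled / len(values)


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical sample columns from the sales file.
quantity = [1, 2, 2, 3, 1, 2, 3, 2, 1, 250]
shipping_status = ["shipped", "pending", None, "shipped", "",
                   "shipped", "shipped", "pending", "shipped", "shipped"]

# Emit a small markdown report.
print("| Check | Result |")
print("| --- | --- |")
print(f"| shipping_status completeness | {completeness(shipping_status):.0f}% |")
print(f"| quantity outliers (IQR) | {iqr_outliers(quantity)} |")
```

Treating empty strings as missing is a design choice; whether it is correct depends on how the source system encodes "no value".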

Intermediate

Project

Automated Data Quality Pipeline with Great Expectations

Scenario

You manage a daily batch pipeline loading user event logs (JSON) into a data warehouse. Stakeholders report sporadic 'garbage' data breaking downstream dashboards.

How to Execute

1) Define a suite of expectations in Great Expectations (e.g., column 'user_id' must not be null, 'event_type' must be in a known set, 'timestamp' must be within the last 90 days). 2) Integrate the validation step as a Python task in your orchestration tool (Airflow, Prefect). 3) Set up a checkpoint to quarantine failing batches and alert the data team via Slack/email. 4) Generate a Data Docs HTML report for stakeholders.
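The three example rules in step 1 and the quarantine idea in step 3 can be illustrated in plain Python. This sketch deliberately does not use the Great Expectations API, whose syntax varies between versions. The names KNOWN_EVENT_TYPES, validate_event, and split_batch are hypothetical, and it quarantines individual records rather than whole batches to keep the example short.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical set of allowed event types.
KNOWN_EVENT_TYPES = {"click", "view", "purchase"}


def validate_event(event, now):
    """Return a list of rule violations for one event dict (empty if valid)."""
    failures = []
    if event.get("user_id") is None:
        failures.append("user_id is null")
    if event.get("event_type") not in KNOWN_EVENT_TYPES:
        failures.append("event_type not in known set")
    ts = event.get("timestamp")
    if ts is None or not (now - timedelta(days=90) <= ts <= now):
        failures.append("timestamp outside last 90 days")
    return failures


def split_batch(events, now):
    """Separate a batch into valid events and quarantined (event, failures) pairs."""
    valid, quarantined = [], []
    for event in events:
        failures = validate_event(event, now)
        if failures:
            quarantined.append((event, failures))
        else:
            valid.append(event)
    return valid, quarantined


# A fixed "now" keeps the example deterministic.
now = datetime(2024, 6, 1, tzinfo=timezone.utc)
batch = [
    {"user_id": "u1", "event_type": "click", "timestamp": now - timedelta(days=1)},
    {"user_id": None, "event_type": "view", "timestamp": now - timedelta(days=2)},
    {"user_id": "u3", "event_type": "???", "timestamp": now - timedelta(days=400)},
]

valid, quarantined = split_batch(batch, now)
print(f"{len(valid)} valid, {len(quarantined)} quarantined")
for event, failures in quarantined:
    print(event["user_id"], failures)
```

In an orchestrated pipeline, a function like split_batch would run as its own task, with the quarantined list driving the alert.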

Advanced

Project

Enterprise Data Quality Scorecard & Root Cause Remediation

Scenario

The Chief Data Officer mandates a company-wide data quality scorecard. Critical customer data in the CRM (Salesforce) and the billing system (NetSuite) has inconsistent key attributes (e.g., industry codes), impacting sales forecasting accuracy.

How to Execute

1) Define a DQ framework with weighted dimensions: completeness, consistency (cross-system matching), timeliness, and accuracy (validated against source). 2) Implement cross-system reconciliation checks using a tool like Ataccama or Informatica DQ. 3) Establish a data stewardship process where owners are accountable for scorecard metrics. 4) Lead a root cause analysis workshop, then implement upstream system fixes (e.g., mandatory fields in Salesforce) and a master data management (MDM) solution.
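The weighted-dimension idea in step 1 and the cross-system matching in step 2 can be sketched as below. The weights, the sample records, and the hard-coded completeness, timeliness, and accuracy scores are illustrative assumptions only; a real scorecard would take its weights from the governance process and its scores from measured checks.

```python
# Hypothetical dimension weights; they sum to 1.0.
WEIGHTS = {"completeness": 0.4, "consistency": 0.3, "timeliness": 0.1, "accuracy": 0.2}


def consistency(crm, billing, attribute):
    """Share of customers present in both systems whose attribute values match."""
    shared = crm.keys() & billing.keys()
    if not shared:
        return 0.0
    matches = sum(1 for cid in shared
                  if crm[cid].get(attribute) == billing[cid].get(attribute))
    return matches / len(shared)


def weighted_score(scores, weights):
    """Weighted average of per-dimension scores, each in the range 0..1."""
    total_weight = sum(weights.values())
    return sum(scores[dim] * w for dim, w in weights.items()) / total_weight


# Hypothetical records keyed by customer ID; c3 disagrees, c5 exists in billing only.
crm = {
    "c1": {"industry_code": "5411"},
    "c2": {"industry_code": "7372"},
    "c3": {"industry_code": "6211"},
    "c4": {"industry_code": "4812"},
}
billing = {
    "c1": {"industry_code": "5411"},
    "c2": {"industry_code": "7372"},
    "c3": {"industry_code": "6199"},
    "c4": {"industry_code": "4812"},
    "c5": {"industry_code": "2834"},
}

consistency_score = consistency(crm, billing, "industry_code")
scores = {
    "completeness": 0.90,   # assumed value for illustration
    "consistency": consistency_score,
    "timeliness": 1.00,     # assumed value for illustration
    "accuracy": 0.80,       # assumed value for illustration
}
overall = weighted_score(scores, WEIGHTS)
print(f"consistency={consistency_score:.2f} overall={overall:.3f}")
```

Records that exist in only one system (c5 here) are excluded from the consistency ratio; they are usually better reported as a separate coverage metric.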

Tools & Frameworks

Software & Platforms

Great Expectations, Pandas Profiling / ydata-profiling, dbt (data build tool) Tests, AWS Glue DataBrew, Ataccama ONE

Use Great Expectations for declarative, pipeline-integrated testing; ydata-profiling (formerly Pandas Profiling) for rapid exploratory analysis in notebooks; dbt tests for defining and running data quality checks directly within SQL-based transformation layers; AWS Glue DataBrew for visual profiling on cloud data lakes; and Ataccama ONE for enterprise-scale, governed data quality management.

Statistical & Algorithmic Methods

Z-score / Modified Z-score, Interquartile Range (IQR), Isolation Forest, Expectation-Maximization (EM) for missing data

Apply Z-score and IQR for simple, univariate numeric anomaly detection. Use Isolation Forest for efficient, high-dimensional anomaly detection without labeled data. Use EM algorithms to estimate and impute missing values when data is missing at random (MAR); when data is missing not at random (MNAR), standard EM estimates are biased unless the missingness mechanism is modeled explicitly.
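As an illustration of the Modified Z-score named above, the sketch below uses the median and the median absolute deviation (MAD) instead of the mean and standard deviation, so a single extreme value does not distort the scale. The constant 0.6745 and the cutoff of 3.5 are the commonly cited defaults; the sample prices are hypothetical.

```python
import statistics


def modified_z_scores(values):
    """Modified Z-score per value: 0.6745 * (x - median) / MAD."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        # More than half the values equal the median; the score is undefined.
        raise ValueError("MAD is zero; modified Z-score is undefined")
    return [0.6745 * (v - med) / mad for v in values]


# Hypothetical unit prices with one suspicious entry.
unit_price = [9.99, 10.49, 10.25, 9.75, 10.10, 99.0]

scores = modified_z_scores(unit_price)
outliers = [v for v, z in zip(unit_price, scores) if abs(z) > 3.5]
print(outliers)
```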

Frameworks & Methodologies

TDQM (Total Data Quality Management), ISO 8000 Data Quality Standard, Six Sigma DMAIC for Data

TDQM provides a holistic management framework. ISO 8000 offers formal specifications for master data quality. Apply the DMAIC (Define, Measure, Analyze, Improve, Control) cycle to systematically identify, quantify, and root-cause data quality issues.

Interview Questions

Answer Strategy

The interviewer is testing your understanding of real-time constraints, statistical process control, and system design. Frame your answer around three of the four V's of big data (Volume, Velocity, Veracity). Sample answer: 'I'd implement a two-tier approach. First, for velocity and volume, use stream processing (Kafka Streams/Flink) to apply lightweight rule-based checks (e.g., value within bounds, timestamp sequence). Second, for deeper statistical veracity, maintain a rolling window of data in memory to compute dynamic Z-scores or use a streaming anomaly detection model like Random Cut Forest (RCF). Alerts would trigger on rule violations or model score thresholds.'
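The rolling-window Z-score in the sample answer can be sketched in a few lines. This is a single-process illustration, not a Kafka Streams or Flink job. The class name RollingZScoreDetector, the window size, the threshold of 3.0, and the sample stream are all assumptions chosen for the example.

```python
import statistics
from collections import deque


class RollingZScoreDetector:
    """Flag values that deviate strongly from a rolling window of recent values."""

    def __init__(self, window=50, threshold=3.0, min_points=10):
        self.values = deque(maxlen=window)  # oldest values drop off automatically
        self.threshold = threshold
        self.min_points = min_points

    def observe(self, x):
        """Score x against the current window, then add x to the window."""
        is_anomaly = False
        if len(self.values) >= self.min_points:
            mean = statistics.fmean(self.values)
            std = statistics.pstdev(self.values)
            if std > 0 and abs(x - mean) / std > self.threshold:
                is_anomaly = True
        self.values.append(x)
        return is_anomaly


# Hypothetical metric stream with one spike at index 11.
stream = [10, 11, 9, 10, 12, 10, 11, 9, 10, 11, 10, 95, 10, 11]

detector = RollingZScoreDetector()
flagged = [i for i, x in enumerate(stream) if detector.observe(x)]
print(flagged)
```

Note that this version adds flagged values to the window, which inflates the standard deviation and can mask later anomalies; excluding flagged points from the window is a common alternative worth mentioning in an interview.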

Answer Strategy

This is a behavioral question testing stakeholder management, communication, and problem-solving. Use the STAR (Situation, Task, Action, Result) method. Sample answer: 'Situation: I found that our customer segmentation model was using a region field with 30% missing data, skewing marketing campaign targeting. Task: I needed to quantify the revenue impact and fix the pipeline. Action: I first halted the faulty campaign launch. Then, I led a root cause analysis tracing the nulls to a failed API integration. I implemented a retry mechanism and a completeness check in our dbt pipeline. Result: We prevented an estimated $500k in misallocated ad spend and now have a daily DQ dashboard for the marketing ops team.'