Skill Guide

Real-world data source evaluation and data quality assessment

The systematic process of verifying the reliability, relevance, and integrity of data sources, then applying structured frameworks to quantify and remediate issues like incompleteness, inconsistency, and bias before data enters any analytical or operational pipeline.

This skill directly prevents costly model failures, flawed business intelligence, and compliance risks by ensuring decision-making is built on a foundation of trustworthy, auditable data. It shifts data work from a 'garbage-in, garbage-out' cost center to a strategic asset that accelerates time-to-insight and builds organizational credibility.

1 Careers

1 Categories

8.9 Avg Demand

15% Avg AI Risk

How to Learn Real-world data source evaluation and data quality assessment

1. Master the core dimensions of data quality: accuracy, completeness, consistency, timeliness, uniqueness, and validity.
2. Learn to document data provenance: where data comes from, how it is collected, and what transformations it undergoes.
3. Practice profiling datasets using basic descriptive statistics (mean, median, null counts, cardinality) to identify obvious anomalies.
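The profiling step can be sketched with plain pandas. This is a minimal illustration, not a full profiling tool: the `profile` helper and the sample `orders` table are hypothetical names and made-up values introduced only for this example.

```python
import pandas as pd


def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with basic profiling statistics."""
    numeric = df.select_dtypes(include="number")
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "null_count": df.isna().sum(),
        "null_pct": (df.isna().mean() * 100).round(2),
        "cardinality": df.nunique(dropna=True),  # distinct non-null values
    })
    # Mean and median only exist for numeric columns; others stay NaN.
    summary["mean"] = numeric.mean()
    summary["median"] = numeric.median()
    return summary


# Hypothetical sample data, for illustration only.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "amount": [10.0, 12.0, None, 950.0],
    "country": ["DE", "DE", "FR", None],
})
report = profile(orders)
print(report)
```

In this sample, the gap between the mean and the median of `amount` hints at an outlier worth inspecting, and the null counts show which columns are incomplete.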

1. Move from reactive profiling to proactive assessment by creating data quality scorecards and SLAs for key data pipelines.
2. Apply root cause analysis (e.g., the '5 Whys') to recurring data issues, distinguishing between source-system errors, integration bugs, and semantic misunderstandings. Common mistake: Fixing symptoms (e.g., cleaning nulls) without addressing the upstream process that generated them.
3. Evaluate source credibility by examining its methodology, sample size, and potential biases (selection bias, survivorship bias).
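A data quality scorecard can be as simple as a list of metrics, each compared against an agreed SLA threshold. The sketch below shows one possible shape. The `QualityMetric` class, the metric names, and every number are hypothetical placeholders, not recommended targets.

```python
from dataclasses import dataclass


@dataclass
class QualityMetric:
    name: str
    observed: float   # measured value for the latest pipeline run
    target: float     # SLA threshold agreed with data consumers
    higher_is_better: bool = True

    def passed(self) -> bool:
        if self.higher_is_better:
            return self.observed >= self.target
        return self.observed <= self.target


def scorecard(metrics):
    """Return per-metric rows and the overall share of SLAs met."""
    rows = [
        (m.name, m.observed, m.target, "PASS" if m.passed() else "FAIL")
        for m in metrics
    ]
    pass_rate = sum(m.passed() for m in metrics) / len(metrics)
    return rows, pass_rate


# Hypothetical metrics and thresholds, for illustration only.
metrics = [
    QualityMetric("completeness_pct", observed=98.7, target=99.0),
    QualityMetric("duplicate_rate_pct", observed=0.2, target=0.5, higher_is_better=False),
    QualityMetric("freshness_hours", observed=5.0, target=6.0, higher_is_better=False),
    QualityMetric("validity_pct", observed=99.6, target=99.5),
]
rows, pass_rate = scorecard(metrics)
for name, observed, target, status in rows:
    print(f"{name:<20} observed={observed:<6} target={target:<6} {status}")
print(f"SLAs met: {pass_rate:.0%}")
```

Tracking the failing rows over successive runs gives root cause analysis a concrete starting point: a metric that fails repeatedly points to an upstream process rather than a one-off glitch.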

1. Architect data quality frameworks that are embedded into data mesh or data product ownership models, defining quality as a non-functional requirement.
2. Conduct strategic data sourcing assessments for new initiatives, balancing cost, speed, quality, and legal/ethical constraints (e.g., GDPR, PII).
3. Mentor teams by establishing data quality governance councils and championing a culture of data accountability across business and engineering units.

Practice Projects

Beginner

Case Study/Exercise

Profiling a Public Dataset for Retail Sales

Scenario

You are given a raw CSV file containing 1 million rows of e-commerce transaction data from a third-party provider. The marketing team wants to use it for customer segmentation, but you suspect issues.

How to Execute

1. Use a tool like ydata-profiling (formerly pandas-profiling) or Great Expectations to generate an automated report.
2. Manually validate a sample of 100 rows against the data dictionary for schema conformance (e.g., date formats, category codes).
3. Quantify the severity: calculate the percentage of missing values in the 'customer_id' column and the number of duplicate transaction_ids. Document findings in a structured 'Data Quality Assessment Report' template.
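The severity calculation in step 3 could look like the sketch below. It assumes the CSV has columns named `customer_id` and `transaction_id`, as in the scenario. The `quantify_issues` helper and the small in-memory sample are hypothetical; in practice the frame would come from something like `pd.read_csv(...)` on the provider's file.

```python
import pandas as pd


def quantify_issues(df: pd.DataFrame) -> dict:
    """Measure missing customer IDs and duplicated transaction IDs."""
    missing_pct = float(df["customer_id"].isna().mean() * 100)
    # Count every repeat beyond the first occurrence of a transaction_id.
    duplicate_count = int(df["transaction_id"].duplicated(keep="first").sum())
    return {
        "rows": len(df),
        "customer_id_missing_pct": round(missing_pct, 2),
        "duplicate_transaction_ids": duplicate_count,
    }


# Hypothetical stand-in for the real 1-million-row file.
transactions = pd.DataFrame({
    "transaction_id": [101, 102, 102, 103, 104],
    "customer_id": ["a", None, "b", "c", None],
})
findings = quantify_issues(transactions)
print(findings)
```

The resulting dictionary maps directly onto rows of a Data Quality Assessment Report, one line per finding.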

Intermediate

Project

Build a Data Quality Dashboard for a Critical Pipeline

Scenario

Your company's key sales dashboard is powered by an ETL pipeline that aggregates data from Salesforce, a legacy ERP, and a web analytics API. Stakeholders have complained about mismatched revenue figures.

How to Execute

1. Define 3-5 key data quality metrics for the pipeline (e.g., 'Revenue Conformity' between Salesforce and ERP, 'Session Completeness' from web analytics).
2. Implement automated data quality checks using a framework like Great Expectations or dbt tests within the pipeline DAG.
3. Create a live dashboard (in Looker, Tableau, or Grafana) that visualizes these quality metrics over time, highlighting failures and triggering alerts.
4. Present the root cause analysis for one major discrepancy, proposing a fix to the upstream data contract.
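One way to compute a 'Revenue Conformity' metric is to reconcile the two systems record by record. The sketch below uses plain pandas rather than Great Expectations or dbt. The `order_id` and `revenue` column names, the tolerance, and the sample values are assumptions made for illustration, since the scenario does not specify either system's schema.

```python
import pandas as pd


def revenue_conformity(sf: pd.DataFrame, erp: pd.DataFrame, tolerance: float = 0.01):
    """Compare revenue per order across two systems and classify each order."""
    merged = sf.merge(
        erp, on="order_id", how="outer",
        suffixes=("_sf", "_erp"), indicator=True,
    )
    in_both = merged["_merge"] == "both"
    diff = (merged["revenue_sf"] - merged["revenue_erp"]).abs()

    merged["status"] = "mismatch"
    merged.loc[in_both & (diff <= tolerance), "status"] = "match"
    merged.loc[merged["_merge"] == "left_only", "status"] = "missing_in_erp"
    merged.loc[merged["_merge"] == "right_only", "status"] = "missing_in_salesforce"

    # Share of all orders seen in either system that agree within tolerance.
    conformity = float((merged["status"] == "match").mean())
    return conformity, merged[["order_id", "revenue_sf", "revenue_erp", "status"]]


# Hypothetical extracts from the two systems.
salesforce = pd.DataFrame({"order_id": [1, 2, 3], "revenue": [100.0, 200.0, 300.0]})
erp = pd.DataFrame({"order_id": [1, 2, 4], "revenue": [100.0, 210.0, 400.0]})

conformity, detail = revenue_conformity(salesforce, erp)
print(detail)
print(f"Revenue conformity: {conformity:.0%}")
```

Separating mismatches from records missing on one side matters for root cause analysis: the first suggests a transformation or currency issue, the second a sync or filtering gap.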

Advanced

Project

Strategic Sourcing & Quality Assessment for a New AI Product

Scenario

You are the data lead for a new AI-powered fraud detection product. You must evaluate and select between three potential data sources: an internal historical transaction database, a real-time stream from a third-party vendor, and a consortium data pool shared by industry partners.

How to Execute

1. Develop a weighted scoring matrix evaluating each source across 6 dimensions: accuracy (ground truth availability), timeliness (latency), coverage (customer population), cost, legal compliance (consent, GDPR), and technical integration effort.
2. Design a proof-of-concept assessment: For each source, build a mini-pipeline and measure its performance on a labeled test set of known fraudulent/non-fraudulent transactions.
3. Model the total cost of ownership (TCO), factoring in data acquisition, storage, compute, and ongoing quality monitoring overhead.
4. Deliver a strategic recommendation report to leadership, outlining trade-offs, risks, and a phased integration plan.
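The weighted scoring matrix from step 1 reduces to a weighted sum per source. The sketch below shows the mechanics only. The weights and the 1-to-5 scores are invented placeholders, not an assessment of any real source; in practice they would come from stakeholder input and the proof-of-concept results.

```python
import math

# Hypothetical weights per dimension; they must sum to 1.
WEIGHTS = {
    "accuracy": 0.25,
    "timeliness": 0.20,
    "coverage": 0.15,
    "cost": 0.15,
    "legal_compliance": 0.15,
    "integration_effort": 0.10,
}

# Hypothetical scores on a 1 (poor) to 5 (excellent) scale.
SCORES = {
    "internal_history": {"accuracy": 5, "timeliness": 2, "coverage": 3,
                         "cost": 5, "legal_compliance": 5, "integration_effort": 4},
    "vendor_stream": {"accuracy": 3, "timeliness": 5, "coverage": 4,
                      "cost": 2, "legal_compliance": 3, "integration_effort": 3},
    "consortium_pool": {"accuracy": 4, "timeliness": 3, "coverage": 5,
                        "cost": 3, "legal_compliance": 2, "integration_effort": 2},
}


def weighted_score(scores: dict, weights: dict) -> float:
    """Weighted sum of dimension scores; rejects inconsistent inputs."""
    if not math.isclose(sum(weights.values()), 1.0):
        raise ValueError("weights must sum to 1")
    if scores.keys() != weights.keys():
        raise ValueError("scores and weights must cover the same dimensions")
    return sum(weights[dim] * scores[dim] for dim in weights)


ranking = sorted(
    ((name, round(weighted_score(s, WEIGHTS), 2)) for name, s in SCORES.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, total in ranking:
    print(f"{name:<18} {total:.2f}")
```

Because the ranking is sensitive to the weights, it is worth rerunning it with a few alternative weightings before presenting a recommendation.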

Tools & Frameworks

Data Profiling & Quality Frameworks

Great Expectations; dbt (data build tool) Tests; pandas-profiling / ydata-profiling

Great Expectations is a widely used open-source framework for data validation, documentation, and profiling within pipelines. dbt tests validate data models after transformation. pandas-profiling (now maintained as ydata-profiling) is a rapid, exploratory tool for initial dataset assessment in a notebook environment.

Mental Models & Methodologies

Data Quality Dimensions (Accuracy, Completeness, Consistency, Timeliness, Uniqueness, Validity); Root Cause Analysis (5 Whys, Fishbone Diagrams); Data Provenance & Lineage Tracking; Cost-Benefit Analysis for Data Sourcing

The Dimensions framework provides the objective criteria for assessment. Root Cause Analysis ensures you solve systemic issues, not symptoms. Provenance tracking is critical for debugging and auditing. Cost-Benefit analysis structures the business case for data investments.

Interview Questions

Answer Strategy

Structure your answer using a phased approach: 1) Preliminary Vetting (examine provider's methodology, sample data, SLAs), 2) Technical Validation (run automated profiling for schema conformance, distribution anomalies, and null rates), 3) Business Validation (compare against a trusted internal source on key metrics for a sample cohort), 4) Ongoing Monitoring Design (propose specific data quality checks and alerts for the production feed). Sample Answer: 'I'd start with a due diligence phase on the provider's collection methodology to understand inherent biases. Then, I'd validate a historical sample technically for conformance and completeness. The critical step is a business truth test: comparing their values on a known set of entities to our gold-standard data. Finally, I'd design an SLA-driven monitoring contract with checks for timeliness, accuracy drift, and anomaly detection before recommending a purchase.'

Answer Strategy

This tests proactive investigation, impact analysis, and stakeholder management. Use the STAR method (Situation, Task, Action, Result), focusing on your systematic approach. Sample Answer: 'While building a churn model, I noticed a sudden drop in customer activity data. I traced it back to a silent schema change in the event logging API. I quantified the impact by calculating a 30% gap in daily active user metrics over two weeks, which invalidated our model's training set. I presented this to both the data engineering team and the product managers, using the business impact to prioritize a hotfix. I then implemented schema contract tests to prevent recurrence.'
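The schema contract tests mentioned in the sample answer can be sketched as a check that every incoming record has the expected fields and types. The field names in `EXPECTED_SCHEMA` and the `check_contract` helper are hypothetical, chosen only to illustrate how a silent rename in an event payload would be caught.

```python
# Hypothetical contract for an event payload: field name -> expected type.
EXPECTED_SCHEMA = {
    "event_id": str,
    "user_id": str,
    "event_type": str,
    "timestamp": str,
}


def check_contract(record: dict, schema: dict) -> list:
    """Return a list of human-readable violations; empty means conformant."""
    violations = []
    for field, expected_type in schema.items():
        if field not in record:
            violations.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            violations.append(
                f"wrong type for {field}: expected {expected_type.__name__}, "
                f"got {type(record[field]).__name__}"
            )
    # Fields outside the contract often signal a rename upstream.
    for field in sorted(record.keys() - schema.keys()):
        violations.append(f"unexpected field: {field}")
    return violations


good = {"event_id": "e1", "user_id": "u1", "event_type": "login",
        "timestamp": "2024-01-01T00:00:00Z"}
# Simulated silent schema change: user_id renamed to userId.
renamed = {"event_id": "e2", "userId": "u1", "event_type": "login",
           "timestamp": "2024-01-01T00:05:00Z"}

print(check_contract(good, EXPECTED_SCHEMA))
print(check_contract(renamed, EXPECTED_SCHEMA))
```

Run against a sample of each batch in CI or at ingestion time, a check like this turns a silent schema change into an immediate, attributable failure.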