Skill Guide

Data cleaning and normalization with Python (pandas, polars, NumPy)

The systematic process of identifying and correcting corrupt, inaccurate, or irrelevant records and transforming data into a consistent, usable format using Python's core data manipulation libraries.

Data cleaning directly affects data integrity, the foundation for all subsequent analysis, machine learning, and business intelligence. Poorly cleaned data leads to flawed insights and erroneous model predictions, which makes this skill critical to the return on investment of data projects.

1 Careers

1 Categories

9.0 Avg Demand

25% Avg AI Risk

How to Learn Data cleaning and normalization with Python (pandas, polars, NumPy)

Focus on pandas DataFrames: indexing, selection, and basic column operations. Learn to identify and handle missing values (`isna()`, `fillna()`, `dropna()`) and basic data type conversion (`astype()`). Practice loading data from common formats (CSV, Excel) into a structured DataFrame.
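As a minimal sketch of the missing-value and type-conversion calls named above, the following uses a small made-up DataFrame (the column names and values are illustrative assumptions, not data from this guide):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; column names are illustrative only.
df = pd.DataFrame({
    "age": ["34", None, "29", "41"],
    "city": ["Oslo", "Bergen", None, "Oslo"],
    "score": [88.5, np.nan, 92.0, np.nan],
})

# Count missing values per column before deciding how to handle them.
missing_counts = df.isna().sum()

# Drop rows where the key column "age" is missing, then convert it to int.
cleaned = df.dropna(subset=["age"]).copy()
cleaned["age"] = cleaned["age"].astype(int)

# Fill the remaining gaps: a placeholder for text, the column mean for numbers.
cleaned["city"] = cleaned["city"].fillna("unknown")
cleaned["score"] = cleaned["score"].fillna(cleaned["score"].mean())
```

Whether to drop or fill depends on the column's role; mean imputation is only one option and can distort distributions when many values are missing.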

Move to advanced cleaning with `apply()` and vectorized string operations. Master merging, joining, and reshaping data (`merge()`, `pivot_table()`, `melt()`). Learn Polars for its performance benefits on large datasets and understand the differences in syntax. Implement basic normalization techniques (min-max, z-score) using NumPy or pandas.
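The two normalization techniques mentioned above can be sketched with NumPy as follows. The helper names `min_max_scale` and `z_score` are hypothetical, and the handling of constant columns (returning zeros) is an assumption made to avoid division by zero:

```python
import numpy as np

def min_max_scale(values: np.ndarray) -> np.ndarray:
    """Rescale values linearly to the [0, 1] range."""
    lo, hi = values.min(), values.max()
    if hi == lo:
        # Constant input: no spread to rescale (assumed convention: all zeros).
        return np.zeros_like(values, dtype=float)
    return (values - lo) / (hi - lo)

def z_score(values: np.ndarray) -> np.ndarray:
    """Center values on mean 0 and scale to unit population standard deviation."""
    std = values.std()  # NumPy default is ddof=0 (population std)
    if std == 0:
        return np.zeros_like(values, dtype=float)
    return (values - values.mean()) / std

prices = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
scaled = min_max_scale(prices)   # 0.0, 0.25, 0.5, 0.75, 1.0
standardized = z_score(prices)   # mean 0, standard deviation 1
```

Note that pandas `Series.std()` defaults to the sample standard deviation (`ddof=1`), so the pandas and NumPy versions of a z-score differ slightly unless `ddof` is set explicitly.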

Architect data pipelines that incorporate cleaning as a preprocessing step. Optimize memory usage and processing speed for large-scale data using Polars' lazy evaluation and parallel processing. Develop custom cleaning functions and decorators, and establish data validation rules (e.g., using pandera or Great Expectations). Mentor others on best practices for reproducible and auditable data transformations.
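Before adopting a validation framework, the idea of explicit validation rules can be illustrated in plain pandas. This is a sketch only: the function `validate_orders`, the column names, and the three rules are hypothetical and stand in for whatever constraints a real dataset requires.

```python
import pandas as pd

def validate_orders(df: pd.DataFrame) -> list[str]:
    """Return human-readable rule violations; an empty list means the data passed."""
    required = {"order_id", "amount"}
    missing = required - set(df.columns)
    if missing:
        # Without the required columns the remaining rules cannot run.
        return [f"missing columns: {sorted(missing)}"]

    problems = []
    if df["order_id"].duplicated().any():
        problems.append("order_id contains duplicates")
    if df["amount"].isna().any():
        problems.append("amount contains nulls")
    if (df["amount"] < 0).any():
        problems.append("amount contains negative values")
    return problems

# Hypothetical batch that violates all three rules.
orders = pd.DataFrame({"order_id": [1, 2, 2], "amount": [9.5, -1.0, None]})
issues = validate_orders(orders)
```

Returning a list of violations rather than raising on the first one makes the result easy to log and audit, which fits the goal of reproducible transformations.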

Practice Projects

Beginner

Project

Customer Contact List Deduplication & Standardization

Scenario

You receive a messy CSV file containing customer contact information with inconsistent capitalization, misspelled names, duplicate entries, and missing phone numbers.

How to Execute

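1. Load the CSV into a pandas DataFrame.
2. Standardize string fields (e.g., `str.lower()`, `str.strip()`).
3. Use `drop_duplicates()` on key columns (e.g., email) to remove exact matches. Misspelled near-duplicates will not match exactly and need a separate manual or fuzzy-matching pass.
4. Identify missing 'phone' values and fill them with a placeholder or leave them as nulls. `interpolate()` is meant for numeric or time-ordered series and is not meaningful for phone numbers.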
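The core of these steps can be sketched as follows. The records are made up, and an in-memory DataFrame stands in for the CSV load; in practice `pd.read_csv()` would supply `contacts`.

```python
import pandas as pd

# Hypothetical contact records standing in for the loaded CSV.
contacts = pd.DataFrame({
    "name": ["  Ada Lovelace", "ada lovelace ", "Alan Turing"],
    "email": ["Ada@Example.com", "ada@example.com", "alan@example.com "],
    "phone": ["555-0100", None, None],
})

# Standardize text fields so that equivalent values compare equal.
for col in ["name", "email"]:
    contacts[col] = contacts[col].str.strip().str.lower()

# Remove exact duplicates on the key column, keeping the first occurrence.
deduped = contacts.drop_duplicates(subset=["email"], keep="first").reset_index(drop=True)

# Mark missing phone numbers with an explicit placeholder.
deduped["phone"] = deduped["phone"].fillna("unknown")
```

Standardizing before deduplicating matters here: without the `strip()` and `lower()` pass, the two Ada rows would not be recognized as the same email.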

Intermediate

Project

Sales Data Aggregation with Time-Series Alignment

Scenario

Combine sales data from two regional CSV files with different date formats and currency symbols. Then, normalize sales figures to a common currency (USD) and aggregate weekly sales totals.

How to Execute

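1. Load both files and use `pd.to_datetime()` with an explicit `format` per file to parse dates into a unified datetime type.
2. Strip the currency symbols, convert the amounts to numeric values, and apply a vectorized conversion to USD using NumPy or pandas arithmetic.
3. Combine the datasets: stack them with `pd.concat()` if each file holds its own region's rows, or `merge()` on 'product_id' and 'date' if you need the regions side by side.
4. Group by week (`resample('W')` or `groupby` with a week-start column) and compute total sales, handling any missing values created when combining the files.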
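A compact sketch of this workflow is shown below. Several things are assumptions for illustration: the two small DataFrames stand in for the regional CSV files, the exchange rate `EUR_TO_USD` is a made-up constant, and the files are stacked with `pd.concat()` on the assumption that each holds its own region's rows.

```python
import numpy as np
import pandas as pd

# Hypothetical regional extracts with different date formats and currencies.
us = pd.DataFrame({
    "date": ["01/06/2025", "01/08/2025"],   # MM/DD/YYYY
    "product_id": [1, 1],
    "sales": ["$100.00", "$50.00"],
})
eu = pd.DataFrame({
    "date": ["2025-01-07", "2025-01-14"],   # ISO 8601
    "product_id": [1, 1],
    "sales": ["€200.00", "€10.00"],
})

# Parse each file with its own explicit format to avoid day/month ambiguity.
us["date"] = pd.to_datetime(us["date"], format="%m/%d/%Y")
eu["date"] = pd.to_datetime(eu["date"], format="%Y-%m-%d")

EUR_TO_USD = 1.10  # assumed fixed rate, for illustration only

# Strip symbols, convert to float, and apply the rate as a vectorized operation.
us["sales_usd"] = us["sales"].str.replace("$", "", regex=False).astype(float)
eur_amounts = eu["sales"].str.replace("€", "", regex=False).astype(float).to_numpy()
eu["sales_usd"] = np.round(eur_amounts * EUR_TO_USD, 2)

# Stack the regions, then total sales per calendar week (weeks end on Sunday).
combined = pd.concat([us, eu], ignore_index=True)
weekly = combined.set_index("date").sort_index()["sales_usd"].resample("W").sum()
```

A real pipeline would look up the rate per transaction date rather than use one constant, since exchange rates change over the period being aggregated.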

Advanced

Project

Build a Robust ETL Preprocessing Module for a Data Warehouse

Scenario

Design a reusable Python module that ingests raw, semi-structured JSON data from an API, applies a series of cleaning and normalization rules, and outputs a clean Parquet file for analytical querying. The module must handle schema evolution and log all cleaning actions.

How to Execute

1. Use Polars with `scan_ndjson` for lazy loading of large newline-delimited JSON files to limit memory use; `read_json` loads a standard JSON document eagerly.
2. Define a schema validation and cleaning pipeline as a chain of Polars expressions.
3. Implement a logging decorator to record every transformation (e.g., 'Normalized column X: converted to float, imputed 5% nulls').
4. Package the module with proper error handling and unit tests to ensure it processes new data batches predictably.

Tools & Frameworks

Core Libraries

pandas, Polars, NumPy

pandas is the industry standard for tabular data manipulation. Polars offers superior performance for large datasets with its Rust-based backend and query optimization. NumPy underpins numerical operations and array processing, essential for vectorized transformations.

Data Validation & Profiling

Great Expectations, pandas-profiling (ydata-profiling), pydantic

Great Expectations defines, documents, and validates data expectations. ydata-profiling (formerly pandas-profiling) generates comprehensive data quality reports. Pydantic validates data and manages settings using Python type annotations.

Development Environment

Jupyter Notebooks/Lab, Dask, Vaex

Jupyter is for interactive exploration and iterative cleaning. Dask scales pandas code to larger-than-memory datasets. Vaex provides lazy, out-of-core DataFrames for efficient exploration of huge tabular data.

Interview Questions

Answer Strategy

The interviewer is testing your understanding of `duplicated()` and `drop_duplicates()`, subset selection, and logical reasoning about data integrity. Strategy: Explain the choice of `keep='last'` or `keep='first'` based on business logic, then demonstrate using `sort_values()` before `drop_duplicates()` to control which record is retained. Sample: "First, I'd sort the DataFrame by `transaction_date` in descending order so that the most recent records come first. Then I'd call `df.drop_duplicates(subset=['customer_id'], keep='first')` to retain the latest transaction for each customer, so the cleaned data reflects the most current information."
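The sample answer can be demonstrated on a small made-up transactions table (the values are illustrative assumptions; the column names follow the sample answer):

```python
import pandas as pd

# Hypothetical transactions: two customers with two records each.
tx = pd.DataFrame({
    "customer_id": [1, 1, 2, 2],
    "transaction_date": pd.to_datetime(
        ["2024-01-05", "2024-03-01", "2024-02-10", "2024-01-20"]
    ),
    "amount": [20.0, 35.0, 15.0, 50.0],
})

latest = (
    tx.sort_values("transaction_date", ascending=False)    # newest first
    .drop_duplicates(subset=["customer_id"], keep="first")  # keep newest per customer
    .sort_values("customer_id")
    .reset_index(drop=True)
)
```

Sorting ascending and using `keep='last'` would give the same rows; the point is that the sort order, not the original file order, decides which record survives.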

Answer Strategy

This assesses knowledge of normalization techniques and their appropriate application. The core competency is feature scaling. Strategy: Differentiate between Min-Max Scaling and Standardization (Z-score). Sample: "I would first assess the data distribution. Distance-based algorithms such as SVM and KNN are sensitive to scale, so features must be brought onto comparable ranges. I'd use Min-Max Scaling when a bounded range such as 0 to 1 is needed and the data has few outliers, and Z-score standardization when features are roughly bell-shaped or have no natural bounds, as is common for regularized linear models. I'd implement this using `sklearn.preprocessing` or pandas with NumPy, fitting the scaler only on training data to prevent data leakage."
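The data-leakage point in the sample answer can be illustrated with a minimal NumPy sketch. The class `StandardScalerSketch` is a hypothetical stand-in written for this example, not scikit-learn's implementation; it only shows that the statistics come from the training split and are reused unchanged on the test split.

```python
import numpy as np

class StandardScalerSketch:
    """Minimal z-score scaler that learns its statistics in fit()."""

    def fit(self, X: np.ndarray) -> "StandardScalerSketch":
        self.mean_ = X.mean(axis=0)
        self.std_ = X.std(axis=0)
        self.std_[self.std_ == 0] = 1.0  # avoid division by zero for constant columns
        return self

    def transform(self, X: np.ndarray) -> np.ndarray:
        return (X - self.mean_) / self.std_

# Hypothetical split: statistics are learned from the training rows only.
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

scaler = StandardScalerSketch().fit(train)
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)  # reuses the training mean and std
```

Fitting on the combined data would let the test value shift the mean and standard deviation, so the model would indirectly see information from data it is later evaluated on.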