Skill Guide

Python data analysis with pandas, NumPy, and polars

The applied skill of using Python's pandas, NumPy, and polars libraries to ingest, clean, transform, analyze, and model structured data for insight generation.

This skill is the backbone of data-driven decision-making, enabling organizations to extract actionable insights from raw data at scale. Proficiency directly impacts business outcomes by accelerating analysis cycles, improving data quality, and enabling complex predictive modeling.

1 Careers

1 Categories

8.5 Avg Demand

20% Avg AI Risk

How to Learn Python data analysis with pandas, NumPy, and polars

Master core data structures: NumPy arrays and pandas DataFrames. Focus on fundamental operations: indexing, slicing, and vectorized calculations. Build a habit of always checking data types and handling missing values explicitly using `info()`, `dtypes`, and `isna()`.
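The inspection habit described above can be sketched on a toy table. The column names and values here are hypothetical, chosen only to show a text-typed numeric column and a few missing entries.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: a numeric column stored as text, plus missing values.
df = pd.DataFrame({
    "region": ["North", "South", None, "East"],
    "units": ["10", "12", "7", "n/a"],   # arrives as text, a common CSV artifact
    "price": [2.5, np.nan, 4.0, 3.0],
})

df.info()                                # dtypes and non-null counts per column
missing_per_column = df.isna().sum()     # explicit count of missing values

# Convert explicitly; unparseable entries become NaN instead of raising.
df["units"] = pd.to_numeric(df["units"], errors="coerce")

# Vectorized calculation on the underlying NumPy arrays, with no Python loop.
revenue = df["units"].to_numpy() * df["price"].to_numpy()
total_revenue = np.nansum(revenue)       # NaN-aware sum
```

Note that `"n/a"` is not flagged by `isna()` until the column is converted, which is one reason to check `dtypes` before trusting missing-value counts.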

Move to real-world scenarios by performing end-to-end analysis: data ingestion (CSV, Excel, SQL), cleaning (string operations, datetime parsing, outlier handling), and aggregation (groupby, pivot_table). Avoid common pitfalls like chained indexing and iterating over rows with `iterrows()`; prefer vectorized methods, and fall back to `.apply()` only when no vectorized equivalent exists, since row-wise `.apply()` still runs a Python-level loop.
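A minimal sketch of the two pitfalls named above, using a made-up three-row table. It contrasts a row loop with a vectorized expression and shows a single `.loc` assignment in place of chained indexing.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({
    "price": [10.0, 20.0, 30.0],
    "qty": [1, 5, 2],
})

# Slow pattern: a Python-level loop over rows.
slow_total = [row["price"] * row["qty"] for _, row in df.iterrows()]

# Preferred: one vectorized expression over whole columns.
df["total"] = df["price"] * df["qty"]

# Conditional logic can stay vectorized as well.
df["size"] = np.where(df["qty"] >= 5, "bulk", "single")

# Chained indexing such as df[df["qty"] >= 5]["price"] = 18.0 may write to a
# temporary copy. A single .loc call selects rows and column in one step.
df.loc[df["qty"] >= 5, "price"] = 18.0
```

Both approaches give the same totals here; the difference shows up in run time once the table grows to many rows.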

Architect scalable data pipelines. Optimize for performance using polars for large datasets (leveraging its lazy execution and multi-threading), or integrate pandas with Dask for out-of-core computation. Mentor juniors on best practices, establish coding standards, and align analysis with business KPIs.

Practice Projects

Beginner

Project

Sales Data Cleaning & Basic Analysis

Scenario

You have a raw CSV file of quarterly sales data with missing values, inconsistent date formats, and duplicate rows.

How to Execute

1. Load data with `pd.read_csv()` and inspect with `.shape`, `.info()`, and `.head()`.
2. Clean data: remove duplicates with `.drop_duplicates()`, handle missing values with `.fillna()` or `.dropna()`, and standardize date columns with `pd.to_datetime()`.
3. Perform basic analysis: calculate total sales per region using `.groupby('Region')['Sales'].sum()` and visualize the result with `matplotlib` or `seaborn`.
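The steps above can be sketched end to end on a small in-memory CSV. The file contents, the `date` column name, and the two date formats are assumptions made for illustration; a real file would be read from disk and may need other formats.

```python
import io

import pandas as pd

# Hypothetical raw export: duplicate rows, missing sales, two date formats.
raw = io.StringIO(
    "date,Region,Sales\n"
    "2024-01-15,North,100\n"
    "01/20/2024,South,250\n"
    "2024-02-03,North,\n"
    "2024-02-03,North,\n"
    "02/10/2024,South,50\n"
    "2024-01-15,North,100\n"
)

# Step 1: load and inspect.
df = pd.read_csv(raw)
print(df.shape)
df.info()

# Step 2: clean. Drop exact duplicates, then rows with no sales figure.
df = df.drop_duplicates()
df = df.dropna(subset=["Sales"])

# Parse each known format separately; values that do not match become NaT,
# so the second parse fills the gaps left by the first.
iso = pd.to_datetime(df["date"], format="%Y-%m-%d", errors="coerce")
us = pd.to_datetime(df["date"], format="%m/%d/%Y", errors="coerce")
df["date"] = iso.fillna(us)

# Step 3: total sales per region.
sales_by_region = df.groupby("Region")["Sales"].sum()
```

Dropping rows with missing sales is one choice among several; `.fillna()` with a domain-appropriate value is the alternative the steps mention. The resulting Series can be plotted with `sales_by_region.plot.bar()` when matplotlib is installed.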

Intermediate

Project

Customer Segmentation & Cohort Analysis

Scenario

Analyze an e-commerce transaction log to segment customers based on purchase behavior and perform a retention cohort analysis.

How to Execute

1. Prepare data: compute RFM (Recency, Frequency, Monetary) metrics per customer using `groupby` and `agg` functions.
2. Segment customers into clusters (e.g., 'Champions', 'At Risk') using a simple rule-based system or a clustering algorithm (e.g., KMeans from `sklearn`).
3. For cohort analysis, assign each customer to a monthly cohort based on first purchase date. Calculate retention rates by cohort month using a pivot table and visualize the retention matrix as a heatmap.
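A compact sketch of the RFM and cohort steps above, using the rule-based option rather than KMeans. The transaction log, the column names, and the segment thresholds are all hypothetical and would need tuning against real data.

```python
import pandas as pd

# Hypothetical transaction log.
tx = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2, 3],
    "order_date": pd.to_datetime([
        "2024-01-05", "2024-02-10", "2024-03-15",
        "2024-01-20", "2024-03-02",
        "2024-02-14",
    ]),
    "amount": [50.0, 30.0, 20.0, 200.0, 100.0, 15.0],
})

# Reference date for recency: one day after the last order in the log.
snapshot = tx["order_date"].max() + pd.Timedelta(days=1)

# Step 1: RFM metrics per customer.
rfm = tx.groupby("customer_id").agg(
    last_purchase=("order_date", "max"),
    frequency=("order_date", "count"),
    monetary=("amount", "sum"),
)
rfm["recency_days"] = (snapshot - rfm["last_purchase"]).dt.days

# Step 2: rule-based segments (thresholds are illustrative assumptions).
rfm["segment"] = "Other"
rfm.loc[(rfm["recency_days"] <= 30) & (rfm["frequency"] >= 2), "segment"] = "Champions"
rfm.loc[rfm["recency_days"] > 30, "segment"] = "At Risk"

# Step 3: monthly cohorts based on each customer's first purchase.
first_order = tx.groupby("customer_id")["order_date"].transform("min")
tx["cohort"] = first_order.dt.strftime("%Y-%m")
tx["period"] = (
    (tx["order_date"].dt.year - first_order.dt.year) * 12
    + (tx["order_date"].dt.month - first_order.dt.month)
)  # whole months elapsed since the first purchase

# Distinct active customers per cohort and period, then share of cohort size.
active = tx.pivot_table(
    index="cohort", columns="period", values="customer_id", aggfunc="nunique"
)
retention = active.div(active[0], axis=0)
```

The `retention` frame has one row per cohort and one column per month offset, which is the shape `seaborn.heatmap(retention, annot=True)` expects for the retention matrix.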

Advanced

Project

High-Performance Data Pipeline for Financial Tick Data

Scenario

Design and implement a data pipeline to process millions of rows of stock tick data (time, price, volume) for real-time analytics, minimizing memory footprint and execution time.

How to Execute

1. Evaluate library choice: use `polars` for its Rust-based, multi-threaded performance on large datasets. Read data using `pl.scan_csv()` for lazy loading.
2. Implement transformations: calculate VWAP (Volume-Weighted Average Price) and moving averages using window functions (`pl.col('price').rolling_mean(window_size=5)`).
3. Optimize: let polars plan the whole query by calling `.collect()` only at the end, partition data by date so that larger-than-memory datasets can be processed in pieces, and use Apache Parquet for efficient storage.
4. Architect the pipeline to run as a scheduled batch job or integrate with a streaming framework like Apache Kafka.

Tools & Frameworks

Core Libraries & Packages

pandas, NumPy, polars, PyArrow, SQLAlchemy

pandas and NumPy are the standard for exploratory analysis and modeling. polars is for high-performance, large-scale data processing. PyArrow is for efficient in-memory data interchange and Parquet I/O. SQLAlchemy is for database connectivity.

Development & Execution Environment

Jupyter Notebooks/Lab, VS Code with Python extension, Dask, Apache Spark (PySpark)

Jupyter is for interactive exploration and reporting. VS Code is for script development and debugging. Dask and PySpark are frameworks for scaling pandas-like operations across clusters for truly big data.

Data Visualization & Reporting

Matplotlib, Seaborn, Plotly, Streamlit

Matplotlib and Seaborn are for static, publication-quality plots. Plotly is for interactive web-based visualizations. Streamlit is for building data apps and dashboards from Python scripts.

Interview Questions

Answer Strategy

The candidate must demonstrate understanding of single-threaded vs. multi-threaded execution, eager vs. lazy evaluation, and memory management. Sample answer: 'pandas operates in memory, largely on a single thread, with eager execution, which is flexible for small-to-medium data but can hit memory limits. polars is built on Rust, uses multi-threading by default, and offers a lazy API in which transformations are optimized as a whole and executed only when `.collect()` is called. Choose polars for large datasets (>>10GB) where performance is critical, and pandas for complex exploratory analysis on data that fits in memory.'

Answer Strategy

This tests practical data wrangling skills and knowledge of pandas' JSON handling. The interviewer is looking for a methodical approach. Sample answer: 'If the JSON lives in a file, I would load it with `pd.read_json()`, for example with `orient='records'`. If it is stored as strings in a DataFrame column, I would first parse each value with `json.loads` inside a custom function wrapped in a try-except block, so that malformed rows do not abort the run. I would then flatten the parsed dictionaries with `pd.json_normalize()`, which replaced the deprecated `pd.io.json.json_normalize` in pandas 1.0. Finally, I would join the normalized DataFrame back onto the main DataFrame, checking that the indices align.'
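The approach in the sample answer might look like the following sketch. The `order_id` and `json_col` columns and the JSON payloads are hypothetical, and `safe_parse` is an illustrative helper, not a pandas function.

```python
import json

import pandas as pd

# Hypothetical frame with a JSON string column, including one malformed row.
df = pd.DataFrame({
    "order_id": [1, 2, 3],
    "json_col": [
        '{"user": {"name": "Ana", "city": "Lima"}, "total": 10}',
        "not valid json",
        '{"user": {"name": "Bo"}, "total": 7}',
    ],
})


def safe_parse(text):
    """Return the parsed object, or an empty dict if parsing fails."""
    try:
        return json.loads(text)
    except (TypeError, json.JSONDecodeError):
        return {}


# Parse row by row, then flatten nested keys into dotted column names.
parsed = df["json_col"].apply(safe_parse)
flat = pd.json_normalize(parsed.tolist())

# json_normalize returns a fresh 0..n-1 index; realign it before joining.
flat.index = df.index
result = df.drop(columns="json_col").join(flat)
```

Rows that failed to parse stay in `result` with missing values in the flattened columns, which keeps the row count unchanged and makes the failures easy to count with `isna()`.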