AI Knowledge Curator
AI Knowledge Curators design, organize, and maintain the structured knowledge ecosystems that power AI systems - from RAG pipeline…
Skill Guide
Python scripting for data processing and API integration is the practice of writing Python code to automate the extraction, transformation, and loading (ETL) of data from disparate sources, including internal databases and external services via their Application Programming Interfaces (APIs).
Scenario
Create a script that fetches daily weather data from a free public API (like OpenWeatherMap) for a list of cities and saves it to a CSV file.
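A minimal sketch of this first scenario might look like the following. It assumes OpenWeatherMap's current-weather endpoint and response shape (`main.temp`, `main.humidity`, `weather[0].description`), which should be checked against the provider's documentation. The helper names `fetch_weather`, `to_row`, and `save_weather` are hypothetical. The standard library is used so the sketch runs without extra installs; `requests` would be a natural drop-in for the HTTP call.

```python
import csv
import json
import urllib.parse
import urllib.request

# Assumed endpoint for OpenWeatherMap's current-weather API; verify before use.
API_URL = "https://api.openweathermap.org/data/2.5/weather"
FIELDS = ["city", "temp_c", "humidity", "description"]


def fetch_weather(city, api_key):
    """Fetch current weather for one city and return the parsed JSON payload."""
    query = urllib.parse.urlencode({"q": city, "appid": api_key, "units": "metric"})
    with urllib.request.urlopen(f"{API_URL}?{query}", timeout=10) as resp:
        return json.load(resp)


def to_row(city, payload):
    """Flatten the fields of interest; the payload shape is an assumption."""
    return {
        "city": city,
        "temp_c": payload["main"]["temp"],
        "humidity": payload["main"]["humidity"],
        "description": payload["weather"][0]["description"],
    }


def save_weather(cities, path, fetch):
    """Call `fetch(city)` for each city and write one CSV row per city."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        for city in cities:
            writer.writerow(to_row(city, fetch(city)))
```

With an API key read from an environment variable, a call such as `save_weather(["London", "Tokyo"], "weather.csv", lambda c: fetch_weather(c, key))` would produce the file. Passing the fetcher in as an argument keeps the CSV logic testable without network access.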
Scenario
Build a script that extracts new sales leads from a CRM API (e.g., Salesforce or HubSpot), cleans and transforms the data (standardizing phone numbers, mapping stages), and loads it into a destination like a PostgreSQL database or a Google Sheet.
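The transform step of this scenario could be sketched roughly as below. The stage labels in `STAGE_MAP`, the record keys (`email`, `phone`, `stage`), and the rule that 10-digit numbers belong to a default country code are all illustrative assumptions, not the schema of Salesforce or HubSpot.

```python
import re

# Hypothetical mapping from CRM stage labels to a destination schema.
STAGE_MAP = {
    "new": "lead",
    "open": "lead",
    "working": "qualified",
    "closed won": "customer",
}


def standardize_phone(raw, default_country_code="1"):
    """Return an E.164-style string such as +15551234567, or None if unusable.

    Assumes 10-digit numbers belong to `default_country_code` (a simplification).
    """
    if not raw:
        return None
    digits = re.sub(r"\D", "", raw)  # strip spaces, dashes, parentheses, '+'
    if len(digits) == 10:
        digits = default_country_code + digits
    if not 11 <= len(digits) <= 15:
        return None
    return "+" + digits


def transform_lead(record):
    """Clean one raw lead dict into the shape the destination expects."""
    stage = (record.get("stage") or "").strip().lower()
    return {
        "email": (record.get("email") or "").strip().lower() or None,
        "phone": standardize_phone(record.get("phone")),
        "stage": STAGE_MAP.get(stage, "unknown"),
    }
```

Mapping unrecognized stages to `"unknown"` rather than raising keeps the load running while still making gaps in the mapping easy to query afterwards.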
Scenario
Design and build a system that aggregates financial data (stock prices from Alpha Vantage, news sentiment from NewsAPI, and social media mentions from Twitter), performs correlation analysis, and generates a daily automated report with visualizations.
`pandas` is the cornerstone for in-memory data manipulation. `requests` is the standard for HTTP interactions. `SQLAlchemy` provides a robust ORM for database interactions. `python-dotenv` manages secrets and configuration via environment variables.
For scaling beyond single-machine `pandas` limitations, `Dask` provides parallel computing. `Polars` is a high-performance DataFrame library. `NumPy` underpins `pandas` for numerical operations.
`Prefect` and `Airflow` orchestrate complex, multi-step data pipelines. `Docker` containerizes scripts for consistent execution. `GitHub Actions` automates testing and deployment (CI/CD).
`Postman` is essential for manually testing and debugging APIs before scripting. Understanding the `OpenAPI Spec` (Swagger) helps auto-generate client code. `httpx` is a modern alternative to `requests` with async support for I/O-bound concurrency.
Answer Strategy
The candidate must demonstrate knowledge of pagination patterns, rate limiting, and error handling. A strong answer will include: 1) Identifying the pagination style (offset, cursor, or `Link` header). 2) Implementing a loop with a delay (e.g., `time.sleep()`) or a more sophisticated token-bucket rate limiter. 3) Checking for HTTP 429 (Too Many Requests) responses, honoring the `Retry-After` header when the API sends one, and wrapping requests in `try`/`except` blocks to catch network errors, retrying with backoff. 4) Considering checkpointing to resume from the last successful page if the script fails midway.
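The four points above can be sketched together for a cursor-paginated API. This is an illustration under stated assumptions, not a reference implementation: `fetch_page` is a hypothetical caller-supplied function returning `(items, next_cursor)`, `RateLimited` is a hypothetical exception the fetcher raises on HTTP 429, and the checkpoint is a small JSON file.

```python
import json
import os
import time


class RateLimited(Exception):
    """Hypothetical error raised by the fetcher on HTTP 429."""

    def __init__(self, retry_after=1.0):
        super().__init__(f"rate limited, retry after {retry_after}s")
        self.retry_after = retry_after


def load_checkpoint(path):
    """Return the saved cursor, or None when starting from the first page."""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return json.load(f)["cursor"]
    return None


def save_checkpoint(path, cursor):
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"cursor": cursor}, f)


def fetch_all(fetch_page, checkpoint_path, delay=0.5, max_retries=3, sleep=time.sleep):
    """Yield items from every page, resuming from the saved cursor if present.

    `fetch_page(cursor)` must return (items, next_cursor), with next_cursor
    set to None on the last page.
    """
    cursor = load_checkpoint(checkpoint_path)
    while True:
        for attempt in range(max_retries + 1):
            try:
                items, next_cursor = fetch_page(cursor)
                break
            except (RateLimited, ConnectionError) as exc:
                if attempt == max_retries:
                    raise
                # Use the server's hint when available, with exponential backoff.
                wait = getattr(exc, "retry_after", delay)
                sleep(wait * 2 ** attempt)
        yield from items
        if next_cursor is None:
            break
        cursor = next_cursor
        save_checkpoint(checkpoint_path, cursor)  # the next run resumes here
        sleep(delay)  # fixed pause between pages as a simple rate limit
    if os.path.exists(checkpoint_path):
        os.remove(checkpoint_path)  # finished cleanly, so start fresh next time
```

Because the checkpoint is written only after a page's items have been consumed, a crash leads to a page being fetched again rather than skipped, so downstream writes should tolerate duplicates.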
Answer Strategy
This tests practical data wrangling skills. The candidate should outline a clear, step-by-step process: 1) **Profiling**: Quickly assess the data schema, missing values, and inconsistent formats. 2) **Handling Inconsistencies**: Provide concrete examples (e.g., normalizing date strings with `pd.to_datetime`, mapping categorical variables with `.map()`, handling missing values with `.fillna()` or imputation). 3) **Validation**: Mention adding checks (e.g., `assert` statements, schema validation with `pandera` or `pydantic`). 4) **Documentation**: Emphasize the importance of documenting transformation logic for reproducibility. A sample response: 'I once received JSON customer data with inconsistent country codes and null emails. I first used `pandas` to profile the data, finding ~15% null emails and 3 different codes for the US. I standardized countries using a mapping dictionary, used domain logic to impute some missing emails, and dropped records that were incomplete for critical analysis fields, logging all transformations for auditability.'
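The profiling, standardization, and validation steps in the sample response could look something like this in `pandas`. The column names (`email`, `country`) and the contents of `COUNTRY_MAP` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical mapping that collapses variant spellings to one code.
COUNTRY_MAP = {
    "US": "US",
    "USA": "US",
    "UNITED STATES": "US",
    "GB": "GB",
    "UK": "GB",
}


def profile(df):
    """Return the share of missing values per column."""
    return df.isna().mean()


def clean_customers(df):
    """Standardize country codes and drop rows missing critical fields."""
    out = df.copy()
    normalized = out["country"].str.strip().str.upper()
    out["country"] = normalized.map(COUNTRY_MAP)
    # Validation: fail loudly on codes the mapping does not cover.
    unmapped = normalized[out["country"].isna() & normalized.notna()]
    assert unmapped.empty, f"unmapped country codes: {sorted(unmapped.unique())}"
    # Drop records that lack fields the analysis treats as critical.
    return out.dropna(subset=["email", "country"]).reset_index(drop=True)
```

Running `profile(raw)` before and after `clean_customers(raw)` gives a quick, loggable record of what the transformation changed. For stricter guarantees, a schema library such as `pandera` could replace the bare `assert`.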