AI Data Catalog Specialist
An AI Data Catalog Specialist designs, curates, and governs metadata-rich data catalogs that power AI and ML initiatives across th…
Skill Guide
This skill covers developing Python programs that automatically manage product or content catalogs by calling external or internal APIs to fetch, transform, validate, and push data, which eliminates manual data entry and synchronization.
Scenario
Fetch a list of 'products' from the public JSONPlaceholder API (simulating a supplier feed; the service has no product resource, so a resource such as `/posts` can stand in for one), clean the data, and write it to a local CSV file that simulates your internal catalog format.
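A minimal sketch of this scenario, assuming JSONPlaceholder's `/posts` resource stands in for the product feed. The column names (`sku`, `name`, `description`), the `SKU-00001` format, and the helper names are hypothetical. The standard-library `urllib` is used only to keep the sketch dependency-free; `requests` would work equally well.

```python
import csv
import json
from urllib.request import urlopen

# Assumed stand-in for a supplier feed.
FEED_URL = "https://jsonplaceholder.typicode.com/posts"


def fetch_feed(url=FEED_URL, timeout=10):
    """Download the raw feed and decode it as JSON (expected: a list of dicts)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def clean_records(records):
    """Map raw feed items to the hypothetical internal catalog columns."""
    cleaned, seen = [], set()
    for item in records:
        name = " ".join(str(item.get("title", "")).split())  # collapse whitespace
        if item.get("id") is None or not name:
            continue  # skip items without an ID or a usable name
        sku = f"SKU-{int(item['id']):05d}"
        if sku in seen:
            continue  # drop duplicate IDs, keeping the first occurrence
        seen.add(sku)
        description = " ".join(str(item.get("body", "")).split())
        cleaned.append({"sku": sku, "name": name.title(), "description": description})
    return cleaned


def write_catalog(rows, handle):
    """Write cleaned rows as CSV to any text file handle."""
    writer = csv.DictWriter(handle, fieldnames=["sku", "name", "description"])
    writer.writeheader()
    writer.writerows(rows)
```

To produce the file, open the target with `open("catalog.csv", "w", newline="", encoding="utf-8")` and call `write_catalog(clean_records(fetch_feed()), handle)`. Keeping the fetch, clean, and write steps separate lets the cleaning logic be tested without network access.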
Scenario
You have an internal list of product SKUs in a CSV file. Your task is to query a mock supplier API (create one using FastAPI or Flask) for each SKU to get its current price and availability, then update your internal CSV with this new information, handling cases where an SKU is not found.
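One way to structure the update step is sketched below. It assumes the mock supplier API exposes a hypothetical `GET /products/{sku}` route that returns JSON with `price` and `in_stock` keys and answers 404 for unknown SKUs; the route, the keys, and the `status` column are illustrative choices, not part of any real API.

```python
import json
from urllib.error import HTTPError
from urllib.parse import quote
from urllib.request import urlopen


def make_http_lookup(base_url, timeout=10):
    """Build a lookup function for the assumed GET {base_url}/products/{sku} route."""
    def lookup(sku):
        try:
            with urlopen(f"{base_url}/products/{quote(sku, safe='')}", timeout=timeout) as resp:
                return json.load(resp)
        except HTTPError as exc:
            if exc.code == 404:
                return None  # the supplier does not know this SKU
            raise  # any other HTTP error is a real failure
    return lookup


def update_catalog(rows, lookup):
    """Refresh price and availability per row; return (updated_rows, missing_skus).

    `lookup` is any callable that takes a SKU and returns a dict with
    "price" and "in_stock" keys, or None when the SKU is not found.
    """
    updated, not_found = [], []
    for row in rows:
        row = dict(row)  # copy so the caller's rows are not mutated
        info = lookup(row["sku"])
        if info is None:
            not_found.append(row["sku"])
            row["status"] = "not_found"  # keep the old values, flag for review
        else:
            row["price"] = f"{float(info['price']):.2f}"
            row["in_stock"] = "yes" if info["in_stock"] else "no"
            row["status"] = "updated"
        updated.append(row)
    return updated, not_found
```

Rows can come straight from `csv.DictReader` and go back out through `csv.DictWriter`. Passing the lookup in as a callable means the same function works against the mock server or against a plain dictionary in tests.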
Scenario
Build a pipeline that aggregates product data from three different sources: 1) a supplier REST API (JSON), 2) an internal ERP's legacy XML file feed, and 3) a partner's SFTP CSV drop. The script must reconcile data, resolve conflicts (e.g., different prices for the same SKU), validate completeness, and generate a final unified catalog report and exception log.
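The reconciliation core of such a pipeline might look like the sketch below. Everything specific in it is an assumption: the XML layout, the source names, the rule that the ERP wins price conflicts, and the exception record shape. Downloading from the API and the SFTP drop is left out so the merge logic stays visible.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Assumed precedence when sources disagree: earlier names win.
SOURCE_PRIORITY = ["erp", "supplier", "partner"]
REQUIRED_FIELDS = ("sku", "name", "price")


def parse_erp_xml(text):
    """Parse a hypothetical feed: <products><product sku="..."><name/><price/></product></products>."""
    return [
        {"sku": node.get("sku"), "name": node.findtext("name"), "price": node.findtext("price")}
        for node in ET.fromstring(text).iter("product")
    ]


def parse_partner_csv(text):
    """Parse a CSV drop whose header row is assumed to be sku,name,price."""
    return list(csv.DictReader(io.StringIO(text)))


def reconcile(sources):
    """Merge {source_name: [records]} by SKU; return (catalog, exceptions)."""
    by_sku, exceptions = {}, []
    for source, records in sources.items():
        for record in records:
            if not record.get("sku"):
                exceptions.append({"sku": None, "issue": "no_sku", "detail": source})
                continue
            by_sku.setdefault(record["sku"], {})[source] = record

    catalog = []
    for sku in sorted(by_sku):
        versions = by_sku[sku]
        merged = {"sku": sku}
        for field in ("name", "price"):
            # Take the value from the highest-priority source that has one.
            for source in SOURCE_PRIORITY:
                value = versions.get(source, {}).get(field)
                if value not in (None, ""):
                    merged[field] = value
                    break
        # Log a conflict when sources report different prices for this SKU.
        prices = {
            s: float(r["price"]) for s, r in versions.items()
            if r.get("price") not in (None, "")
        }
        if len(set(prices.values())) > 1:
            exceptions.append({"sku": sku, "issue": "price_conflict", "detail": prices})
        # Completeness check: incomplete entries go to the log, not the catalog.
        missing = [f for f in REQUIRED_FIELDS if f not in merged]
        if missing:
            exceptions.append({"sku": sku, "issue": "missing_fields", "detail": missing})
            continue
        merged["price"] = float(merged["price"])
        catalog.append(merged)
    return catalog, exceptions
```

The returned `catalog` feeds the unified report and `exceptions` feeds the exception log. A fixed source priority is only one possible conflict rule; "most recent update wins" or "flag for manual review" would slot into the same place.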
`requests` is the de facto standard for synchronous HTTP. `httpx` offers a similar interface and adds async support for high-concurrency calls. `pandas` is essential for wrangling and transforming tabular catalog data. The standard-library `json` and `csv` modules handle the common data formats.
Use Pydantic to define strict data models for API responses and catalog entries. It provides automatic validation, serialization, and documentation, ensuring data integrity before it enters your system.
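As an illustration, a catalog entry model might look like the sketch below. The field names and constraints are hypothetical, and the code sticks to `BaseModel`, `Field`, and `ValidationError` so that it should behave the same under Pydantic 1.x and 2.x.

```python
from typing import Optional

from pydantic import BaseModel, Field, ValidationError


class CatalogEntry(BaseModel):
    """Hypothetical catalog schema; adjust fields and limits to your own catalog."""
    sku: str = Field(..., min_length=1)
    name: str = Field(..., min_length=1)
    price: float = Field(..., gt=0)   # must be strictly positive
    stock: int = Field(0, ge=0)       # defaults to 0, never negative
    description: Optional[str] = None


def validate_batch(raw_items):
    """Split raw dicts into validated entries and rejected items with their errors."""
    valid, rejected = [], []
    for raw in raw_items:
        try:
            valid.append(CatalogEntry(**raw))
        except ValidationError as exc:
            rejected.append({"item": raw, "errors": exc.errors()})
    return valid, rejected
```

Collecting rejected items instead of raising on the first failure lets one bad record be logged for review without blocking the rest of the batch.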
Scheduling and orchestration tools run catalog scripts on a schedule. `APScheduler` is simple for in-process scheduling. `Celery` is a distributed task queue for scaling jobs across workers, with periodic jobs driven by its `celery beat` scheduler. `Airflow` is an enterprise-grade orchestrator for complex, multi-step data pipelines with monitoring.
`paramiko` provides programmatic SFTP access to file-based catalogs. `requests-oauthlib` handles the OAuth 2.0 flows required by modern APIs (e.g., Google, Salesforce). Understand webhooks as well: they deliver event-driven, real-time updates, so the script does not have to poll.
Answer Strategy
Demonstrate a structured, production-minded approach. Focus on resilience, logging, and idempotency. Sample Answer: 'I start by studying the API documentation to understand endpoints, pagination, and rate limits. I implement the script using the `requests` library with a session object. For resilience, I wrap calls in a retry decorator with exponential backoff for transient errors and implement logic to handle 429 status codes by respecting `Retry-After` headers. I process data in chunks, validate each item against a Pydantic model, and use an UPSERT pattern in the database write to ensure idempotency. I log all failures and mismatches to a separate file for review.'
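The retry logic described in the sample answer can be sketched without tying it to a particular HTTP client. Here `send` is a hypothetical zero-argument callable that returns `(status_code, headers, body)`; the set of retryable statuses and the delay values are assumptions, and only the integer-seconds form of `Retry-After` is handled.

```python
import time

# Statuses treated as transient in this sketch.
RETRYABLE = {429, 500, 502, 503, 504}


def call_with_retry(send, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call `send()` until it returns a non-retryable status or attempts run out.

    Returns (status, body) on success; raises RuntimeError after the last attempt.
    """
    for attempt in range(1, max_attempts + 1):
        status, headers, body = send()
        if status not in RETRYABLE:
            return status, body
        if attempt == max_attempts:
            break  # no point sleeping after the final attempt
        retry_after = headers.get("Retry-After", "")
        if status == 429 and retry_after.isdigit():
            delay = int(retry_after)  # wait as long as the server asked
        else:
            delay = base_delay * 2 ** (attempt - 1)  # 0.5, 1, 2, 4, ... seconds
        sleep(delay)
    raise RuntimeError(f"gave up after {max_attempts} attempts (last status {status})")
```

With `requests`, `send` could be a small lambda that performs the call through a session and returns `(resp.status_code, resp.headers, resp.text)`. Injecting `sleep` keeps the function testable without real waiting, and adding random jitter to the backoff delay is a common refinement when many workers retry at once.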
Answer Strategy
Test system thinking, risk assessment, and incremental improvement. The focus is on strategy, not just rewriting code. Sample Answer: 'First, I would assess the script's inputs, outputs, and failure modes without changing it. I'd set up comprehensive logging and monitoring for the current production process. My modernization strategy would be incremental: 1) Port critical sections to Python 3, adding unit tests for core logic. 2) Refactor the data transformation layer using Pydantic for validation. 3) Abstract the old and new API calls behind a common interface using an adapter pattern, allowing me to implement the new API integration without disrupting existing flows. 4) Finally, replace the old connector, using feature flags for safe rollout. The key is maintaining business continuity throughout.'
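Step 3 of the sample answer, the adapter pattern behind a feature flag, might be sketched as follows. The class names, the `use_new_supplier_api` flag, and the data shapes (the legacy source storing prices as cent strings, the new API returning a JSON-like dict) are all hypothetical.

```python
from typing import Protocol


class SupplierClient(Protocol):
    """Common interface the rest of the script depends on."""
    def get_price(self, sku: str) -> float: ...


class LegacyApiAdapter:
    """Wraps the old connector, represented here by a dict of prices in cents."""
    def __init__(self, legacy_prices):
        self._prices = legacy_prices

    def get_price(self, sku: str) -> float:
        return int(self._prices[sku]) / 100


class NewApiAdapter:
    """Wraps the new API, represented here by a callable returning a dict."""
    def __init__(self, fetch):
        self._fetch = fetch

    def get_price(self, sku: str) -> float:
        return float(self._fetch(sku)["price"])


def make_client(flags, legacy: SupplierClient, new: SupplierClient) -> SupplierClient:
    """Pick the implementation from a feature flag; the legacy path is the default."""
    return new if flags.get("use_new_supplier_api") else legacy


def sync_prices(client: SupplierClient, skus):
    """Business logic sees only the common interface, never a concrete API."""
    return {sku: client.get_price(sku) for sku in skus}
```

Because `sync_prices` depends only on `SupplierClient`, the new integration can be rolled out or rolled back by flipping the flag, and both adapters can be run side by side to compare results before the old connector is retired.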