AI Academic Research Assistant Developer
An AI Academic Research Assistant Developer builds intelligent systems that automate and enhance scholarly research workflows, fro…
Skill Guide
The ability to use Python to programmatically extract, transform, load, and analyze structured and unstructured data, coupled with the capability to design, build, and consume robust RESTful APIs to automate data workflows and integrate systems.
Scenario
Build a script that pulls data from a public API (e.g., GitHub's REST API for repositories, or a weather API), cleans the JSON response, and outputs a structured CSV report.
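A minimal sketch of this scenario, using only the standard library so it runs without extra installs. It assumes GitHub's public `GET /users/{user}/repos` endpoint and a small subset of its response fields; the function names (`fetch_repos`, `clean_repos`, `write_report`) and the chosen CSV columns are illustrative, not part of the original guide.

```python
import csv
import io
import json
import urllib.request

# Columns of the CSV report (an illustrative choice).
FIELDS = ["name", "language", "stars", "updated_at"]


def fetch_repos(user):
    """Fetch a user's public repositories from GitHub's REST API (network call)."""
    url = f"https://api.github.com/users/{user}/repos"
    req = urllib.request.Request(url, headers={"Accept": "application/vnd.github+json"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)


def clean_repos(raw):
    """Keep the fields of interest, fill missing values, and sort by stars."""
    rows = []
    for repo in raw:
        if not repo.get("name"):
            continue  # skip malformed records without a name
        rows.append({
            "name": repo["name"].strip(),
            "language": repo.get("language") or "unknown",
            "stars": int(repo.get("stargazers_count") or 0),
            # Keep only the date part of the ISO 8601 timestamp.
            "updated_at": (repo.get("updated_at") or "")[:10],
        })
    return sorted(rows, key=lambda r: r["stars"], reverse=True)


def write_report(rows, out):
    """Write the cleaned rows as CSV to any text file-like object."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    # Offline demo with a hand-written sample instead of a live API call.
    sample = [{"name": "demo", "language": None, "stargazers_count": 2,
               "updated_at": "2024-01-02T03:04:05Z"}]
    buffer = io.StringIO()
    write_report(clean_repos(sample), buffer)
    print(buffer.getvalue())
```

Separating fetching from cleaning keeps the transformation testable without network access; in a real run, `write_report(clean_repos(fetch_repos("some-user")), open("report.csv", "w", newline=""))` would produce the file.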
Scenario
Create a service that extracts data from two different internal APIs (e.g., one for sales, one for inventory), merges them on a common key, calculates daily KPIs (like stock turnover ratio), and pushes the transformed dataset to a database or a third-party BI tool API.
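The merge-and-KPI step of this scenario might look like the sketch below. The column names (`date`, `sku`, `units_sold`, `units_on_hand`) are hypothetical, and stock turnover is approximated here as units sold divided by units on hand, since the guide does not define the formula; the API extraction and the final push to a database are left out.

```python
import pandas as pd


def daily_turnover(sales, inventory):
    """Merge sales and inventory records on (date, sku) and compute a daily KPI.

    Both arguments are lists of dicts, as they might arrive from two JSON APIs.
    """
    sales_df = pd.DataFrame(sales)
    inv_df = pd.DataFrame(inventory)

    # Collapse individual sales lines to one row per day and product.
    daily_sales = sales_df.groupby(["date", "sku"], as_index=False)["units_sold"].sum()

    # validate= makes pandas raise if either side has duplicate keys.
    merged = daily_sales.merge(
        inv_df, on=["date", "sku"], how="inner", validate="one_to_one"
    )

    # Replace zero stock with NaN so the ratio is NaN instead of infinity.
    on_hand = merged["units_on_hand"].where(merged["units_on_hand"] > 0)
    merged["stock_turnover"] = merged["units_sold"] / on_hand
    return merged


if __name__ == "__main__":
    sales = [
        {"date": "2024-01-01", "sku": "A", "units_sold": 5},
        {"date": "2024-01-01", "sku": "A", "units_sold": 5},
    ]
    inventory = [{"date": "2024-01-01", "sku": "A", "units_on_hand": 40}]
    print(daily_turnover(sales, inventory))
```

The resulting frame could then be written out with `DataFrame.to_sql` or serialized for a BI tool's API, depending on the target system.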
Scenario
Design and implement an API gateway service (using FastAPI) that orchestrates data from multiple downstream microservices, performs real-time data enrichment and aggregation, handles high-throughput requests, and exposes a unified, versioned API for front-end clients.
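The orchestration core of such a gateway can be sketched with plain `asyncio`, independent of the web framework. The two downstream calls below (`fetch_profile`, `fetch_orders`) are hypothetical stand-ins that simulate latency; in a real service they would be HTTP requests, and a FastAPI route handler under a versioned path could simply await `aggregate_user`.

```python
import asyncio


async def fetch_profile(user_id):
    """Hypothetical stand-in for a call to a profile microservice."""
    await asyncio.sleep(0.01)  # simulate network latency
    return {"user_id": user_id, "name": f"user-{user_id}"}


async def fetch_orders(user_id):
    """Hypothetical stand-in for a call to an orders microservice."""
    await asyncio.sleep(0.01)
    return [{"order_id": 1, "total": 20.0}, {"order_id": 2, "total": 5.5}]


async def aggregate_user(user_id, timeout=1.0):
    """Call both services concurrently and return one enriched response."""
    profile, orders = await asyncio.gather(
        asyncio.wait_for(fetch_profile(user_id), timeout),
        asyncio.wait_for(fetch_orders(user_id), timeout),
        return_exceptions=True,  # collect failures instead of raising at once
    )

    # The profile is treated as essential; without it the request fails.
    if isinstance(profile, Exception):
        raise RuntimeError("profile service unavailable") from profile

    # Orders are treated as optional: degrade gracefully if that call failed.
    degraded = isinstance(orders, Exception)
    if degraded:
        orders = []

    return {
        "api_version": "v1",
        **profile,
        "order_count": len(orders),
        "order_total": sum(order["total"] for order in orders),
        "degraded": degraded,
    }


if __name__ == "__main__":
    print(asyncio.run(aggregate_user(7)))
```

Running the downstream calls with `asyncio.gather` keeps latency close to that of the slowest service instead of the sum of all of them, and the per-call timeout prevents one slow dependency from stalling the whole response.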
`pandas` is the industry standard for tabular data. `NumPy` handles efficient numerical computation. `Polars` is a high-performance DataFrame library. `Dask` extends `pandas` for parallel and out-of-core computation on larger-than-memory datasets.
`FastAPI` is a modern framework for building high-performance APIs with automatic interactive docs. `Flask` is a minimalist framework. `Requests` handles synchronous HTTP calls; `httpx` supports both synchronous and async clients. `Pydantic` provides data validation and settings management.
`SQLAlchemy` is a widely used SQL toolkit and ORM for database interaction. `Airflow`/`Prefect` are workflow orchestrators for scheduling complex, multi-step data pipelines. `Redis` is an in-memory data store commonly used for caching and as a task-queue backend.
Answer Strategy
Tests understanding of pandas memory limitations and scalable solutions. The answer should outline: 1) Using chunking with `pd.read_csv(..., chunksize=N)` and processing in batches. 2) Considering `dask.dataframe` for out-of-core computation. 3) Performing a database join if the data is stored in SQL. 4) Pre-filtering and selecting only the necessary columns before the merge.
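Points 1 and 4 can be combined in a short sketch: read the large file in chunks, load only the needed columns, and merge each chunk against a lookup table small enough to stay in memory. The function name `merge_in_chunks` and the sample column names are illustrative assumptions.

```python
import io

import pandas as pd


def merge_in_chunks(big_csv, lookup, key, usecols, chunksize=100_000):
    """Yield merged chunks so the large file is never fully loaded at once.

    big_csv: path or file-like object for the large CSV.
    lookup:  small DataFrame that fits comfortably in memory.
    """
    reader = pd.read_csv(big_csv, usecols=usecols, chunksize=chunksize)
    for chunk in reader:
        # Each chunk is an ordinary DataFrame with at most `chunksize` rows.
        yield chunk.merge(lookup, on=key, how="inner")


if __name__ == "__main__":
    big = io.StringIO("id,amount,note\n1,10,x\n2,20,y\n3,30,z\n")
    lookup = pd.DataFrame({"id": [1, 3], "region": ["EU", "US"]})
    for part in merge_in_chunks(big, lookup, key="id",
                                usecols=["id", "amount"], chunksize=2):
        # In practice each part would be appended to a file or a database.
        print(part)
```

This pattern only works when one side of the merge is small; if both inputs exceed memory, `dask.dataframe` or a database join (points 2 and 3) is the more suitable route.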
Answer Strategy
Tests problem-solving and system design thinking. The answer must include: 1) Checking server logs and application logs for stack traces and error types (OOM, timeout). 2) Verifying file size limits and timeout settings in the web server (Nginx) and framework (FastAPI). 3) Profiling the memory usage of the data processing function. 4) Implementing a solution like streaming the file processing, increasing resource limits, or offloading the job to a task queue (Celery).
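The streaming option in point 4 can be illustrated with a framework-independent sketch: the file is read in fixed-size chunks, so memory use stays bounded regardless of file size, and an explicit size limit rejects oversized input early. The names `process_stream` and `UploadTooLarge`, and the choice to compute a checksum and line count, are assumptions made for illustration.

```python
import hashlib
import io


class UploadTooLarge(Exception):
    """Raised when the stream exceeds the configured size limit."""


def process_stream(stream, max_bytes, chunk_size=64 * 1024):
    """Process a binary file-like object chunk by chunk.

    Only one chunk is held in memory at a time, unlike stream.read(),
    which would load the whole file.
    """
    digest = hashlib.sha256()
    total = 0
    newlines = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break  # end of stream
        total += len(chunk)
        if total > max_bytes:
            raise UploadTooLarge(f"stream exceeds {max_bytes} bytes")
        digest.update(chunk)
        newlines += chunk.count(b"\n")
    return {"bytes": total, "lines": newlines, "sha256": digest.hexdigest()}


if __name__ == "__main__":
    data = io.BytesIO(b"a,b\n1,2\n3,4\n")
    print(process_stream(data, max_bytes=1024))
```

When the per-chunk work is heavy, the same loop can run inside a background worker (for example a Celery task) so the HTTP request returns quickly instead of timing out.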