AI High-Frequency Trading Analyst
An AI High-Frequency Trading Analyst designs, deploys, and continuously optimizes machine-learning-driven trading systems that exe…
Skill Guide
The engineering discipline of leveraging parallel processing hardware to accelerate the computation-heavy phases of machine learning (training) and to minimize latency and resource consumption during serving (inference), with the explicit goal of achieving scalable, cost-effective, and reliable production systems.
Scenario
You are given a standard PyTorch script for training ResNet-50 on ImageNet. The training is slower than expected and GPU utilization is inconsistent.
Scenario
Convert a PyTorch BERT model to serve high-throughput, low-latency inference for a real-time API. Target a P99 latency under 50ms on a single GPU.
Scenario
You need to fine-tune a 13B-parameter LLM on a cluster of 8 nodes, each with 8x A100 80GB GPUs. The naive data-parallel approach runs out of memory and has poor scaling efficiency.
Used to identify performance bottlenecks at the system, kernel, and operator level. Nsight Systems gives a high-level timeline view; Nsight Compute provides deep kernel-level analysis. Framework profilers are integrated for framework-specific insights.
PyTorch AMP enables automatic mixed-precision. `torch.compile` (TorchDynamo) graph-captures Python code for backend optimizations. Apex provides fused kernels. DeepSpeed and Horovod are for scaling distributed training with optimizations like ZeRO.
TensorRT and ONNX Runtime perform graph optimization and kernel fusion for low-latency inference. Triton and TF-Serving are production-grade model servers that handle model versioning, batching, and concurrent multi-model serving.
Answer Strategy
The interviewer is testing systematic debugging and knowledge of distributed training bottlenecks. The candidate should outline a step-by-step profiling approach. **Sample Answer**: 'First, I'd verify linear scaling on a single GPU by profiling to establish a baseline. Then, I'd use PyTorch profiler with DDP to compare traces between 1 and 8 GPUs, focusing on communication ops (allreduce) and looking for imbalances. Common causes include uneven data partitioning, slow data loading that can't keep up, or overhead from synchronizing small tensors. I'd check if `find_unused_parameters` is a bottleneck and ensure the model's computation graph is static. Finally, I'd benchmark the interconnect bandwidth (e.g., NCCL tests) to rule out network issues.'
Answer Strategy
This tests practical cost-optimization trade-offs. The candidate should mention a multi-pronged approach. **Sample Answer**: 'My strategy focuses on reducing compute per request and improving hardware utilization. First, I'd profile and apply TensorRT optimization to fuse layers and use lower precision (FP16/INT8). Second, I'd implement dynamic batching in Triton to increase GPU throughput. Third, I'd evaluate model pruning or a smaller, distilled model that meets accuracy requirements. Finally, I'd conduct a cost-performance analysis across different GPU types (e.g., T4 vs. A10G) to select the most cost-effective hardware for the optimized workload.'
1 career found
Try a different search term.