AI Infrastructure Engineer
AI Infrastructure Engineers design, build, and maintain the foundational systems that power machine learning workloads at scale - …
Skill Guide
A set of specialized networking technologies and protocols designed for ultra-low-latency, high-bandwidth communication in high-performance computing (HPC) and distributed AI training clusters. They rely on remote direct memory access (RDMA), which lets one server read and write another server's memory without involving the remote CPU on the data path.
Scenario
Establish basic connectivity and measure fundamental latency/throughput between two servers equipped with InfiniBand HCAs.
Scenario
Profile and optimize the performance of a multi-GPU, multi-node training job for a model like ResNet-50 on a small cluster (e.g., 4 nodes, 4 GPUs each).
Scenario
Architect the network for a new 128-GPU AI training cluster (8 nodes, 16 GPUs/node) with dual-rail 400G NDR InfiniBand, optimizing for large LLM training jobs (e.g., >100B parameters).
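As a rough sizing aid for the scenario above, the sketch below estimates per-node injection bandwidth and an idealized ring all-reduce time. It rests on several assumptions that are not stated in the scenario: "dual-rail" is read as two 400 Gb/s links per node, gradients are stored in bf16 (2 bytes per parameter), each node is treated as one ring participant, and the cost model is the bandwidth-only term 2(N-1)/N × S, ignoring latency, protocol overhead, intra-node links, and any model or pipeline parallelism. The function names are hypothetical.

```python
def node_injection_gbytes_per_s(rails: int, link_gbits: float) -> float:
    """Nominal per-node injection bandwidth in GB/s (8 bits per byte)."""
    return rails * link_gbits / 8.0


def ring_allreduce_seconds(payload_gbytes: float,
                           participants: int,
                           bw_gbytes_per_s: float) -> float:
    """Bandwidth-only ring all-reduce estimate.

    Each participant sends (and receives) 2 * (N - 1) / N times the payload:
    one reduce-scatter pass plus one all-gather pass.
    """
    if participants < 2:
        return 0.0  # nothing to exchange with a single participant
    traffic = 2.0 * (participants - 1) / participants * payload_gbytes
    return traffic / bw_gbytes_per_s


if __name__ == "__main__":
    nodes = 8
    bw = node_injection_gbytes_per_s(rails=2, link_gbits=400.0)
    # Assumption: 100e9 parameters * 2 bytes (bf16) = 200 GB of gradients.
    payload = 100e9 * 2 / 1e9
    t = ring_allreduce_seconds(payload, nodes, bw)
    print(f"per-node injection: {bw:.0f} GB/s, idealized all-reduce: {t:.2f} s")
```

Real jobs at this scale shard gradients and overlap communication with compute, so the figure is best read as an upper-level sanity check on rail count and link speed rather than a prediction.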
OFED (OpenFabrics Enterprise Distribution) is the driver and userspace stack for InfiniBand and RoCE. NCCL (NVIDIA Collective Communications Library) is the standard library for multi-GPU collective operations such as all-reduce. libibverbs is the userspace API for RDMA programming. Diagnostic tools such as ibstat and perfquery are used for fabric management and troubleshooting. UFM (Unified Fabric Manager) provides enterprise-grade fabric monitoring, provisioning, and optimization.
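To illustrate how NCCL is typically pointed at InfiniBand adapters, here is a minimal sketch that builds a process environment using NCCL's documented environment variables (`NCCL_IB_HCA`, `NCCL_SOCKET_IFNAME`, `NCCL_IB_DISABLE`, `NCCL_DEBUG`). The helper name `nccl_ib_env`, the device names `mlx5_0`/`mlx5_1`, and the interface name `eth0` are assumptions for illustration; actual names depend on the host.

```python
import os
from typing import Dict, List


def nccl_ib_env(hcas: List[str],
                bootstrap_iface: str,
                debug: str = "INFO") -> Dict[str, str]:
    """Return a copy of the current environment with NCCL InfiniBand settings.

    hcas            -- HCA device names NCCL should use (hypothetical examples below)
    bootstrap_iface -- IP interface NCCL uses for its out-of-band bootstrap
    debug           -- NCCL log level; INFO prints which transport was selected
    """
    if not hcas:
        raise ValueError("at least one HCA name is required")
    env = dict(os.environ)
    env.update({
        "NCCL_IB_DISABLE": "0",                # keep the InfiniBand/RoCE transport enabled
        "NCCL_IB_HCA": ",".join(hcas),         # restrict NCCL to these adapters
        "NCCL_SOCKET_IFNAME": bootstrap_iface,
        "NCCL_DEBUG": debug,
    })
    return env


if __name__ == "__main__":
    env = nccl_ib_env(["mlx5_0", "mlx5_1"], "eth0")
    for key in sorted(k for k in env if k.startswith("NCCL_")):
        print(f"{key}={env[key]}")
```

The resulting dictionary can be passed as the `env` argument of `subprocess.run` when launching a training process; with `NCCL_DEBUG=INFO`, the NCCL log shows whether the InfiniBand transport was actually selected.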
Understanding the physical layer (HCA generations, link speeds) and protocol variants (IB vs. RoCE) is foundational. GPUDirect technologies enable direct data transfer between GPU memory and network/storage, bypassing the CPU and system memory for maximum throughput and minimal latency.
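For a feel of what the link generations mean in practice, the sketch below maps the nominal 4x InfiniBand link rates to GB/s and to an idealized transfer time. The rates are the commonly quoted nominal figures; achievable throughput is lower once encoding and protocol overhead are accounted for. The names `NOMINAL_4X_GBITS`, `link_gbytes_per_s`, and `transfer_seconds` are hypothetical.

```python
# Commonly quoted nominal rates for a 4x InfiniBand link, in Gb/s.
NOMINAL_4X_GBITS = {
    "SDR": 10,
    "DDR": 20,
    "QDR": 40,
    "FDR": 56,
    "EDR": 100,
    "HDR": 200,
    "NDR": 400,
}


def link_gbytes_per_s(generation: str) -> float:
    """Nominal link rate in GB/s (no encoding or protocol overhead applied)."""
    return NOMINAL_4X_GBITS[generation.upper()] / 8.0


def transfer_seconds(payload_gbytes: float, generation: str) -> float:
    """Idealized time to move a payload over one link at the nominal rate."""
    return payload_gbytes / link_gbytes_per_s(generation)


if __name__ == "__main__":
    for gen in NOMINAL_4X_GBITS:
        print(f"{gen}: {link_gbytes_per_s(gen):6.2f} GB/s, "
              f"10 GB in {transfer_seconds(10.0, gen):.2f} s")
```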