Position Overview
**Role Number:** 200619215-3760
**Summary**
Scaling machine learning workloads across thousands of GPUs and TPUs creates challenges that few engineers ever encounter. In Apple’s Machine Learning Platform Technologies organization, we build the infrastructure that powers large-scale ML training and inference workloads, bringing together expertise in distributed systems, machine learning infrastructure, and high-performance computing.
**Description**
As a performance engineer on the ML Compute Efficiency team, you’ll tackle ambiguous systems challenges, identify inefficiencies, and build solutions that maximize accelerator utilization, reduce idle and fragmented capacity, and minimize recovery periods. This includes analyzing accelerator performance, digging into various parallelism techniques, and refining workload scheduling and orchestration across the compute fleet.
**Minimum Qualifications**
+ Experience with large-scale distributed systems ...