Position Overview
NVIDIA is seeking System Administrator/DevOps Engineers to help build and operate a global Service Reliability Operations Center supporting our Hardware Infrastructure. This role focuses on ensuring scalability, resilience, and near 100% availability. As part of this team, you will collaborate with SRE, Security, and DevOps to improve reliability, reduce incident frequency and impact, and drive rapid resolution when issues occur. You will partner with development teams to implement monitoring, alerting, and observability solutions that proactively detect issues and enhance the customer experience. You will also help evaluate and select the tools and technologies used to monitor, operate, and measure the effectiveness of our production environments.
What you will be doing:
+ Operate in a 24/7 follow-the-sun support model spanning multiple continents, with direct reporting to a U.S.-based manager
+ Work a schedule of four 10-hour days per week, including either Saturday or Sunday,...