Position Overview
Troubleshoot production issues across services and levels of the stack.
Make monitoring and alerting trigger on symptoms, not on outages.
Complete Root Cause Analysis (RCA) investigations.
Contribute to the handbook, runbooks, and general documentation.
Provide critical day-to-day support for a massive-scale distributed system, including on-call duties and rotations.
Gather and analyze metrics from both operating systems and applications to assist in performance tuning and fault finding.
Tune performance across all levels of the stack: infrastructure, Kubernetes, databases, and code.
Participate in 24x7 SRE on-call rotations and escalation workflows as needed, including an occasional weekend.