Position Overview
Become a key player as an L1 Site Reliability Engineer, focusing on operational tasks across enterprise applications. Your expertise in Kubernetes, APIs, and multi-cloud environments is essential for incident management and resolution.
In this role, you will monitor systems, triage incidents, and execute operational tasks using tools such as Grafana and Datadog. Drawing on 2-5 years of experience in IT operations or DevOps, you'll support automation and improve incident response processes while keeping systems healthy and operational standards met.
Key Responsibilities:
• Monitor systems with Grafana and Datadog for anomalies
• Execute predefined runbooks for incident resolution
• Collect logs and system data for analysis
• Troubleshoot issues using kubectl and automation scripts
• Document incident resolution steps and improvements
Requirements:
• 2-5 years in IT operations or SRE roles
• Proficient in Linux and Kubernetes fundamentals
• Familiarity with AWS, Azure, or GC...