Position Overview: As the SRE Lead, you will oversee the reliability and performance of our systems, ensuring they meet the high standards our customers expect. You will lead a team of skilled engineers, guiding them in implementing best practices for reliability, automation, and operational excellence. This role requires a blend of technical expertise, leadership skills, and a strong commitment to continuous improvement.
Key Responsibilities:
Team Leadership: Manage and mentor a team of Site Reliability Engineers, fostering a culture of collaboration, innovation, and accountability.
System Reliability: Drive initiatives to improve system reliability, availability, scalability, and performance.
Incident Management: Lead the response to critical incidents, coordinating efforts across teams to ensure swift resolution and minimal customer impact.
Automation: Champion automation efforts to streamline operational processes, reduce manual intervention, and increase efficiency.
Monitoring and Alerting: Establish and maintain robust monitoring and alerting systems to proactively identify issues and prevent service disruptions.
Capacity Planning: Collaborate with cross-functional teams to forecast capacity requirements and optimize resource utilization.
Continuous Improvement: Promote a culture of continuous improvement through regular retrospectives, post-incident reviews, and knowledge-sharing sessions.
Documentation: Ensure comprehensive documentation of systems, processes, and procedures to facilitate knowledge transfer and training.
Qualifications:
Technical Expertise: Strong background in Linux/Unix systems administration, networking, and cloud infrastructure (AWS, GCP, Azure).
Leadership Skills: Proven experience leading and developing high-performing engineering teams.
Problem Solving: Ability to troubleshoot complex issues, prioritize tasks, and make data-driven decisions under pressure.
Automation Tools: Proficiency in automation tools and configuration management (e.g., Terraform, Ansible, Chef, Puppet).
Monitoring and Logging: Experience with monitoring tools (e.g., Prometheus, Grafana, ELK stack) and log management solutions.
CI/CD: Familiarity with CI/CD pipelines and practices.
Communication: Excellent communication skills with the ability to articulate technical concepts to non-technical stakeholders.
Education and Experience:
Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent practical experience).
10+ years of experience in a Site Reliability Engineering or similar role.
5+ years of experience in a leadership or managerial position.