What You’ll Do
* Automation First: Identify repetitive manual work and design automation frameworks, self-service tooling, and auto-healing systems.
* Observability & Monitoring: Build end-to-end monitoring, logging, and alerting systems to ensure visibility and proactive issue resolution.
* Incident Response: Lead complex incident troubleshooting, root cause analysis, and drive blameless postmortems.
* CI/CD & Infrastructure: Enhance CI/CD pipelines and use Infrastructure as Code (IaC) to provision, configure, and manage cloud resources.
* Collaboration: Partner with dev teams to embed reliability into design and development, not just after deployment.
* Innovation: Continuously evaluate emerging tools and technologies, keeping the stack modern and efficient.
* On-Call: Participate in the on-call rotation and improve processes to minimize human intervention.
What We’re Looking For
* 6–9 years of hands-on experience as a Site Reliability Engineer (SRE).
* Strong expertise in at least one major cloud platform (AWS, Azure, or GCP).
* Deep knowledge of Linux/Unix systems, networking, and distributed systems.
* Proficiency in programming/scripting (Python, Go, or similar).
* Advanced skills with containers and orchestration (Docker, Kubernetes at scale).
* Proven experience with CI/CD pipelines and Infrastructure as Code (Terraform, Ansible, Helm, etc.).
* Expertise with observability platforms (Prometheus, Grafana, ELK, Datadog, Splunk).
* Strong background in incident management, disaster recovery, and capacity planning.
* Familiarity with SRE practices (SLIs, SLOs, error budgets, blameless postmortems).
* Excellent problem-solving, debugging, and performance optimization skills.