Key Responsibilities:
Lead Platform Reliability Initiatives: Design and optimize multi-region, highly available cloud architectures using services like container orchestration, compute instances, managed databases, and object storage to meet availability SLOs of 99.99% or better, tracked through SLIs and error budgets.
Drive Automation and IaC: Build and maintain Infrastructure as Code (IaC) pipelines with tools like CDK, Terraform, or CloudFormation; automate deployments via CI/CD tools and serverless functions to accelerate delivery while minimizing operational overhead.
Reliability, Availability & Resilience: Establish, track, and enforce SLIs, SLOs, and error budgets. Ensure system availability, latency, and throughput meet targets. Build strategies for redundancy, high availability, multi-AZ and multi-region failover, backups, and disaster recovery.
Enhance Observability and Monitoring: Implement comprehensive monitoring stacks with cloud-native metrics, open-source monitoring, and visualization tools; define alerting thresholds, conduct root cause analyses (RCAs), and optimize performance for distributed systems including message brokers, caching layers, and relational databases.
Champion Security and Compliance: Enforce cloud best practices for identity and access management, encryption, networking, and policy-as-code with tools like OPA; integrate security into CI/CD pipelines to protect sensitive data in regulated environments.
Innovate on Scalability: Evaluate and implement advanced cloud features like serverless architectures, service meshes, and autoscaling solutions to support growing user demands and reduce latency.
Operational Excellence: Participate in and lead incident response for production issues, and continuously improve processes to balance feature velocity with system reliability.
Cost & Performance: Monitor and optimize cloud spend and resource usage through rightsizing, discount strategies, and waste elimination.
Mentor and Influence: Guide junior engineers through design reviews, incident post-mortems, and adoption of SRE practices; collaborate with stakeholders to shape cloud strategy, cost optimization, and capacity planning for enterprise-scale workloads.
Educational Qualification:
Bachelor's degree or equivalent in Computer Science or another STEM major (Science, Technology, Engineering, and Math).
Technical Skills:
15+ years in software engineering, site reliability engineering, or cloud platform roles, with significant exposure to AWS production systems.
Deep hands-on expertise with core cloud services including container orchestration, compute, databases, storage, monitoring, identity management, serverless, and networking.
Expert-level skill in Infrastructure as Code: Terraform, CloudFormation, AWS CDK, or similar.
Proficiency in programming languages like Python, Go, or Java for automation, scripting, and building tools.
Deep understanding of observability tooling: metrics, logging, distributed tracing, and alerting (e.g. CloudWatch, Prometheus, Grafana, ELK).
Strong experience with incident management: debugging, performance tuning, root cause analysis.
Proven track record of cost optimization in cloud environments.
Security mindset: knowledge of AWS security services, governance, and compliance standards.
Proven track record in implementing SRE practices: SLIs/SLOs, error budgets, monitoring/alerting, and incident management.
Strong communication and collaboration abilities to influence without authority and translate technical concepts for non-technical stakeholders.