Site Reliability Engineer (7+ years)
Adobe | 4 days ago | Bangalore

Key Responsibilities:

  • System Design & Architecture

Design, build, and maintain scalable, highly available infrastructure and services for the Adobe Pass platform.

Collaborate with engineering teams to ensure new products and features are designed with reliability and scalability in mind.

Create resilient architectures that prevent downtime and enhance service reliability through redundancy, failover strategies, and automated recovery mechanisms.

  • Automation & Tooling

Develop automation frameworks for continuous integration/continuous deployment (CI/CD) pipelines, infrastructure provisioning, and operational tasks.

Build tools to monitor system performance, reliability, and capacity, reducing manual interventions and operational overhead.

Drive initiatives for end-to-end automation, optimizing for efficiency and reducing human error.

  • Monitoring & Incident Management

Implement and maintain robust monitoring systems that detect anomalies and provide real-time alerting on key system metrics (latency, availability, etc.).

Lead incident management processes, including troubleshooting, root cause analysis, and post-mortem reviews to prevent recurrence.

Collaborate with support and engineering teams to develop strategies for minimizing incidents and reducing mean time to recovery (MTTR).

  • Performance Optimization & Capacity Planning

Analyze system performance and make recommendations for improvement, focusing on latency reduction, increased throughput, and cost efficiency.

Conduct capacity planning to ensure the infrastructure can scale efficiently to meet the growing demands of Adobe’s advertising platform.

Perform load testing and simulate peak traffic scenarios to identify potential bottlenecks.

  • Collaboration & Knowledge Sharing

Partner with software engineers, product managers, and other stakeholders to understand business requirements and ensure system reliability meets the platform’s needs.

Document best practices, system designs, and incident response procedures, promoting knowledge sharing within the team.

Mentor and provide technical leadership to junior engineers, fostering a culture of continuous learning and improvement.

Qualifications:

  • Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.
  • 7+ years of experience in site reliability engineering, infrastructure engineering, or a similar role.
  • Proven experience in managing large-scale distributed systems, preferably in cloud environments such as AWS, Azure, or GCP.
  • Strong programming and scripting skills (e.g., Python, Go, Bash) with a focus on automation.
  • Deep understanding of containerization and orchestration technologies (Docker, Kubernetes, etc.).
  • Expertise in monitoring tools (Prometheus, Grafana, Datadog) and incident management practices.
  • Experience with CI/CD pipelines, infrastructure as code (Terraform, CloudFormation), and version control (Git).
  • Solid knowledge of networking, storage, and database systems, both relational and NoSQL.
  • Excellent problem-solving, troubleshooting, and analytical skills.