Site Reliability Engineer (7+ years of experience)

Adobe | 95 days ago | Bangalore

Key Responsibilities:

System Design & Architecture

Design, build, and maintain scalable, highly available infrastructure and services for the Adobe Pass platform.

Collaborate with engineering teams to ensure new products and features are designed with reliability and scalability in mind.

Create resilient architectures that prevent downtime and enhance service reliability through redundancy, failover strategies, and automated recovery mechanisms.

Automation & Tooling

Develop automation frameworks for continuous integration/continuous deployment (CI/CD) pipelines, infrastructure provisioning, and operational tasks.

Build tools to monitor system performance, reliability, and capacity, reducing manual interventions and operational overhead.

Drive initiatives for end-to-end automation, optimizing for efficiency and reducing human error.

Monitoring & Incident Management

Implement and maintain robust monitoring systems that detect anomalies and provide real-time alerting on key system metrics (latency, availability, etc.).

Lead incident management processes, including troubleshooting, root cause analysis, and post-mortem reviews, to prevent recurrence.

Collaborate with support and engineering teams to develop strategies for minimizing incidents and reducing mean time to recovery (MTTR).

Performance Optimization & Capacity Planning

Analyze system performance and make recommendations for improvement, focusing on latency reduction, increased throughput, and cost efficiency.

Conduct capacity planning to ensure the infrastructure can scale efficiently to meet the growing demands of Adobe’s advertising platform.

Perform load testing and simulate peak traffic scenarios to identify potential bottlenecks.

Collaboration & Knowledge Sharing

Partner with software engineers, product managers, and other stakeholders to understand business requirements and ensure system reliability meets the platform’s needs.

Document best practices, system designs, and incident response procedures, promoting knowledge sharing within the team.

Mentor and provide technical leadership to junior engineers, fostering a culture of continuous learning and improvement.

Qualifications:

Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.
7+ years of experience in site reliability engineering, infrastructure engineering, or a similar role.
Proven experience in managing large-scale distributed systems, preferably in cloud environments such as AWS, Azure, or GCP.
Strong programming and scripting skills (e.g., Python, Go, Bash) with a focus on automation.
Deep understanding of containerization and orchestration technologies (Docker, Kubernetes, etc.).
Expertise in monitoring tools (Prometheus, Grafana, Datadog) and incident management practices.
Experience with CI/CD pipelines, infrastructure as code (Terraform, CloudFormation), and version control (Git).
Solid knowledge of networking, storage, and database systems, both relational and NoSQL.
Excellent problem-solving, troubleshooting, and analytical skills.
