Site Reliability Engineer (5+)

GlobalLogic | 191 days ago | Bangalore

Required Skills & Qualifications

8+ years of experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering.
Strong programming and scripting skills in Python, Go, Bash (or similar), with a focus on automation and tooling.
Expertise in CI/CD pipelines (Jenkins or similar) and infrastructure-as-code (Terraform, CloudFormation).
Hands-on experience with AWS services (EC2, RDS, S3, VPC, IAM, CloudWatch, etc.) for infrastructure design and operations.
Proficiency in Prometheus (or other monitoring/alerting systems) and incident management practices.
Solid understanding of system design, distributed systems, and large-scale architecture.
Strong background in capacity planning, performance tuning, and load testing.
Excellent problem-solving, communication, and collaboration skills.

Key Responsibilities

System Design & Architecture
Design, build, and maintain scalable, resilient, and highly available infrastructure and services for our advertising platform.
Collaborate with engineering teams to ensure new products and features are built with reliability, scalability, and performance in mind.
Implement redundancy, failover strategies, and automated recovery mechanisms to minimize downtime and enhance service reliability.
Leverage AWS services (e.g., EC2, RDS, S3, Lambda, VPC, IAM) to design and optimize infrastructure.
Automation & Tooling
Develop automation frameworks and tools to improve CI/CD pipelines, infrastructure provisioning, and operational workflows.
Leverage strong programming and scripting skills (Python, Go, Bash) to build scalable automation solutions, reducing manual intervention.
Drive initiatives for end-to-end automation, optimizing efficiency and reducing human error.
Monitoring & Incident Management
Implement and maintain robust monitoring systems (e.g., Prometheus, Grafana) with real-time alerting on key system metrics (latency, availability, etc.).
Lead incident response, troubleshooting, and root cause analysis, ensuring learnings are captured through post-mortem reviews.
Collaborate with support and engineering teams to reduce MTTR (Mean Time to Recovery) and prevent recurring issues.
Performance Optimization & Capacity Planning
Analyze system performance and recommend improvements for latency, throughput, and cost optimization.
Conduct capacity planning and load testing to ensure infrastructure can handle growth and peak traffic demands.
Identify and eliminate bottlenecks to improve reliability and efficiency.
Collaboration & Knowledge Sharing
Work closely with engineers, product managers, and stakeholders to align system reliability with business goals.
Document best practices, system designs, and incident response procedures to improve team efficiency and knowledge sharing.
Mentor and provide technical guidance to junior engineers, promoting a culture of continuous learning and improvement.
