Site Reliability Engineer (3+)

IBM | 607 days ago | Hyderabad

Your Role and Responsibilities
Looking for a candidate with 3+ years of experience in the following areas.

Monitoring the health of the IKS control plane and ensuring reliable operations
Responding promptly to production issues and alerts
Executing changes in the production environment through advanced automation
Partnering with other SRE teams and program managers to deliver mission-critical services
Supporting the development and enhancement of Platform-as-a-Service services
Implementing and automating solutions that support IBM Cloud products
Ensuring compliance and security integrity of the environment
Collaborating with Engineering to troubleshoot and resolve production issues
Providing technical escalation support for other Infrastructure Operations teams

Required Technical and Professional Expertise

Expertise in Kubernetes architecture, including the latest features and security aspects
Strong debugging skills in Kubernetes environments.
Strong experience in programming with Python or Go, with demonstrated ability to develop and maintain complex codebases.
Proficiency in network configuration and advanced monitoring solutions such as Prometheus, Sysdig, and Grafana
Experience in hands-on administration of cloud infrastructure, particularly Kubernetes-based platforms.
Skills in performance tuning and optimization of Kubernetes clusters, including resource quota management, scaling, and efficient use of underlying infrastructure.
Understanding of network protocols (TCP/IP, HTTP, etc.) and network configuration tools (e.g., CNI) specific to Kubernetes environments.
Deep understanding of Kubernetes security practices, including network policies, security contexts, role-based access control (RBAC), and the secure handling of secrets.
Knowledge of automation and configuration management tools: Ansible, Salt, Chef, Terraform
Strong Linux skills for managing services across a microservices platform
Ability to implement robust incident management strategies and frameworks
Experience in performance optimization of Kubernetes clusters
Understanding of disaster recovery planning and high availability setups in Kubernetes environments
Excellent written and verbal communication skills, with a willingness to take on call-out responsibilities
Experience establishing and improving procedures within a mission-critical environment

Preferred Technical and Professional Expertise

Hands-on experience with at least one cloud platform (IKS, AWS, Azure, GCP) and with integrating cloud services for storage, security, and databases
Knowledge of Slack bot automations for infra/cloud maintenance and SRE-based automations
Active participation in Kubernetes communities and forums
Vendor management skills to ensure optimal service levels and cost control
Ability to mentor and train teams on Kubernetes best practices and operational strategies
