Senior Site Reliability Engineer (10+)
SambaNova | 1 day ago | Bengaluru

This individual will:

  • Assume full-stack ownership for the successful delivery of our SambaNova services in a hybrid model, including, but not limited to, deployment, configuration, integrations, observability, and ongoing operations
  • Develop a deep understanding of the end-to-end configurations, dependencies, customer requirements, and overall characteristics of the production services as the accountable owner for overall service operations
  • Administer systems and applications for multiple customer-facing production environments (hosted and on-premises), with a continued focus on improving efficiency, availability, and supportability through automation and well-defined runbooks
  • Partner and collaborate with product and engineering teams to recommend and implement improvements to the security, resilience, and operational readiness of our systems, with the flexibility to integrate into unique customer environments
  • Augment ongoing efforts to design and develop automation for deployments, updates, and upgrades of the entire SambaNova software stack
  • Lead efforts to triage, debug, and fix issues related to networks, storage, operating systems, containers, and applications to drive proactive and reactive incident resolution and root cause analysis
  • Build the systems and tools for centralized command and control of distributed environments
  • Participate in the on-call rotation

Basic qualifications 

  • Bachelor's and/or Master's degree in CS or a related field
  • 10+ years of hands-on experience in SRE / production engineering roles with a focus on supporting, scaling, and ensuring the reliability of large-scale production services and infrastructure
  • Extensive experience in deploying, securing, managing, and operating Linux systems in globally distributed production environments
  • Good knowledge of containers with hands-on experience in deploying, managing, and troubleshooting Kubernetes clusters and components in private data centers as well as public cloud
  • Proficient with at least one modern programming language (Python preferred) and the willingness to learn new languages as required
  • A systematic problem-solving approach to troubleshooting and the desire to solve the root cause of common problems in 24x7 environments