As a Senior Cloud Ops Engineer, you will play a crucial role in managing and maintaining our cloud and on-premises environments, ensuring reliability, security, performance, and scalability for our clients. You will work closely with cross-functional teams to deploy, monitor, and optimize our containerized applications and services, all while adhering to best practices and industry standards.
Key Responsibilities:
Azure AKS Management:
- Provision, configure, and manage Azure AKS clusters to support our containerized applications.
- Monitor cluster performance, scaling, and health, and adjust as needed.
- Administer networking, MFA, SSO, storage, NSGs, Traffic Manager, ACR, etc.
- Knowledge of Gloo Mesh, Istio, Oracle Cloud, and AWS is a big plus.
- Perform upgrades and patching.
- Experience in cloud migration.
OpenShift Cloud Management:
- Experience with the OpenShift Container Platform.
On-premises/Hybrid:
- Experience with virtualization technologies such as VMware or Hyper-V.
- Provide hands-on support for the deployment, configuration, and maintenance of on-premises infrastructure, including servers, networking equipment and storage systems.
- Experience with network micro-segmentation, VPN tunnels, AD DS, DMZs, and firewalls.
- Linux/UNIX administration.
- Maintenance and patching.
Infrastructure as Code (IaC):
- Implement and maintain infrastructure as code (IaC) using tools such as Pulumi, Terraform, or ARM templates to automate AKS provisioning and configuration.
- Knowledge of scripting languages such as TypeScript, Python, Bash, or PowerShell for automation tasks.
Container Orchestration:
- Manage container orchestration using Kubernetes, including pod deployment, scaling, and network configurations.
- Experience in templating and deploying Helm charts.
- Troubleshoot and resolve issues related to containers and container orchestration.
- Experience working with MongoDB, Redis, TimescaleDB, InfluxDB, etc.
Monitoring and Logging:
- Implement monitoring and alerting solutions for AKS clusters using Azure Monitor, Prometheus, Grafana, Tempo, or similar tools.
- Configure and maintain centralized log management systems to ensure visibility into application performance.
- Experience with any SIEM tool (Azure Sentinel, Splunk, etc.).
- Participate in the on-call rotation and provide timely resolution of production incidents.
Security:
- Implement security best practices and policies for AKS, including network policies, RBAC, and container image scanning.
- Ensure compliance with industry standards and best security practices.
CI/CD Integration:
- Integrate Azure DevOps CI/CD pipelines for automated application deployment and updates.
- Experience with Argo CD or Flux.
- Experience with GitOps.
Backup and Disaster Recovery:
- Develop and maintain backup and disaster recovery plans for AKS clusters to ensure data integrity and high availability.
- Familiarity with DR concepts, including regional pairing.
Performance Optimization:
- Continuously optimize AKS resources to ensure efficient resource utilization, cost-effectiveness, and high-performance applications.
Documentation:
- Create and maintain detailed documentation of AKS configurations, processes, and best practices.