Required Skills & Qualifications
8+ years of experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering.
Strong programming and scripting skills in Python, Go, Bash (or similar), with a focus on automation and tooling.
Expertise in CI/CD pipelines (Jenkins or similar) and infrastructure-as-code (Terraform, CloudFormation).
Hands-on experience with AWS services (EC2, RDS, S3, VPC, IAM, CloudWatch, etc.) for infrastructure design and operations.
Proficiency in Prometheus (or other monitoring/alerting systems) and incident management practices.
Solid understanding of system design, distributed systems, and large-scale architecture.
Strong background in capacity planning, performance tuning, and load testing.
Excellent problem-solving, communication, and collaboration skills.
Key Responsibilities
System Design & Architecture
Design, build, and maintain scalable, resilient, and highly available infrastructure and services for our advertising platform.
Collaborate with engineering teams to ensure new products and features are built with reliability, scalability, and performance in mind.
Implement redundancy, failover strategies, and automated recovery mechanisms to minimize downtime and enhance service reliability.
Leverage AWS services (e.g., EC2, RDS, S3, Lambda, VPC, IAM) to design and optimize infrastructure.
Automation & Tooling
Develop automation frameworks and tools to improve CI/CD pipelines, infrastructure provisioning, and operational workflows.
Leverage strong programming and scripting skills (Python, Go, Bash) to build scalable automation solutions, reducing manual intervention.
Drive initiatives for end-to-end automation, optimizing efficiency and reducing human error.
Monitoring & Incident Management
Implement and maintain robust monitoring systems (e.g., Prometheus, Grafana) with real-time alerting on key system metrics (latency, availability, etc.).
Lead incident response, troubleshooting, and root cause analysis, ensuring learnings are captured through post-mortem reviews.
Collaborate with support and engineering teams to reduce MTTR (Mean Time to Recovery) and prevent recurring issues.
Performance Optimization & Capacity Planning
Analyze system performance and recommend improvements for latency, throughput, and cost optimization.
Conduct capacity planning and load testing to ensure infrastructure can handle growth and peak traffic demands.
Identify and eliminate bottlenecks to improve reliability and efficiency.
Collaboration & Knowledge Sharing
Work closely with engineers, product managers, and stakeholders to align system reliability with business goals.
Document best practices, system designs, and incident response procedures to improve team efficiency and knowledge sharing.
Mentor and provide technical guidance to junior engineers, promoting a culture of continuous learning and improvement.