Network Service Reliability Advisor (3+)
arm | 11 days ago | Bengaluru

Responsibilities:

  • Lead advanced troubleshooting and resolution of network incidents spanning LAN, WAN, VPN, SD-WAN, data centers, and cloud networks (AWS, Azure, GCP).
  • Drive adoption and integration of AIOps tools (e.g., Dynatrace, LogicMonitor) to enable proactive anomaly detection, alert correlation, and incident automation.
  • Work with engineering and platform teams to expand observability coverage, tune alerting thresholds, and onboard new network services to SRC monitoring.
  • Perform deep-dive root cause analyses (RCAs), lead incident reviews, and implement preventive actions to improve service resilience.
  • Design and build dashboards, reliability reports, and KPIs (MTTR, latency, packet loss, availability) to improve visibility and decision-making.
  • Contribute to network automation initiatives using tools like Ansible and Terraform; develop and maintain intelligent playbooks for remediation workflows.
  • Tune and optimize AI/ML models used in telemetry analysis and predictive incident detection.
  • Work a shift pattern within a 24/7/365 operating model, working independently and flexibly in response to emergencies or critical issues.

Preferred Qualifications:

  • Certifications such as Cisco CCNA/CCNP, CompTIA Network+, or equivalent.
  • Cisco DevNet certification would be highly advantageous.
  • Hands-on experience with network technologies and protocols (TCP/IP, BGP, OSPF, DNS, DHCP, SD-WAN).
  • Experience with public cloud networking (AWS, Azure, GCP).
  • Familiarity with ITIL and SRE principles (SLI/SLOs, error budgets, incident command).
  • Experience integrating AIOps tools with ITSM systems (e.g., ServiceNow, Jira Service Management).
  • Exposure to automation/orchestration tools (Ansible and Terraform).

Required Skills and Experience:

  • 3–6 years of hands-on experience in Platform Operations or Infrastructure Support roles.
  • Solid experience with observability tools (e.g., Dynatrace, LogicMonitor, Datadog, Splunk) for real-time monitoring, alerting, and diagnostics.
  • Proficiency in a scripting or programming language (e.g., Python, Java, .NET, Node.js, JavaScript, or Ansible).
  • Practical knowledge of infrastructure automation using Ansible, including writing playbooks.
  • Proficient in ticket management via an ITSM platform such as ServiceNow.
  • Experience leading incident response, driving service restoration and coordinating root cause analysis.
  • Effective communicator within a team with a proactive approach and personal accountability for outcomes.
  • Ability to analyze incident patterns and metrics to proactively recommend reliability improvements.