Lead advanced troubleshooting and resolution of network incidents spanning LAN, WAN, VPN, SD-WAN, data centers, and cloud networks (AWS, Azure, GCP).
Drive adoption and integration of AI Ops tools (e.g., Dynatrace, LogicMonitor) to enable proactive anomaly detection, alert correlation, and incident automation.
Work with engineering and platform teams to expand observability coverage, tune alerting thresholds, and onboard new network services to SRC monitoring.
Perform deep-dive root cause analyses (RCAs), lead incident reviews, and implement preventive actions to improve service resilience.
Design and build dashboards, reliability reports, and KPIs (MTTR, latency, packet loss, availability) to improve visibility and decision-making.
Contribute to network automation initiatives using tools like Ansible and Terraform; develop and maintain intelligent playbooks for remediation workflows.
Tune and optimize AI/ML models used in telemetry analysis and predictive incident detection.
Work a shift pattern within a 24/7/365 operating model, and work independently and flexibly in response to emergencies or critical issues.
Certifications such as Cisco CCNA/CCNP, CompTIA Network+, or equivalent.
Cisco DevNet certification would also be highly advantageous.
Hands-on experience with network technologies and protocols (TCP/IP, BGP, OSPF, DNS, DHCP, SD-WAN).
Experience with public cloud networking (AWS, Azure, GCP).
Familiarity with ITIL and SRE principles (SLIs/SLOs, error budgets, incident command).
Experience integrating AI Ops tools with ITSM systems (e.g., ServiceNow, Jira Service Management).
Exposure to automation/orchestration tools (Ansible and Terraform).
Required Skills and Experience:
3–6 years of hands-on experience in Platform Operations or Infrastructure Support roles.
Solid experience with observability tools (e.g., Dynatrace, LogicMonitor, Datadog, Splunk) for real-time monitoring, alerting, and diagnostics.
Proficiency in a scripting or programming language or framework (e.g., Python, Java, .NET, Node.js, Ansible, or JavaScript).
Practical knowledge of infrastructure automation using Ansible, including writing playbooks.
Proficient in ticket management via an ITSM platform such as ServiceNow.
Experience leading incident response, driving service restoration, and coordinating root cause analysis.
Effective communicator within a team, with a proactive approach and personal accountability for outcomes.
Ability to analyze incident patterns and metrics to proactively recommend reliability improvements.