SRE Operations Engineer

net2source (n2s) • Canada
Remote

Job Description


Title: SRE Operations Engineer (Canada)

Location: 100% Remote


Role Summary

  • L1 Site Reliability Engineer responsible for monitoring, triaging, and executing standard operational tasks across enterprise applications
  • Supports Kubernetes, APIs, WAF, databases, API gateways (Gloo, Apigee), Kafka, and multi-cloud environments (AWS/Azure/GCP)
  • First line of defense for incident detection, troubleshooting, and escalation using runbooks and automation


Key Responsibilities

  • Monitoring & Infrastructure
      • Monitor systems using Grafana, Datadog, Splunk, Prometheus, and AIOps tools
      • Detect anomalies and follow alert workflows for resolution or escalation
      • Validate Kubernetes issues using monitoring dashboards and logs
  • Runbook Execution
      • Follow predefined runbooks for incident resolution
      • Restart services, validate system health, and escalate when procedures fail
      • Ensure adherence to operational standards
  • Incident Triage & Communication
      • Perform initial incident triage and severity classification
      • Collect logs, metrics, and system data for analysis
      • Communicate clearly with stakeholders and escalation teams
  • Kubernetes Operations
      • Use kubectl to inspect pods, deployments, and services
      • Validate service health and troubleshoot cluster-level issues
  • Scripting & Automation
      • Read and modify scripts in Python, Bash, or PowerShell
      • Support automation of repetitive operational tasks
  • Networking & Security Troubleshooting
      • Use tools like ping, curl, netstat, and traceroute
      • Identify DNS, firewall, WAF, or proxy-related issues
  • Documentation & Knowledge Management
      • Document incident resolution steps and system issues
      • Identify gaps in runbooks and suggest improvements


Preferred Skills

  • Familiarity with AWS, Azure, or GCP cloud platforms
  • Basic SQL/NoSQL knowledge (e.g., simple query validation like SELECT 1)
  • Experience with ITSM tools such as ServiceNow, Jira, or xMatters
  • Exposure to observability tools (ELK, Prometheus, Grafana, Splunk)
  • Understanding of AI-assisted operational support tools
  • Strong automation mindset and process optimization awareness


Qualifications

  • 2–5 years (or more) in IT operations, NOC, or SRE/DevOps roles
  • Strong understanding of Linux, networking, and Kubernetes fundamentals
  • Knowledge of cloud-ready applications and observability tools
  • Strong troubleshooting skills using structured methods (5 Whys, Fishbone analysis)


Deliverables

  • Continuous monitoring of infrastructure, applications, dashboards, and logs
  • Execution of standardized runbooks for incidents and routine tasks
  • First-level incident triage and escalation to L2/L3 teams
  • Documentation of incidents, gaps, and automation opportunities
  • Clear communication during operational incidents
  • Support onboarding of applications into the operations framework

