SRE Operations Engineer responsible for monitoring, triaging, and executing standard operational tasks across enterprise applications. Supports Kubernetes, APIs, WAF, databases, API gateways, Kafka, and multi-cloud environments. First line of defense for incident detection, troubleshooting, and escalation using runbooks and automation.
Job Description
Title: SRE Operations Engineer (Canada)
Location: 100% Remote
Role Summary
- L1 Site Reliability Engineer responsible for monitoring, triaging, and executing standard operational tasks across enterprise applications
- Supports Kubernetes, APIs, WAF, databases, API gateways (Gloo, Apigee), Kafka, and multi-cloud environments (AWS/Azure/GCP)
- First line of defense for incident detection, troubleshooting, and escalation using runbooks and automation
Key Responsibilities
- Monitoring & Infrastructure
  - Monitor systems using Grafana, Datadog, Splunk, Prometheus, and AIOps tools
  - Detect anomalies and follow alert workflows for resolution or escalation
  - Validate Kubernetes issues using monitoring dashboards and logs
- Runbook Execution
  - Follow predefined runbooks for incident resolution
  - Restart services, validate system health, and escalate when procedures fail
  - Ensure adherence to operational standards
- Incident Triage & Communication
  - Perform initial incident triage and severity classification
  - Collect logs, metrics, and system data for analysis
  - Communicate clearly with stakeholders and escalation teams
- Kubernetes Operations
  - Use kubectl to inspect pods, deployments, and services
  - Validate service health and troubleshoot cluster-level issues
- Scripting & Automation
  - Read and modify scripts in Python, Bash, or PowerShell
  - Support automation of repetitive operational tasks
- Networking & Security Troubleshooting
  - Use tools like ping, curl, netstat, and traceroute
  - Identify DNS, firewall, WAF, or proxy-related issues
- Documentation & Knowledge Management
  - Document incident resolution steps and system issues
  - Identify gaps in runbooks and suggest improvements
Preferred Skills
- Familiarity with AWS, Azure, or GCP cloud platforms
- Basic SQL/NoSQL knowledge (e.g., simple query validation like SELECT 1)
- Experience with ITSM tools such as ServiceNow, Jira, or xMatters
- Exposure to observability tools (ELK, Prometheus, Grafana, Splunk)
- Understanding of AI-assisted operational support tools
- Strong automation mindset and process optimization awareness
Qualifications
- 2–5 years (or more) of experience in IT operations, NOC, or SRE/DevOps roles
- Strong understanding of Linux, networking, and Kubernetes fundamentals
- Knowledge of cloud-ready applications and observability tools
- Strong troubleshooting skills using structured methods (5 Whys, Fishbone analysis)
Deliverables
- Continuous monitoring of infrastructure, applications, dashboards, and logs
- Execution of standardized runbooks for incidents and routine tasks
- First-level incident triage and escalation to L2/L3 teams
- Documentation of incidents, gaps, and automation opportunities
- Clear communication during operational incidents
- Support onboarding of applications into the operations framework