We are seeking a Senior Embedded Site Reliability Engineer to improve observability, reliability, and user experience monitoring for our ecommerce and mobile platforms. This role involves collaborating with development teams, designing monitoring strategies, and ensuring critical customer journeys are reliable and performant. The ideal candidate will have strong hands-on experience with Kubernetes-based environments and observability solutions.
Job Description
Type: Contract (12+ months)
Location: Fully Remote (PST core hours)
Benefits: Standard iMatch or People2.0 benefits available if needed
About the Role:
XXXX is evolving its Site Reliability Engineering practice to better support customer-facing digital experiences across our ecommerce and mobile platforms. We are seeking an embedded Site Reliability Engineer (SRE) who will work closely with assigned development teams to improve observability, reliability, and user experience monitoring.
This role is ideal for an SRE who enjoys collaborating directly with developers, shaping monitoring strategies, and ensuring that critical customer journeys—such as flight search, booking, and checkout—are reliable and performant.
Key Responsibilities:
Embedded SRE & Team Collaboration
- Act as an embedded SRE for one or two product teams
- Partner with developers to define:
  - Service-level indicators (SLIs) and objectives (SLOs)
  - Alerting strategies that reduce noise and highlight real risk
  - Rollback and deployment safety strategies
- Drive consistency in observability and reliability practices across teams
Observability & User-Focused Monitoring
- Design and implement use-case and user-journey monitoring, focusing on real customer behavior and business impact
- Configure and maintain observability dashboards that reflect:
  - End-to-end user flows
  - Error rates, latency, and drop-off points
- Leverage Quantum Metric to analyze customer sessions, identify experience issues, and help teams take action
- Build and standardize dashboards using Grafana
Platform & Reliability Engineering
- Support and improve reliability for services running on Kubernetes and cloud-native platforms
- Assist teams in identifying failure modes and resilience gaps
- Participate in incident response, root cause analysis, and post-incident reviews with a focus on prevention and learning
Qualifications:
Required Skills
- Proven experience as a Site Reliability Engineer or similar role
- Strong hands-on experience with Kubernetes-based environments
- Experience implementing observability solutions beyond basic infrastructure monitoring
- Ability to collaborate closely with development teams and influence reliability practices
- Strong troubleshooting and systems-thinking skills
Preferred Skills
- Experience with user-focused observability or digital experience monitoring
- Hands-on experience with Quantum Metric or similar tools
- Familiarity with Grafana for building standardized dashboards
- Prior experience in:
  - Airline industry
  - Travel, ecommerce, or other high-traffic consumer platforms
- Understanding of customer-critical workflows (payments, bookings, transactions)