Join our growing SRE team as a Site Reliability Engineer to ensure the reliability, performance, scalability, and availability of enterprise platforms. Focus on observability, incident response, automation, and reliability engineering. Collaborate with development and cloud teams to build resilient systems.
Job Description
Job title: Site Reliability Engineer
Job type: Contract
Contract Length: 6-12 months
Rate: Competitive
Role Location: Remote (must be based in the United States)
The company:
We’re partnering with an innovative, technology-driven organization known for its smart use of proprietary technology in the financial services industry. The business builds platforms that genuinely make a difference, helping clients streamline operations, reduce risk, and operate more efficiently at scale. You’ll be joining a team that values accuracy, innovation, and collaboration, where technology is seen as a true enabler rather than just a support function.
Role:
We are seeking a Site Reliability Engineer (SRE) to join our growing SRE team, reporting to the SRE Lead. This role is critical to ensuring the reliability, performance, scalability, and availability of LERETA’s enterprise platforms.
You will focus on observability, incident response, automation, and reliability engineering, working closely with development and cloud teams to build resilient systems. This is a hands-on role for engineers who believe in eliminating toil through automation and continuously improving system reliability.
Responsibilities:
- Ensure the reliability, availability, performance, and scalability of LERETA’s enterprise platforms by monitoring system health, defining and tracking SLIs/SLOs, and proactively identifying risks
- Own and continuously improve the observability stack (Datadog, Application Insights), including dashboards, alerting, tracing, and metrics
- Participate in a 24/7 on-call rotation, responding to P1/P2 incidents within defined SLOs, acting as Incident Commander or SME, and coordinating resolution across engineering teams
- Lead incident management, root cause analysis, and blameless post-incident reviews, driving measurable improvements from learnings
- Reduce operational toil through automation, building self-healing systems, remediation tooling, runbooks, and automated scaling and recovery processes
- Improve CI/CD pipeline reliability, deployment safety, rollback strategies, and release processes
- Perform capacity planning, load testing, performance tuning, and scaling activities
- Design, test, and maintain disaster recovery and resilience strategies, including chaos engineering experiments and Game Day exercises
- Influence system architecture by partnering with development and cloud teams, participating in architecture reviews, and advocating for reliability best practices
- Document operational standards, mentor teams on reliability practices, and promote a blameless, continuous-improvement culture
Job Requirements:
- Strong Linux and Windows administration and troubleshooting
- Python, PowerShell, Bash, or Go
- Datadog, Prometheus, Grafana
- Kubernetes (AKS), Docker, container troubleshooting
- Azure DevOps Pipelines (preferred); GitOps (FluxCD or ArgoCD)
- Azure (App Services, AKS, Functions, Storage, Key Vault, Cosmos DB)
- Application and infrastructure tuning
- Proven incident response experience
If you’re an experienced Site Reliability Engineer looking for your next fully remote contract, we’d love to hear from you.