Senior Site Reliability Engineer (Remote)

groupa • United States

Remote

AI Summary

Join groupa's client as a full-time, remote Site Reliability Engineer focused on building resilient infrastructure for critical sectors. You'll leverage AI and external data to prevent catastrophic losses for utilities and infrastructure owners. This role involves designing high-availability systems, ensuring robust security, and actively troubleshooting production issues. Contribute to a mission-driven team that prioritizes problem-solving, new technologies, and community impact.

Key Highlights

Design and maintain high-availability systems with a focus on uptime during upgrades and maintenance.

Implement and manage infrastructure-as-code, CI/CD pipelines, and build engineering best practices.

Ensure system and network security through patch management, vulnerability mitigation, and dependency updates.

Set up and monitor logging, metrics, and alerting systems for proactive issue detection.

Diagnose and resolve production issues, contributing to post-mortems and participating in on-call rotations.

Provide direct customer support, gathering feedback and resolving technical issues.

Guide development teams on deployability and maintainability best practices.

Continuously learn and stay updated on industry best practices, tools, and technologies.

Mentor other SREs, providing technical guidance and expertise.

Technical Skills Required

GCP, Azure, AWS, Kubernetes, CI/CD, Infrastructure-as-Code, Operating Systems, Computer Architecture, Programming (for diagnosis, IaC, scripting)

Benefits & Perks

Full-time

Direct hire

Fully remote

Job Description

We are recruiting for a full-time, direct-hire, fully remote Site Reliability Engineer to join our client company’s Resiliency Solutions team and help make communities more adaptive and sustainable. The team does this by pairing external data with artificial intelligence to identify areas of high risk and prevent catastrophic loss for utilities and critical infrastructure owners across the country. Join a close-knit team of engineers, subject matter experts, and business leaders who obsess over problem-solving, new technologies, and making a positive impact in our communities.

Duties & Responsibilities:

Design High-Availability Systems - Ensure that all of the systems we deploy and depend on are configured to maintain full uptime. Plan deployment strategies so that uptime is maintained during upgrades and maintenance. Design and build out infrastructure-as-code projects. Perform resiliency, load, and disaster recovery tests.
Maintain System and Network Security - Manage patches and ensure that dependencies are kept up to date. Stay informed about zero-day vulnerabilities and any risks that cannot be patched immediately, and devise alternative methods to mitigate them.
Logging, Metrics, and Alerting - Set up and monitor logs, metrics, and alerts for the systems.
Diagnosis and Troubleshooting - Diagnose and resolve production issues. Contribute to retrospectives and post-mortems. Participate in the on-call rotation.
Customer Support - Regularly interface directly with customers to take feedback and provide top-tier customer support in resolving issues.
Guiding the Development Team with Best Practices - Work with the development team to ensure that the software being built is practical to deploy and maintain.
Build Engineering - Manage build and deployment pipelines and ensure that best practices are followed.
Continuous Learning - Stay up to date with industry best practices, tools, and technologies related to infrastructure.
Mentorship - Work with a team of SREs, providing guidance, coaching, and technical expertise in infrastructure management.

Required Skills & Experience:

5+ years of experience designing and maintaining application systems in the cloud - GCP (preferred), Azure, or AWS
Extensive experience in Kubernetes and CI/CD pipelines
Strong experience working directly with customers to take feedback and resolve issues
Ability to provide top-tier customer service
Bachelor's degree in a related field or equivalent experience.
People first, technology second.
A deep understanding of operating systems and computer architecture
Good programming abilities - for application diagnosis, infrastructure-as-code, scripting, and glue components
Excellent communication and organizational skills are a must

Job Overview

Posted Date Dec 03, 2025

Employment Type Full-time

Experience Level Mid-Senior level

Location United States

Category DevOps

Company groupa
