Join groupa's client as a full-time, remote Site Reliability Engineer focused on building resilient infrastructure for critical sectors. You'll leverage AI and external data to prevent catastrophic losses for utilities and infrastructure owners. This role involves designing high-availability systems, ensuring robust security, and actively troubleshooting production issues. Contribute to a mission-driven team that prioritizes problem-solving, new technologies, and community impact.
Job Description
We are recruiting a full-time, direct-hire, fully remote Site Reliability Engineer to join our client company’s Resiliency Solutions team and help make communities more adaptive and sustainable. The team pairs external data with artificial intelligence to identify areas of high risk and prevent catastrophic loss for utilities and critical infrastructure owners across the country. Join a close-knit team of engineers, subject matter experts, and business leaders who obsess over problem-solving, new technologies, and making a positive impact in our communities.
Duties & Responsibilities:
- Design High-Availability Systems - ensure that all of the systems that we deploy and depend on are configured to maintain full uptime. Plan out deployment strategies to ensure that uptime is maintained during upgrades and maintenance. Design and build out infrastructure-as-code projects. Perform resiliency, load, and disaster recovery tests.
- Maintain System and Network Security - manage patches and keep dependencies up to date. Stay informed about zero-day vulnerabilities and other risks that cannot be patched immediately, and devise alternative ways to mitigate them.
- Logging, Metrics and Alerting - set up and monitor logs, metrics, and alerts for the systems.
- Diagnosis and Troubleshooting - diagnose and resolve production issues. Contribute to retrospectives and post-mortems. Participate in the on-call rotation.
- Customer Support - regularly work directly with customers to gather feedback and provide top-tier support in resolving issues.
- Guiding Development Team with Best Practices - work with the development team to ensure that the software being built will be practical to deploy and maintain.
- Build Engineering - manage build and deployment pipelines and ensure they follow best practices.
- Continuous Learning - stay up to date with industry best practices, tools, and technologies related to infrastructure.
- Mentorship - work with a team of SREs, providing guidance, coaching, and technical expertise in infrastructure management.
Required Skills & Experience:
- 5+ years of experience designing and maintaining application systems in the cloud: GCP (preferred), Azure, or AWS
- Extensive experience in Kubernetes and CI/CD pipelines
- Strong experience working directly with customers to take feedback and resolve issues
- Ability to provide top-tier customer service
- Bachelor's degree in a related field or equivalent experience.
- People first, technology second.
- A deep understanding of operating systems and computer architecture
- Good programming abilities for application diagnosis, infrastructure-as-code, scripting, and glue components.
- Excellent communication and organizational skills are a must