Senior Site Reliability Engineer (Remote)

groupa United States
Remote
AI Summary

Join groupa's client as a full-time, remote Site Reliability Engineer focused on building resilient infrastructure for critical sectors. You'll leverage AI and external data to prevent catastrophic losses for utilities and infrastructure owners. This role involves designing high-availability systems, ensuring robust security, and actively troubleshooting production issues. Contribute to a mission-driven team that prioritizes problem-solving, new technologies, and community impact.

Key Highlights
Design and maintain high-availability systems with a focus on uptime during upgrades and maintenance.
Implement and manage infrastructure-as-code, CI/CD pipelines, and build engineering best practices.
Ensure system and network security through patch management, vulnerability mitigation, and dependency updates.
Set up and monitor logging, metrics, and alerting systems for proactive issue detection.
Diagnose and resolve production issues, contributing to post-mortems and participating in on-call rotations.
Provide direct customer support, gathering feedback and resolving technical issues.
Guide development teams on deployability and maintainability best practices.
Continuously learn and stay updated on industry best practices, tools, and technologies.
Mentor other SREs, providing technical guidance and expertise.
Technical Skills Required
GCP, Azure, AWS, Kubernetes, CI/CD, Infrastructure-as-Code, Operating Systems, Computer Architecture, Programming (for diagnosis, IaC, scripting)
Benefits & Perks
Full-time
Direct hire
Fully remote

Job Description


We are recruiting for a full-time, direct, and fully remote Site Reliability Engineer to join our client company’s Resiliency Solutions team to help make communities more adaptive and sustainable. This is done by pairing external data with artificial intelligence to identify areas of high risk and prevent catastrophic loss for utilities and critical infrastructure owners across the country. Join a team of close-knit engineers, subject matter experts, and business leaders who obsess over problem-solving, new technologies, and making a positive impact in our communities.


Duties & Responsibilities:

  • Design High-Availability Systems - ensure that all of the systems that we deploy and depend on are configured to maintain full uptime. Plan out deployment strategies to ensure that uptime is maintained during upgrades and maintenance. Design and build out infrastructure-as-code projects. Perform resiliency, load, and disaster recovery tests.
  • Maintain System and Network Security - manage patches and ensure that dependencies are kept up to date. Stay informed about zero-day vulnerabilities and any other risks that cannot be patched immediately, and devise alternative ways to mitigate them.
  • Logging, Metrics and Alerting - set up and monitor logs, metrics, and alerts for the systems.
  • Diagnosis and Troubleshooting - diagnose and resolve production issues. Contribute to retrospectives and post-mortems. Participate in the on-call rotation.
  • Customer Support - regularly interface directly with customers to take feedback and provide top-tier customer support in resolving issues.
  • Guiding Development Team with Best Practices - work with the development team to ensure that the software being built will be practical to deploy and maintain.
  • Build Engineering - manage build and deployment pipelines and ensure that they follow best practices.
  • Continuous Learning - stay up to date with industry best practices, tools, and technologies related to infrastructure.
  • Mentorship - Work with a team of SREs, providing guidance, coaching, and technical expertise in infrastructure management.


Required Skills & Experience:

  • 5+ years of experience designing and maintaining application systems in the cloud - GCP (preferred), Azure, or AWS
  • Extensive experience in Kubernetes and CI/CD pipelines
  • Strong experience working directly with customers to take feedback and resolve issues
  • Ability to provide top-tier customer service
  • Bachelor's degree in a related field or equivalent experience.
  • People first, technology second.
  • A deep understanding of operating systems and computer architecture
  • Good programming abilities - for application diagnosis, infrastructure-as-code, scripting, and glue components.
  • Excellent communication and organizational skills are a must.

