Site Reliability Engineer

ARA • United States

Remote

This job is no longer active. The position is no longer accepting applications.

Technical Skills Required

Linux systems administration, Kubernetes, Helm, Kustomize, GitLab, Artifactory, Jira, Confluence, GitOps, FluxCD, ArgoCD, Python, Go, Bash, monitoring, dashboards, logging, tracing

Benefits & Perks

Remote work

Security clearance

Job Description

Essential Functions

Partner with software developers, platform engineers, and IT staff to improve system design, operability, deployment safety, and production support readiness.
Define and maintain operational standards, runbooks, support procedures, escalation paths, and service-level objectives.
Evaluate system architecture and changes to ensure they balance functional requirements, service quality, reliability, security, and compliance needs.
Drive continuous improvement in platform stability, maintenance, and availability.
Provide advanced technical support and troubleshooting for complex platform and service issues affecting internal users and stakeholders.

Experience And Skills Required

8+ years of experience in Site Reliability Engineering, DevOps, Platform Engineering, Systems Engineering, or related infrastructure roles supporting production services.
Strong experience with Linux systems administration and troubleshooting in enterprise environments.
Strong experience operating and maintaining on-prem Kubernetes platforms and all related components including CRI, CNI, and CSI plugins.
Experience deploying and maintaining applications on Kubernetes using Helm, Kustomize, and similar tooling.
Experience supporting DevOps tooling such as GitLab, Artifactory, Jira, Confluence.
Experience with GitOps tools such as FluxCD or ArgoCD.
Proficiency scripting with at least one of Python, Go, or Bash.
Strong experience designing, maintaining, and maturing observability tooling including monitoring, dashboards, logging and tracing, and supporting SLOs.
Strong understanding of reliability engineering concepts:

Service health indicators
High availability design, failure reduction, and testing
Operational readiness practices, including developing documentation, runbooks, and architectural descriptions
Incident response, root cause analysis, remediation/recovery

Ability to obtain a security clearance, which requires U.S. citizenship.

Preferred

Experience with multiple Linux distributions including Ubuntu.
Experience with at least one of the following: Tanzu Kubernetes, Nutanix Kubernetes Platform, Canonical Kubernetes.
Experience with cloud platforms such as AWS and Azure.
Experience with infrastructure automation and configuration management.
Experience managing AI tooling on Kubernetes including MCP Servers, LLM platforms (vLLM, Ollama), Kubeflow.
Experience with security and compliance considerations in regulated environments.
DoD experience.
Active or inactive Secret Security Clearance.

Education

Bachelor’s degree in CS, Software Engineering, or another IT-related field, or equivalent experience.

REMOTE WORK NOTICE: This position may be performed fully remote, hybrid, or onsite at an ARA office. Preference will be given to candidates located onsite in the Albuquerque area.

Job Overview

Posted Date: May 02, 2026

Employment Type: Full-time

Experience Level: Not Applicable

Location: United States

Category: DevOps

Company: ARA
