Site Reliability Engineer - Production Systems Expert

humanitapp United States
Remote
AI Summary

Experienced Site Reliability Engineer needed for a remote contract opportunity. 3+ years of SRE, DevOps, or production engineering experience required. Proficient with observability stacks, Linux systems, and container orchestration.

Key Responsibilities
Author and review complex, realistic scenarios grounded in production incidents
Cover root cause analysis, monitoring and alerting, capacity planning, and post-incident remediation
Help evaluate and train AI models that reason about system failures and operational best practices
Technical Skills Required
Prometheus
Grafana
Datadog
PagerDuty
Linux systems
Networking (TCP/IP, DNS, load balancing)
Container orchestration (Kubernetes, Docker)
Infrastructure-as-code (Terraform, Pulumi, CloudFormation)
CI/CD pipelines
Benefits & Perks
$100-$160/hr
Remote work
Contract opportunity

Job Description


HumaniT is referring experienced Site Reliability Engineers to a remote contract opportunity with a platform trusted by leading AI labs and Fortune 10 companies.


Role: Site Reliability Engineer — Production Systems Expert

Type: Independent Contractor | Fully Remote

Location: United States only

Rate: $100–$160/hr

Start Date: Late March, with additional openings in April


Who this is for:

— 3+ years of SRE, DevOps, or production engineering experience at a big tech company or leading startup

— Experience serving in on-call rotations managing Tier 1/Tier 2 production services with meaningful SLA requirements

— Proficient with observability stacks: Prometheus, Grafana, Datadog, PagerDuty, or equivalent

— Deep knowledge of Linux systems, networking (TCP/IP, DNS, load balancing), and container orchestration (Kubernetes, Docker)

— Hands-on with infrastructure-as-code (Terraform, Pulumi, CloudFormation) and CI/CD pipelines

— Strong debugging skills from application-level tracing to kernel-level diagnostics


What you will do:

— Author and review complex, realistic scenarios grounded in production incidents

— Cover root cause analysis, monitoring and alerting, capacity planning, and post-incident remediation

— Help evaluate and train AI models that reason about system failures and operational best practices


This project is currently in a pilot phase — participants are expected to be highly engaged with project leadership.


Applications are reviewed on a rolling basis.


Explore more opportunities at humanitapp.com



