Senior Observability Engineer - Build & Scale Distributed Systems

Jobgether • United States
Remote
AI Summary

Design, build, and operate enterprise-grade observability platforms. Collaborate with SREs and product teams to define SLOs and transform raw telemetry into actionable insights. Improve incident response efficiency in a fast-paced, engineering-driven environment.

Key Highlights

  • Build and scale observability platforms across metrics, logs, traces, and events
  • Define and implement SLOs, SLIs, and alerting strategies
  • Develop high-quality dashboards and observability standards
  • Manage distributed tracing pipelines and optimize large-scale time-series systems

Technical Skills Required

  • Prometheus, Grafana, Loki, Tempo, OpenTelemetry, Datadog, Go, Python, Java

Benefits & Perks

  • Competitive annual salary ranging from $100,000 to $150,000
  • 100% remote role within the continental United States
  • Comprehensive benefits package

Job Description


This position is listed on behalf of a partner company, which manages all applications and next steps. Our partner is looking for an Observability Engineer based in the United States.

This role is focused on building and scaling the observability backbone that enables engineering teams to operate complex distributed systems with confidence. You will design and run end-to-end telemetry platforms covering metrics, logs, traces, and events, ensuring high signal quality and operational reliability. The position spans both infrastructure and software engineering, combining platform architecture with hands-on implementation of monitoring, alerting, and tracing systems. You will work closely with SREs, platform engineers, and product teams to define meaningful SLOs and transform raw telemetry into actionable insights. The environment is fast-paced and engineering-driven, with a strong emphasis on automation, scalability, and developer experience. This is a high-impact role where your work directly influences system reliability, incident response efficiency, and production visibility across the organization.

Accountabilities

  • Design, build, and operate enterprise-grade observability platforms across metrics, logs, traces, and events.
  • Architect and maintain scalable monitoring stacks using Prometheus, Grafana, Loki, Tempo, OpenTelemetry, and Datadog.
  • Define and implement SLOs, SLIs, error budgets, and alerting strategies aligned with system reliability goals.
  • Develop high-quality dashboards, alerts, and observability standards to reduce noise and improve signal accuracy.
  • Manage distributed tracing pipelines and enable teams to diagnose latency and performance issues effectively.
  • Operate large-scale time-series and log systems, optimizing for performance, retention, and cost efficiency.
  • Build self-service observability tooling, templates, and libraries to improve adoption across engineering teams.
  • Integrate observability practices into CI/CD pipelines, incident response workflows, and progressive delivery systems.
  • Improve incident response readiness through better alerting hygiene, dashboards, and postmortem tooling.
  • Maintain clear documentation, onboarding guides, and runbooks for observability systems and standards.
  • Mentor engineers on observability best practices, debugging techniques, and SRE principles.

Requirements

  • Bachelor’s degree in Computer Science or a related technical field.
  • 5+ years of experience in SRE, platform engineering, or observability-focused roles.
  • Strong hands-on experience with Prometheus, Grafana, and at least one commercial observability tool (Datadog, New Relic, or Splunk).
  • Deep understanding of OpenTelemetry, distributed tracing, and structured logging practices.
  • Proficiency in at least one programming language (Go, Python, or Java).
  • Experience operating high-scale metrics and logging pipelines with attention to performance and cost.
  • Strong knowledge of SLOs, error budgets, and reliability engineering principles.
  • Experience integrating observability into CI/CD pipelines and incident management tools.
  • Solid understanding of Linux systems, networking fundamentals, and containerized environments.
  • Strong communication skills and ability to collaborate across engineering and operations teams.
  • Exposure to tools such as Thanos, Mimir, Cortex, Loki, or Tempo is a plus.
  • Experience with observability cost optimization or eBPF-based tooling is a strong advantage.

Benefits

  • Competitive annual salary ranging from $100,000 to $150,000 based on experience.
  • 100% remote role within the continental United States.
  • Full-time W-2 employment with a stable, multi-year engagement.
  • Comprehensive benefits package including healthcare and standard employee benefits.
  • Opportunity to work on large-scale distributed systems and modern observability stacks.
  • Exposure to industry-leading tools and cloud-native observability technologies.
  • Strong engineering culture focused on reliability, automation, and continuous improvement.
  • Career growth opportunities in SRE, platform engineering, and cloud observability domains.

How Jobgether Works

We use an AI-powered matching process to ensure your application is reviewed quickly, objectively, and fairly against the role's core requirements. Our system identifies the top-fitting candidates, and this shortlist is then shared directly with the hiring company. The final decision and next steps (interviews, assessments) are managed by their internal team.

We appreciate your interest and wish you the best!

Why Apply Through Jobgether?

Data Privacy Notice: By submitting your application, you acknowledge that Jobgether will process your personal data to evaluate your candidacy and share relevant information with the hiring employer. This processing is based on legitimate interest and pre-contractual measures under applicable data protection laws (including GDPR). You may exercise your rights (access, rectification, erasure, objection) at any time.

We may use artificial intelligence (AI) tools to support parts of the hiring process, such as reviewing applications, analyzing resumes, assessing responses, and identifying potential inconsistencies or verification signals in application materials based on available information. These tools assist our recruitment team but do not replace human judgment. Final hiring decisions are made by humans. If you would like more information about how your data is processed, please contact us.

