Site Reliability Engineer (SRE) - Observability and Reliability

Optomi • United States
Remote
AI Summary

Design, scale, and optimize observability solutions using Prometheus, Grafana, and Dynatrace. Apply SRE principles to improve application reliability and drive reliability maturity across multi-team environments.

Key Highlights
Design and manage Prometheus and Grafana environments
Apply SRE principles to improve application reliability
Drive reliability maturity across multi-team environments
Technical Skills Required
Prometheus, Grafana, PromQL, Dynatrace, Kubernetes, cloud platforms (AWS/GCP/Azure), CI/CD pipelines

Job Description


Site Reliability Engineer

*6-12 month contract*

Fully Remote


Optomi, in partnership with one of our premier clients in the telecommunications industry, is seeking a highly skilled Site Reliability Engineer (SRE) with strong observability expertise, proven communication skills, and the ability to drive reliability maturity across multi-team environments. This role is ideal for someone who can blend deep technical proficiency with strategic thinking and collaborative influence.


Key Responsibilities

Observability Engineering

  • Design, scale, optimize, and manage Prometheus and Grafana environments.
  • Write advanced PromQL queries, dashboards, visualizations, and metric-based calculations.
  • Build out and maintain Grafana instances, supporting multi-team use cases.
  • Leverage Dynatrace with strong proficiency in metrics and analytics to deliver efficient, actionable observability solutions for engineering and operations teams (e.g., dashboards, insights, reports).
  • Analyze telemetry data to identify the metrics that matter (MTM), drive actionable insights, and influence engineering decisions.


Site Reliability Engineering

  • Apply and evolve an SRE Maturity Model to help teams mature across observability, resilience, automation, and reliability.
  • Establish, implement, and maintain Service Level Objectives (SLOs) and error budgets across applications and services.
  • Partner effectively with engineering, product, operations, and leadership teams; translate complex technical insights into clear, actionable communication.
  • Identify and reduce toil through automation, tooling improvements, and process refinement.
  • Support incident analysis, reliability reviews, and continuous improvement initiatives.
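
For candidates less familiar with the SLO and error budget work mentioned above, the arithmetic can be sketched roughly as follows. This is only an illustration: the function name, the 99.9% target, and the request counts are hypothetical and are not taken from the client's environment.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Compute a simple request-based error budget.

    slo_target is the fraction of requests that must succeed (e.g. 0.999).
    All inputs here are hypothetical example values.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests < 0 or failed_requests < 0:
        raise ValueError("request counts must be non-negative")

    # The budget is the number of requests allowed to fail within the window.
    allowed_failures = (1.0 - slo_target) * total_requests

    # Fraction of the budget still unspent; negative means the SLO is breached.
    if allowed_failures > 0:
        budget_remaining = 1.0 - failed_requests / allowed_failures
    else:
        budget_remaining = 0.0

    return {
        "allowed_failures": allowed_failures,
        "budget_remaining": budget_remaining,
        "slo_met": failed_requests <= allowed_failures,
    }


if __name__ == "__main__":
    # Hypothetical 30-day window: 1,000,000 requests, 250 failures, 99.9% SLO.
    print(error_budget(0.999, 1_000_000, 250))
```

With a 99.9% target over one million requests, roughly 1,000 failures are allowed, so 250 failures would leave about three quarters of the budget unspent.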


Required Skills & Experience

  • Familiarity with SRE principles, maturity models, and reliability roadmaps.
  • Demonstrated experience improving application reliability via data-driven decisions.
  • Hands-on experience with Prometheus, Grafana, and PromQL.
  • Strong understanding of Dynatrace, metric analysis, and observability practices.
  • Excellent communication skills and ability to collaborate across diverse technical and non-technical teams.
  • Strong analytical and problem-solving skills with a bias for action.


Nice to Have

  • Experience with Kubernetes, cloud platforms (AWS/GCP/Azure), or CI/CD pipelines.
  • Experience with automation.
  • Experience with large-scale distributed systems or high-availability architectures.
