Senior Site Reliability Engineer

Basis Theory, United States
Remote
AI Summary

Design and implement reliable, scalable, and observable systems. Collaborate with product and platform teams to build a metrics-first culture. Drive deployment safety and canary rollouts.

Key Highlights
Lead efforts to turn metrics into action and improve system reliability
Design for performance, observability, and operational safety
Collaborate with product and platform teams to build a metrics-first culture
Drive deployment safety and canary rollouts
Partner with Engineering to evolve scaling patterns and proactive action
Technical Skills Required
AWS, Terraform, Kubernetes, Go, Python, Node.js, DataDog, Prometheus, OpenTelemetry, GitHub Actions, Jenkins, ArgoCD
Benefits & Perks
Remote work environment
Annual company get-together at a new tropical location
Monthly stipend for remote working environments

Job Description

As a Senior Site Reliability Engineer at Basis Theory, you will play a pivotal role in ensuring that our systems are measurable, reliable, and continuously improving, not just monitored. You’ll lead efforts to turn metrics into action, engineer for reliability and scalability across our platform, and build the guardrails that keep our systems resilient as we grow.

In this role, you’ll leverage your deep technical expertise to design for performance, observability, and operational safety. You will collaborate closely with our product and platform teams to build a metrics-first culture that prevents incidents before they happen and continuously improves developer velocity and customer trust.

About us:

Basis Theory offers a fully programmable vault to create engaging commerce flows, connect with any partner, effortlessly manage compliance, and keep control of payments data. Standing at the intersection of technology and commerce, Basis Theory’s PCI Level 1, SOC2 Type 2, and ISO 27001-compliant vault revolutionizes the way fintechs and merchants build their payment infrastructure. It provides unparalleled flexibility and customization, enabling businesses to tailor their payment stacks to their unique needs. From emerging fintech startups to established merchants, Basis Theory provides the tools and support necessary for each to craft a payment stack that perfectly aligns with their business model.

Basis Theory is building from first-hand experience at Twilio, Klarna, and Dwolla and has raised over $50 million from top-tier investors, including Bessemer Venture Partners, Costanoa Ventures, Stage 2 Capital, and Kindred Ventures. We are a globally distributed team that operates as a remote-first organization, from the monthly stipend for remote working environments to our annual company get-together at a new tropical location each year :)

What you’ll be responsible for:

  • Serving as a hands-on member of engineering, with a focus on reliability, performance, and observability.
  • Working closely with Principal Engineers and the CTO to define SLIs, SLOs, and error budgets for key systems.
  • Leading cost optimization efforts by improving our use of metrics vs. logs, right-sizing trace sampling, tuning ingestion/indexing, and exploring AWS-native monitoring alternatives.
  • Building and improving tooling for local and automated performance testing, and tracking benchmarks over time to identify bottlenecks.
  • Driving deployment safety and canary rollouts, using UAT as a testbed, and creating feedback loops that automatically assess rollout success.
  • Leading chaos and resilience testing, including monthly tabletop exercises, failover drills, and continuous verification of redundancy assumptions.
  • Partnering with Engineering to evolve scaling patterns (autoscaling, architectures, etc.), including taking proactive action when new features or metrics reveal risk.

You may be a good fit if:

  • You’re an engineer first: someone who wants to build, not just observe.
  • You have deep experience with observability tools (DataDog, Prometheus, OpenTelemetry) and can design for metrics-first reliability.
  • You care deeply about performance, reliability, and operational simplicity.
  • You have a strong understanding of CI/CD, deployment safety, and change control patterns.
  • You value collaboration and enjoy pairing with other engineers to improve standards, design decisions, and system health.
  • You thrive in ambiguity and want to shape what great SRE looks like in a product-led, engineering-driven culture.

Other experiences that may help:

  • Working with AWS at scale (EKS, Lambda, CloudWatch, DynamoDB, S3).
  • Building or scaling systems handling tens of millions of API calls per week.
  • Prior experience in payments, fintech, or other compliance-sensitive environments (PCI, SOC2).
  • Implementing automated performance and load testing pipelines.
  • Designing or running incident response and chaos testing programs.

Skillsets

Required

  • Production experience in cloud infrastructure and observability (AWS, Terraform, Kubernetes).
  • Strong systems and debugging skills across the stack (networking, services, data).
  • Experience designing and monitoring SLIs/SLOs, and reducing alert noise.
  • Ability to write code in one or more backend languages (Go, Python, or Node.js).
  • Experience with CI/CD tooling (e.g., GitHub Actions, Jenkins, ArgoCD).

Desired

  • Experience optimizing observability spend and tuning DataDog, Prometheus, or similar.
  • Experience with chaos engineering, progressive deployments, and auto-remediation.
  • Exposure to high-throughput, latency-sensitive, or globally distributed systems.

Experience

  • 5+ years in SRE, Platform, or DevOps roles building and maintaining production-scale distributed systems.

