Senior Platform & Site Reliability Engineer - Cloud Platform Leadership

genius match • Portugal
Remote
AI Summary

Lead platform architecture and engineering standards for a rapidly growing enterprise software organization. Own and operate the shared cloud platform, including CI/CD, observability, and event streaming infrastructure. The role involves daily collaboration with a U.S.-based team, with availability required until at least 3:00 PM EST.


Job Description


Our client is a rapidly growing enterprise software organization that acquires and scales B2B SaaS products. They are building a shared cloud platform that serves as the engineering foundation for a growing portfolio of enterprise applications. This platform provides standardized infrastructure, deployment, observability, automation, and reliability capabilities across multiple products while enabling future growth without proportionally increasing operational complexity.

The organization is investing in modern platform engineering practices, cloud-native technologies, Infrastructure as Code, AI-assisted engineering, and operational automation to build a scalable, highly reliable engineering ecosystem.

They are looking for an experienced Senior Platform & Site Reliability Engineer to take ownership of the shared platform, establish engineering standards, and design the infrastructure that supports multiple enterprise SaaS products. This is a hands-on technical leadership role where you will influence platform architecture, developer experience, operational reliability, and engineering best practices.

Working Hours: This role requires daily collaboration with a U.S.-based engineering team. Candidates must be available to work until at least 3:00 PM EST (U.S. Eastern Time), with flexibility to work beyond these hours when business needs require.

Responsibilities

Platform Engineering

  • Own the architecture and operation of the shared platform, including CI/CD, observability, deployment automation, secrets management, and developer tooling.
  • Define, implement, and enforce platform engineering standards across multiple products.
  • Build and maintain Infrastructure as Code using Terraform or OpenTofu, ensuring all infrastructure is version-controlled, reviewed, and provisioned through automation.
  • Develop self-service platform capabilities that enable engineering teams to deploy independently.

Event Streaming & Data Processing

  • Design and maintain event streaming infrastructure supporting real-time processing workloads.
  • Build and support batch processing infrastructure alongside live transactional systems.
  • Ensure reliability, scalability, performance, and cost efficiency of platform services.

CI/CD & Deployment

  • Design, build, and maintain CI/CD pipelines using GitHub Actions.
  • Automate recovery for common pipeline failures and improve deployment reliability.
  • Implement release management strategies, rollback mechanisms, and deployment patterns such as canary or blue-green deployments where appropriate.

Observability & Site Reliability

  • Own and maintain the observability platform using Grafana, Prometheus, Loki, CloudWatch, and related monitoring tools.
  • Define Service Level Objectives (SLOs), error budgets, and reliability metrics across multiple products.
  • Build intelligent alerting and monitoring solutions that provide actionable diagnostic information.
  • Design incident response processes, escalation procedures, and post-incident review practices.
  • Implement safe automated remediation for well-understood operational scenarios while ensuring human oversight for complex incidents.

Platform Expansion & Integration

  • Assess newly onboarded products for infrastructure maturity, Infrastructure as Code coverage, observability, and security.
  • Plan and execute platform integration and modernization initiatives while minimizing operational disruption.
  • Support the adoption of standardized platform capabilities across multiple engineering teams.

Engineering Automation

  • Leverage AI-assisted engineering tools and automation where appropriate to reduce operational overhead.
  • Automate infrastructure provisioning, CI/CD workflows, monitoring, secrets management, and operational tasks while maintaining engineering oversight for high-impact decisions.

Preferred Technology Stack

  • AWS
  • Terraform / OpenTofu
  • GitHub Actions
  • Grafana
  • Prometheus
  • Loki
  • AWS CloudWatch
  • AWS Secrets Manager or HashiCorp Vault
  • Amazon ECS and EKS
  • Event streaming technologies
  • Cost monitoring and cloud optimization tools

Requirements

  • 8–12 years of experience in Platform Engineering, Site Reliability Engineering (SRE), DevOps, or Cloud Infrastructure Engineering.
  • Proven experience designing and operating production platform infrastructure across multiple environments or products.
  • Strong hands-on experience with Terraform (or OpenTofu) and Infrastructure as Code.
  • Extensive experience designing and maintaining CI/CD pipelines using GitHub Actions.
  • Experience operating event streaming infrastructure in production environments.
  • Strong AWS expertise, including ECS, EKS, IAM, VPC, RDS, CloudWatch, networking, and cloud infrastructure.
  • Hands-on experience with Grafana, Prometheus, Loki, and enterprise observability platforms.
  • Strong understanding of SRE principles, including SLOs, error budgets, incident response, and operational excellence.
  • Experience designing scalable, secure, highly available cloud infrastructure.
  • Strong troubleshooting, automation, and problem-solving skills.
  • Excellent communication skills with the ability to establish engineering standards across multiple teams.

Nice to Have

  • Experience building shared platform engineering capabilities supporting multiple products or business units.
  • Experience integrating newly acquired products or modernizing legacy platforms.
  • Experience designing developer self-service platforms.
  • Familiarity with AI-assisted engineering workflows and infrastructure automation.
  • Experience supporting high-volume enterprise SaaS products and distributed systems.
  • Strong focus on cloud cost optimization and operational efficiency.

What We Offer

  • Competitive market salary.
  • Fully remote work.
  • Opportunity to build and shape the engineering platform supporting a growing portfolio of enterprise SaaS products.
  • Work alongside experienced international engineering teams.
  • Exposure to modern cloud technologies, AI-assisted engineering, automation, and large-scale platform initiatives.
  • Professional growth through ownership of platform architecture, operational reliability, and engineering standards.
  • Daily collaboration with a U.S.-based engineering team, with availability required until at least 3:00 PM EST and flexibility to work longer when needed.


