Join a growing FinTech company as a Site Reliability Engineer. This fully remote position requires 5+ years of experience in Site Reliability Engineering, with a focus on application performance monitoring, cloud experience, and CI/CD tools.
Job Description
OneSparQ is looking for a Site Reliability Engineer to join a growing FinTech company. This position is fully remote.
Required Skills:
- 5+ years of experience in Site Reliability Engineering
- Experience debugging complex problems
- Experience with application performance monitoring and observability tools such as Datadog
- Cloud experience, preferably in AWS
- Solid experience with scripting languages (e.g. shell scripts, Python)
- Git and CI/CD tools (GitHub, Jenkins, etc.)
- Experience with service-oriented architecture (SOA) using microservices
- Experience with scalable, high-performance, multi-tier enterprise application development
Additional Skills (not required):
- Experience with virtualization, containerization, and orchestration (e.g. VMware, Kubernetes)
- Experience with provisioning and configuration management solutions such as Terraform, Ansible, and SaltStack
Responsibilities:
- Keep services up and running, and restore them quickly when failures occur
- Deploy new builds to production
- Monitor application performance
- Work closely with internal partners and teams to ensure that we ship software that meets security, SLA, and performance requirements
- Implement operational automation (IaC) for monitoring, managing, deploying, and validating systems and applications
- Provide on-call support
- Manage and expand relationships with internal development and outsourced managed service partners for software systems design and development
- Triage alerts, diagnose and resolve critical issues, and manage the implementation of changes
- Coordinate capacity planning
- Develop CI/CD orchestration systems to reduce friction for software delivery to production
- Define, execute, and analyze Operational Acceptance Test initiatives
- Write, update, and use documentation, including runbooks/playbooks
- Automate work including infrastructure needs, testing, failover solutions, failure mitigation, and much more