Site Reliability Engineer

Posted 12 days ago

Nottingham, Nottinghamshire
Expires In 3 months

As a Site Reliability Engineer within the SRE team, you’ll be focused on monitoring and supporting the applications hosted in AWS environments for platforms and tools utilised by our customers. The SRE team specialises in giving delivery squads visibility of the performance of their services in production and support to investigate and contain potential problems.

Unlike traditional development roles, this position won’t have you building features. Instead, you’ll dive deep into troubleshooting issues, implementing automation solutions, containing bugs and implementing proactive measures to uphold our system’s integrity and performance. You’ll have freedom to help research and recommend solutions for hosting applications at scale. You’ll be fundamental in incident response, troubleshooting and containing issues.

Key responsibilities

Debug Node.js applications and contribute to their optimization and performance tuning.
Configuration and ongoing management of environments and services on AWS.
Enhancing tools and processes for monitoring scalable applications on AWS.
Maintaining high availability through proactive measures.
Troubleshooting and resolving complex technical issues.
Documentation of Standard Operating Procedures (SOPs).
Automation of SOPs and Run Books.
Respond to issues outside of working hours as per the on-call rota.

Basic Qualifications

Experience implementing environments for web-based microservices.
Experience of supporting MongoDB-based web applications.
Experience of engineering, architecting, or supporting AWS solutions.
Familiarity with cloud virtualisation tools such as ECS and/or Docker containers.
Experience working with automated deployment systems (e.g. CloudFormation, CodeBuild).
Familiarity with any monitoring tool, e.g. New Relic, Datadog, Prometheus or Grafana.
Experience in automation of workloads using a scripting language like Python or JavaScript.
Strong problem-solving skills and the ability to troubleshoot complex issues.
Good understanding of incident response best practices, post-incident reviews, and continuous improvement.
Ability and willingness to proactively improve ways of working and processes.
Desire to continually grow, develop and improve.
Experience debugging Node.js applications.

Useful Skills

Understanding of REST, GraphQL and asynchronous messaging.
Experience of using Git for version control.
Experience of Continuous Integration and Deployment advantageous.
Familiarity with core SRE principles such as monitoring, alerting, error budgets and fault analysis.
Excellent written and verbal communication skills.
Familiarity with IT compliance and risk management requirements (e.g. security, privacy, GDPR).

What we will offer you

It’s no secret that fast-growing SaaS businesses are one of the most exciting places to start your career, and Thrive is no exception. Our team thinks differently to other businesses and we offer our employees something different, too. We’re all about trust, autonomy and doing the right thing, and we’re proud of the benefits we offer to our team.

You’ll receive:

Unlimited annual holiday. You did read that right!
Flexible working hours
Modern and lively offices with a fantastic culture
Unbeatable THRIVE social events
Health Plan
Employee Assistance Program
Referral Scheme