The following job is no longer available:
Principal Site Reliability Engineer

Posted 21 days ago

London, Greater London
Any
External
Expired - 2 months ago

The Role
We are looking for a Principal Site Reliability Engineer to lead Site Reliability at Prolific, focusing on advancing the resilience and scalability of our GCP and AWS environments. You will play a pivotal role in overseeing and enhancing our Kubernetes clusters in GCP, which support our Django application, and in driving SRE's strategic transition to AWS, particularly towards serverless and event-driven architectures.
What you'll be doing
Provide strategic oversight of continuous monitoring, maintenance, and optimisation processes for our Django application, ensuring the highest levels of performance and reliability.
Lead the evolution of our cloud Kubernetes estate, focusing on advanced security, reliability, and observability strategies.
Spearhead infrastructure optimisations and architectural improvements in collaboration with cross-functional teams, addressing complex challenges and ensuring scalability.
Promote knowledge sharing and reduce silos across teams to strengthen resilience and reduce dependency on key individuals, increasing our Bus Factor.
Drive hands-on coding and system design improvements, with a focus on Python/Django, to optimise system performance and efficiency.
Develop comprehensive documentation and training programs to elevate the operability skills of the engineering team and foster a culture of continuous learning.
Support our Service Delivery response strategies by taking part in an out-of-hours support rota and collaborating with our Service Delivery Lead to enhance overall service quality.
Lead security initiatives, addressing emerging threats, ensuring robust compliance, and setting best practices for the organisation.
What you’ll bring
Extensive experience as a Site Reliability Engineer / Platform Engineer, with proven staff-plus leadership in managing a large-scale enterprise Kubernetes platform in GCP.
Deep expertise in security, compliance, and cloud architecture best practices.
A track record of implementing observability-first approaches and familiarity with tools like Datadog.
Experience in leading out-of-hours incident management and on-call rotations.
Demonstrated ability to mentor teams, lead strategic initiatives, and drive significant technology transformations.
Certification in any of the following would be an advantage but is not required:
GCP Professional Cloud Architect
GCP Professional Cloud Security Engineer
GCP Professional Cloud Network Engineer
GCP Professional Cloud DevOps Engineer
CKA (Certified Kubernetes Administrator)