Site Reliability Engineering

London, Greater London

At Globality, we’re proud to embody the core values of innovation, collaboration, and trust in both our culture and product.
We’re creating ground-breaking technology utilizing a world-class, AI-powered Platform that revolutionizes how businesses buy and sell services. We are an open, inclusive, and diverse organization and our employees are at the heart of the great products we create.
We’ve raised over $310M and are supported by an impressive group of prominent investors, including Al Gore and SoftBank Vision Fund. Our co-founders, Joel Hyatt and Lior Delgo, are seasoned entrepreneurs who bring extensive business-building experience to our organization. Our impressive board includes Dennis Nally (former Global Chairman of PwC) and Ron Johnson (former SVP of Apple).
We’re excited to deliver the best in both innovative technologies and customer-focused experiences to realize our mission of creating a more inclusive global economy. Come help us build something great!
Role Summary:
Site Reliability Engineers (SREs), a unit within the broader Production Engineering team, are responsible for keeping all customer-facing services and other Globality production systems running smoothly.
SREs are a blend of pragmatic technical operators and tooling craftspeople who apply sound engineering principles, operational discipline, and mature automation to our production environment and the Globality codebase. We are a DevOps-driven culture with a particular team interest in improving our product stack insight, automation tooling, and scalability.
Globality's product stack is unique, which brings unique challenges: it is a ground-breaking technology utilizing a world-class, AI-powered microservices platform that revolutionizes how businesses buy and sell services. The experience of our team feeds back into other engineering groups within the company, perpetuating product improvement.
As an SRE you will:
Be part of the team responsible for managing an enterprise-grade AI-driven data and messaging platform.
Protect the health of the Production environment.
Be on the (non-overnight) on-call rotation to respond to Globality availability incidents and provide support for other customer-impacting incidents.
Use your on-call shift to prevent incidents from ever happening.
Run our infrastructure with tools like Spinnaker, Terraform, and Kubernetes.
Help ensure our monitoring and alerting fire on symptoms rather than on outages.
Document every action so your findings turn into repeatable actions…and then into automation.
Work with the Infrastructure and QA/TestEng teams to make the deployment process as efficient and boring as possible.
Design, build, and maintain core production infrastructure pieces.
Work with the architects to implement the baseline technologies, policies, and practices to build a high-velocity, high-security, strong-compliance platform that allows Globality to scale to support exponential growth.
Keep a keen eye on security issues in every project you work on, contributing to improving security in the systems that were already in place.
Debug production issues across services and levels of the stack.
Help plan the growth of Globality's infrastructure.
Establish strong relationships with other teams in order to positively influence them in their pursuit of automation and toil reduction, and to keep the rest of our team apprised of upcoming initiatives.
You may be a fit for this role if you:
Think deeply about edge cases, points of failure, failure modes, and systemic behaviors.
Embrace a DevOps philosophy.
Know your way around Linux and the command line.
Feel comfortable working toward delivering an end-to-end seamless CI/CD pipeline, with a goal of delivering code into production as swiftly as possible, while working with the QA/TestEng and Infrastructure teams to ensure that code is production worthy.
Have strong programming skills in Python, Go, Ruby, or similar languages.
Maintain “production grade” adherence to best practices for the lowliest tools and scripts.
Embrace collaboration and are comfortable with communicating asynchronously.
Are driven to document, document, document so you don't need to learn (or teach) the same thing twice.
Have an enthusiastic, driven, go-for-it attitude. Are compelled to fix broken things and improve less-than-ideal things.
Have experience with Drone.io, Jenkins, Docker, Kubernetes, Terraform, Elasticsearch, or similar technologies.
Have experience using the advanced tools of AWS, GCP, or other cloud providers.
Projects you could work on:
Improve production infrastructure automation with Ansible or Terraform.
Improve our Metrics collection scope or improve our metrics-driven Monitoring story.
Work with the QA / Test Engineering team to fully pipeline our internal tools.
Work with Test Engineering on scale testing initiatives.
Reduce the noise-to-signal ratio in our alerting.
Develop a relationship with a product group, define their SLOs, help analyze our metrics data on those SLOs and improve their reliability.
Leveling of Site Reliability Engineers at Globality
Areas of expertise/contribution for up-leveling:
Technical:
Use Ansible to efficiently manage our infrastructure
Further our "Infrastructure as Code" mission using Terraform and CI/CD-focused automation
Administration of a variety of high-availability clusters.
Firm grasp of Metrics and Monitoring systems, Grafana visualization implementation, and delivery of well-targeted alerting with Slack/PagerDuty integrations.
Logging infrastructure
Backend storage management and scaling
Disaster Recovery and High Availability strategy
Script / tool authoring
Knowledge of Globality product stack and service interoperations
Contributing to code in Globality
Execution:
Team organization and planning
Issue, Epic, OKR/KPI leadership and completion
Collaboration and Communication:
Creating blog posts / confluence articles
Completing Root Cause Analysis (RCA) investigations
Contributions to handbook, runbooks, general documentation
Leading and contributing to designs for issues, epics, KPIs
Improving team practices in handoffs of work and incidents
Influence and Maturity:
Involvement in hiring process – developing/reviewing questionnaires, involved in interviews, qualifying candidates
Knowledge sharing, mentoring
Accountability, self-awareness, handling conflict in the team and receiving feedback
Maintaining good relationships with other engineering teams in Globality that help improve the product
Levels for Site Reliability Engineer
Site Reliability Engineer I
Are early-career Site Reliability Engineers who are expected to work toward:
Technical:
General knowledge of at least 4 of the areas of technical expertise with deep knowledge in at least 1 area
Are able to write basic scripts and alter existing scripts
Execution:
Provides timely responses to requests from Globality teammates, reacts to alerts from monitoring, and escalates appropriately when needed.
Proposes ideas and solutions within the Production Engineering team to reduce the workload through automation.
Executes solutions within the production ecosystem to reach specific goals agreed upon within the team.
Executes configuration change operations at the infrastructure level.
Actively looks for opportunities to improve the availability and performance of the system by applying the knowledge gained from monitoring and observation
Collaboration and Communication:
Improves documentation all around, either in application documentation or in runbooks, explaining the ‘why’ and ‘how’, not stopping with the ‘what’.
Does not allow outdated/deprecated information to go un-flagged.
Influence and Maturity:
Shares gained knowledge readily with the team, either by creating issues that provide context for anyone to understand it or by writing Confluence articles.
Contributes to the hiring process by being part of the interview team to evaluate SRE candidates for team fit
Site Reliability Engineer II
Are experienced Site Reliability Engineer I’s who meet the following criteria:
Technical:
General knowledge of 5+ of the areas of technical expertise with deep knowledge in at least 2 areas.
Are able to write well-crafted scripts and basic tools
Execution:
Provides emergency response either by being on-call or by reacting to symptoms according to monitoring and seeing them through to resolution or escalating as appropriate.
Proposes ideas and solutions within the infrastructure team to reduce the workload through automation.
Plans, designs, and executes solutions within the infrastructure team to reach specific goals agreed upon within the team.
Plans and executes configuration change operations at both the application and the infrastructure level.
Actively looks for opportunities to improve the availability and performance of the system by applying the knowledge gained from monitoring and observation
Collaboration and Communication:
Improves documentation all around, either in application documentation or in runbooks, explaining the ‘why’ and ‘how’, not stopping with the ‘what’.
Does not allow outdated/deprecated information to go un-corrected.
Influence and Maturity:
Shares gained knowledge readily with the team, either by creating issues that provide context for anyone to understand it or by writing Confluence articles.
Contributes to the hiring process by being part of the interview team to qualify SRE candidates
Senior Site Reliability Engineer I/II
Are experienced Site Reliability Engineer IIs who meet the following criteria:
Technical:
Deep knowledge in 2+ areas of expertise and general knowledge of all areas of expertise. Capable of mentoring SRE-Is in all areas and other SREs in their area of deep knowledge.
Are able to design and build tools to improve the management of the production environment and/or infrastructure
Are able to contribute small improvement PRs to the Globality codebase to resolve issues
Execution:
Identifies significant projects that result in substantial cost savings or revenue
Identifies changes for the product architecture from the reliability, performance, and availability perspective with a data-driven approach.
Proactively works on efficiency and capacity planning to set clear requirements and reduce system resource usage, making Globality cheaper to run.
Identifies parts of the system that do not scale, provides immediate palliative measures, and drives long-term resolution of the resulting incidents.
Identifies Service Level Indicators (SLIs) that will align the team to meet the availability and latency objectives.
Collaboration and Communication:
Know a domain really well and radiate that knowledge through recorded demos, discussions in ProdEng design meetings, or Incident/Root-Cause Reviews
Perform and run blameless RCAs on incidents and outages, aggressively looking for answers that will prevent the incident from ever happening again.
Influence and Maturity:
Sets an example for the team of SREs through positive and inclusive leadership and discussion of work.
Contributes to the hiring process by being part of the interview team to qualify SRE candidates
Shows ownership of a major part of the infrastructure.
Trusted to de-escalate conflicts inside the team
Staff Site Reliability Engineer
Are Senior SREs who meet the following criteria:
Technical:
Able to conceptualize, design, and create innovative solutions that push Globality's technical abilities ahead of the curve
Deep knowledge of Globality and 4 areas of expertise. Sufficient knowledge of every area of expertise to mentor and guide other team members in those areas.
Contributes to Globality codebase to resolve issues and add new functionality
Significant modification to open source or major from-scratch tooling to deliver best-of-breed implementation of our production ecosystem.
Execution:
Strives for automation either by coding it or by leading and influencing developers to build systems that are easy to run in production.
Measures the risk of introduced features to plan ahead and improve the infrastructure.
Proposes and drives architectural changes that affect the whole company to solve scaling and performance problems
Leads significant project work for KPI level goals for the team
Communication and Collaboration:
Works with engineers across the whole company, influencing design to create features that will work well in multi-region/multi-cloud, massive-scale implementations
Runs RCAs and epic level planning meetings to get meaningful work scheduled into the plan
Influence and Maturity:
Writes in-depth documentation that shares knowledge and radiates Globality technical strengths
Has a high level of self-awareness
Trusted to de-escalate conflicts inside and outside the team
Routinely has an impact on the broader Engineering organization
Helps to develop other team members into more senior levels and leaders in the team
We are an equal opportunity employer and a participant in the E-Verify program. We believe diversity makes teams better and that discrimination based on race, gender, or anything else is self-defeating.