Site Reliability Engineer


Overview

bet365
Stoke-On-Trent, England, United Kingdom

Site Reliability Engineer position at bet365 in Stoke-On-Trent, United Kingdom. bet365 is a leading online gambling company with a global presence and a focus on reliability and innovation in software. This role emphasizes improving system reliability, observability, and incident resolution through engineering practices.

Responsibilities

As a Site Reliability Engineer, you will:

- Enhance system reliability, observability, and performance through an engineering-driven approach.
- Monitor the health, performance, and availability of critical systems and directly impact operational efficiency.
- Implement solutions that improve reliability, including instrumenting services with tools such as OpenTelemetry, improving logging practices, and developing maintainable features.
- Develop tools and automation for effective service management.
- Collaborate across multiple functions to integrate reliability and observability best practices into the software development lifecycle.
- Support governance standards set by central teams to ensure reliability principles are embedded in development.
- Contribute to ensuring systems meet user demands and enhance overall service performance.
- Participate in the company’s hybrid working from home policy where applicable.

Qualifications

- Excellent knowledge of Site Reliability Engineering principles, including the creation and management of SLIs and SLOs for reliability and customer satisfaction.
- Experience with modern observability tools and practices (e.g., Splunk, New Relic, Grafana, PagerDuty).
- Experience with modern software development techniques and lifecycles.
- Experience with Infrastructure as Code (IaC) automation and orchestration tools (e.g., Ansible, Terraform).
- Prior experience in a large-scale, 24/7 enterprise where uptime and stability are critical.
- Keen interest in industry trends, particularly Platform Engineering.
- Proficiency in shell scripting for automation and system management tasks.

Additional Information

- Contribute to code that enhances reliability and observability, including telemetry and tooling.
- Develop and maintain tools to improve operational efficiency and resilience.
- Use automation and orchestration platforms to reduce toil and manual activity.
- Build dashboards using telemetry data and technologies like Grafana, Splunk, and New Relic.
- Maintain and administer existing monitoring and analytics toolsets.
- Mentor colleagues in new technologies or practices.
- Participate in live incident resolution and post-mortem analyses with remediation strategies to prevent recurrence.
- Drive initiatives to enhance system reliability and observability and contribute to a culture of continuous improvement.
- Collaborate with central SRE and Observability teams to uphold reliability standards and assist teams in adherence.
- Work with IT Operations to support tooling that delivers business value.

By applying to bet365, you agree to share your Personal Data in accordance with our Recruitment Privacy Notice - https://www.bet365careers.com/privacy-policy

Bet365 is committed to creating an inclusive environment where everyone can grow and develop. If you need adjustments or accommodations during the recruitment process, please reach out.

Location:
Stoke-On-Trent, England, United Kingdom
Job Type:
Full Time
