Senior Lead Site Reliability Engineer / Expert / Specialist (DevOps & Automation)

This role is at SITA.

We’re the team that keeps airports moving, airlines flying smoothly, and borders open. Our technology and communication innovations are the secret behind the success of the world’s air travel industry.

Purpose

Provide proactive support for products to maintain high performance and drive continuous improvement. Identify and resolve the root causes of operational incidents, and implement solutions that improve stability and prevent recurrence. Manage the event catalog, and develop remediation approaches and automated workflows. Oversee the deployment of IT services and solutions with minimal disruption, with a focus on operational automation and integration that improves efficiency and collaboration between development and operations.

What You Will Do

  • Define, build, and maintain support systems to ensure high availability and performance.
  • Handle complex cases for the PSO.
  • Build events to add to the event catalog for the relevant product or application.
  • Implement automation for system provisioning, self‑healing, auto recovery, deployment, and monitoring.
  • Perform incident response and root cause analysis for critical system failures.
  • Monitor system performance and establish service‑level indicators (SLIs) and service‑level objectives (SLOs).
  • Collaborate with development and operations to integrate reliability best practices, including moving to zero‑downtime architecture.
  • Proactively identify and remediate performance issues.
  • Conduct thorough problem investigations and root cause analyses (RCA) to diagnose recurring incidents and service disruptions.
  • Coordinate with incident management teams and collaborate with PSO and Engineering teams to develop and implement permanent solutions.
  • Monitor the effectiveness of problem‑resolution activities, provide regular reports on problem‑management activities, and ensure continuous improvement.
  • Define and maintain an event catalog, specifying active events, thresholds, and relevant remediation; optimize it for efficiency.
  • Develop event response protocols, provide training to teams, and ensure quick and efficient handling of incidents.
  • Collaborate with stakeholders to define events, ensure coverage across the PSO, and drive improvements based on post‑event reviews and feedback.
  • Own the quality of deployments for the PSO, ensuring that a clear process is defined and responsibilities are assigned for smooth implementation.
  • Develop and maintain deployment schedules, conduct operational readiness assessments, and manage deployment risk assessments to ensure service stability.
  • Oversee the execution of deployment plans, coordinate resources, communicate with stakeholders, and continuously improve deployment processes based on feedback.
  • Manage continuous integration and deployment (CI/CD) pipelines, ensuring smooth integration between development and operational teams.

Experience, Knowledge & Skills

  • Several years of experience in IT operations, service management, or infrastructure management, including roles such as Site Reliability Engineer, Problem Manager, or DevOps Manager.
  • Proven experience in managing high‑availability systems and ensuring operational reliability.
  • Extensive experience in root cause analysis (RCA), incident management, and developing permanent solutions for recurring service disruptions.
  • Hands‑on experience with CI/CD pipelines, automation, system performance monitoring, and infrastructure‑as‑code implementation.
  • Strong background in collaborating with cross‑functional teams (development, operations, engineering, etc.) to improve operational processes and service delivery.
  • Experience in managing deployments, risk assessments, and optimizing event and problem‑management processes.
  • Familiarity with cloud technologies, containerization, and scalable architecture, including zero‑downtime deployment strategies.

Functional Skills

  • Collaboration
  • Stakeholder Management
  • Service Design
  • Project Management
  • Communication
  • Compliance & Risk Management
  • Problem Solving
  • Incident Management
  • Change Management
  • Innovation

Technical Skills

  • Cloud Infrastructure
  • Automation & AI
  • Operations Monitoring & Diagnostics
  • Deployment
  • Programming & Scripting Languages

Educational Background

Bachelor’s degree in Computer Science, Information Technology, Engineering, or a related field.

An advanced degree (Master’s or equivalent) is often preferred for senior positions.

Qualifications

  • Relevant certifications such as ITIL, CompTIA Security+, or Certified Kubernetes Administrator (CKA).
  • Certifications in cloud platforms (AWS, Azure, Google Cloud) or DevOps methodologies (e.g., Certified DevOps Professional).
  • Certifications in specific tools like ServiceNow, Jira, or other relevant software platforms.

What We Offer

  • Flex Week: Work from home up to 2 days/week (depending on your team's needs).
  • Flex Day: Make your workday suit your life and plans.
  • Flex‑Location: Take up to 30 days a year to work from any location in the world.
  • Employee Wellbeing: Employee Assistance Program (EAP) for you and your dependents 24/7, 365 days/year, plus Champion Health platform.
  • Professional Development: Level up your skills with training platforms, including LinkedIn Learning!
  • Competitive Benefits: Benefits that match the local market and your employment status.

SITA is an Equal Opportunity Employer. We value a diverse workforce and encourage women, Aboriginal people, and members of visible ...

Location: Reading, England, United Kingdom
Salary: £100,000 - £125,000
Job Type: Full-time
Category: Engineering