Software Engineering Manager, Site Reliability, Cloud Incident Response


Overview

Software Engineering Manager, Site Reliability, Cloud Incident Response – Google Cloud, London, UK.

Responsibilities

  • Participate in on-call rotation supporting critical incident response for GCP.
  • Focus on high-quality customer outcomes and continued collaboration across GCP teams.
  • Create Incident Management at Google (IMG) training and processes for incident management life cycle, partnering with Cloud SRE Uber Tech Leads and Cloud Support leadership.
  • Build systems and tooling to support the team, improve visibility, detect issues, and communicate with customers, stakeholders, and customer-facing teams.
  • Define and escalate risks in Cloud, reducing incident probabilities with strategic and pragmatic approaches as needed.

Qualifications

  • Bachelor's degree or equivalent practical experience.
  • 3 years of experience in a technical leadership role overseeing projects, with 2 years of experience in a people management or supervision/team leadership role.
  • Experience with cloud services, telemetry systems and incident response.
  • Master's degree or PhD in Computer Science, or a related technical field (preferred).
  • Experience as a cloud customer (preferred).

About the job

Site Reliability Engineering (SRE) combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. SRE ensures that Google Cloud's services—both our internally critical and externally visible systems—have reliability, uptime appropriate to customer needs, and a fast rate of improvement. SREs monitor system capacity and performance.

Much of our software development focuses on optimizing existing systems, building infrastructure, and eliminating work through automation. On the SRE team, you’ll manage the complex challenges of scale unique to Google Cloud, while applying your expertise in coding, algorithms, complexity analysis, and large-scale system design. Our culture emphasizes intellectual curiosity, problem solving, collaboration, and a blame-free environment that supports learning and growth.

The Cloud Incident Response Team supports responders, tooling, and outcomes for Google Cloud Platform (GCP) major incidents. The team collaborates across GCP products, customer-facing teams, and a wide range of stakeholders to coordinate, mitigate, or resolve issues across all of GCP.

Google Cloud accelerates every organization’s ability to digitally transform its business and industry. We deliver enterprise-grade solutions and tools that help developers build more sustainably. Customers in more than 200 countries and territories rely on Google Cloud as their trusted partner to enable growth and solve critical business problems.

Google is proud to be an equal opportunity and affirmative action employer. We are committed to building a workforce representative of the users we serve, creating a culture of belonging, and providing equal employment opportunities regardless of race, creed, color, religion, gender, sexual orientation, gender identity/expression, national origin, disability, age, genetic information, veteran status, marital status, pregnancy or related condition, criminal histories consistent with legal requirements, or any other basis protected by law. See also Google’s EEO Policy, Know your rights: workplace discrimination is illegal, Belonging at Google, and How we hire.

Google is a global company and, to facilitate efficient collaboration and communication globally, English proficiency is a requirement for all roles unless stated otherwise in the job posting.

To all recruitment agencies: Google does not accept agency resumes. Please do not forward resumes to our jobs alias, Google employees, or any other organization location. Google is not responsible for any fees related to unsolicited resumes.

Location: London, England, United Kingdom
Salary: £200,000+
Job Type: Full-time
Category: IT & Technology
