Staff Site Reliability Engineer - Observability

Motive Group
City of London, London

£68,915 per year

Senior / Staff Site Reliability Engineer - Observability | London (Hybrid)

If you care deeply about building and operating world‑class infrastructure for AI at scale, this one's worth your time.

We're working with a company that builds the backbone powering some of the most demanding AI workloads on the planet. Think large‑scale GPU clusters, global telemetry systems, and distributed training environments used by leading research and enterprise teams.

They're looking for a Senior or Staff SRE with deep experience in observability at massive scale – someone who has tuned Prometheus / Mimir, Loki, or Tempo clusters beyond 100M series or 10 TB/day of logs, and who thrives in highly technical, fast‑moving environments.

You’ll be working on:

  • Designing and scaling observability for globally distributed GPU infrastructure
  • Building automation that cuts operational toil and improves reliability
  • Partnering with platform and infrastructure teams to deliver true visibility across complex AI systems

If you've built or operated telemetry stacks for large‑scale, GPU‑heavy, or multi‑tenant environments – and want to work on cutting‑edge problems in a business growing faster than most can imagine, then this could be your next step.

Location: London (hybrid)

You: 7+ years of experience, expert in observability at scale, low ego, high ownership.

Comp: 150-200k + 1-2X salary in equity

Location: City of London, England, United Kingdom
Salary: £100,000 - £125,000
Job Type: Full-time
Category: Engineering
