Staff Site Reliability Engineer - Observability
Motive Group
City of London, London
£68,915 per year
Senior / Staff Site Reliability Engineer - Observability | London (Hybrid)
If you care deeply about building and operating world‑class infrastructure for AI at scale, this one's worth your time.
We're working with a company that builds the backbone powering some of the most demanding AI workloads on the planet. Think large‑scale GPU clusters, global telemetry systems, and distributed training environments used by leading research and enterprise teams.
They're looking for a Senior or Staff SRE with deep experience in observability at massive scale – someone who's tuned Prometheus / Mimir, Loki, or Tempo clusters beyond 100M series or 10 TB/day of logs, and who thrives in highly technical, fast‑moving environments.
You’ll be working on:
- Designing and scaling observability for globally distributed GPU infrastructure
- Building automation that cuts operational toil and improves reliability
- Partnering with platform and infrastructure teams to deliver true visibility across complex AI systems
If you've built or operated telemetry stacks for large‑scale, GPU‑heavy, or multi‑tenant environments – and want to work on cutting‑edge problems in a business growing faster than most can imagine, then this could be your next step.
Location: London (hybrid)
You: 7+ years of experience, expert in observability at scale, low ego, high ownership.
Comp: 150-200k + 1-2X salary in equity
- Location: City of London, England, United Kingdom
- Salary: £100,000 - £125,000
- Job Type: Full-time
- Category: Engineering