Incident Manager
New Today
Responsibilities:
- Serve as the primary technical point of contact during critical incidents, ensuring rapid resolution and minimal business impact.
- Lead and coordinate cross-functional teams (engineering, support, operations) during incident response, including root cause analysis, mitigation strategies, and post-mortem reviews.
- Monitor service health using tools such as CloudWatch, OpenSearch, Kibana, and Grafana, and proactively identify potential issues before they impact customers.
- Troubleshoot and debug production issues in web architecture, microservices, and cloud environments.
- Manage and maintain system reliability by implementing best practices in observability, monitoring, and alerting.
- Collaborate closely with Software Development, Infrastructure, and Operations teams to improve incident response processes and system resilience.
- Manage incidents related to AWS services such as EC2, S3, RDS, DynamoDB, Aurora, Redis, Memcache, Kafka, SNS, SQS, OpenSearch, and Elasticsearch.
- Use Agile tools (Jira, Confluence) to track incident tickets, document resolutions, and maintain a clear audit trail.
- Oversee system and application deployments, supporting automation pipelines (Jenkins, Git).
- Perform Linux/Unix administration tasks as needed during incident investigation and resolution.
- Continuously update and refine incident response playbooks, runbooks, and SOPs.
- Provide regular incident reports to leadership, including root cause analysis and long-term corrective actions.
Requirements:
- Proven experience as an Incident Manager, Site Reliability Engineer (SRE), or Technical Operations Lead in cloud-native and microservices-based environments.
- Strong understanding of web architecture and microservices development principles.
- Deep hands-on experience with AWS Cloud Services: Compute (EC2, Lambda), Storage (S3), Databases (DynamoDB, RDS, Aurora), Messaging (Kafka, SNS, SQS), Caching (Redis, Memcache), Search (OpenSearch, Elasticsearch).
- Expertise in Agile and delivery tools: Jira, Confluence, Git, and Jenkins.
- Strong Linux / Unix system administration skills, including troubleshooting and performance tuning.
- Strong analytical skills with expertise in debugging complex distributed system issues.
- Experience with monitoring and observability tools like CloudWatch, Grafana, Nagios, and Kibana.
- Excellent communication and leadership skills to manage cross-functional incident response teams.
- Experience in writing detailed post-incident reports and driving continuous improvement.
- Strong scripting skills (Python, Bash, or similar) to automate diagnostic or remediation tasks.
Location: London, England, United Kingdom
Salary: £100,000 - £125,000
Job Type: Full-time
Category: Management & Operations