What is OpenSRE?
OpenSRE is an open-source AI platform that investigates production incidents the way an experienced SRE would — but faster and around the clock.
The Problem
When a production incident hits, engineers scramble to figure out what went wrong. They check dashboards, grep through logs, trace requests, and piece together a timeline. This process is slow and stressful, and it depends heavily on tribal knowledge.
How OpenSRE Helps
OpenSRE automates this investigation process. When an alert fires, OpenSRE's AI agents:
- Gather context from your monitoring tools — Prometheus, Grafana, Datadog, Elastic, and more
- Investigate systematically using 46 built-in investigation skills
- Learn from past incidents through episodic memory, getting better over time
- Map service dependencies via a knowledge graph powered by Neo4j
- Produce a detailed report with root cause analysis, timeline, and remediation steps
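To make the last step concrete, here is a minimal sketch of what a report with a root cause, a timeline, and remediation steps could look like as a data structure. The `IncidentReport` and `TimelineEvent` classes, their fields, and the sample values are all hypothetical and made up for illustration; they are not OpenSRE's actual report schema.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class TimelineEvent:
    """One timestamped observation (hypothetical shape)."""
    at: datetime
    description: str


@dataclass
class IncidentReport:
    """Hypothetical report container, not OpenSRE's real schema."""
    title: str
    root_cause: str
    timeline: list = field(default_factory=list)
    remediation: list = field(default_factory=list)

    def render(self):
        """Render the report as plain text, with the timeline in chronological order."""
        lines = [self.title, f"Root cause: {self.root_cause}", "Timeline:"]
        for event in sorted(self.timeline, key=lambda e: e.at):
            lines.append(f"  {event.at:%H:%M} {event.description}")
        lines.append("Remediation:")
        lines.extend(
            f"  {n}. {step}" for n, step in enumerate(self.remediation, start=1)
        )
        return "\n".join(lines)


# Made-up example values, entered out of order to show the sorting.
report = IncidentReport(
    title="Checkout latency spike",
    root_cause="connection pool exhausted",
    timeline=[
        TimelineEvent(datetime(2024, 1, 1, 10, 5), "alert fired"),
        TimelineEvent(datetime(2024, 1, 1, 9, 58), "deploy finished"),
    ],
    remediation=["raise the pool limit", "add a saturation alert"],
)
text = report.render()
```

Sorting at render time means evidence can be appended in whatever order it is discovered while the reader still sees a chronological timeline.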
Key Features
Episodic Memory
Unlike stateless AI tools, OpenSRE remembers past investigations. When a similar incident occurs, it recalls what worked before — the same way a senior engineer builds intuition over years of on-call experience.
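The idea of recalling similar past incidents can be sketched with a simple similarity search. This example is an assumption for illustration only: the `Episode` record, the tag-based Jaccard score, and the `min_score` threshold are hypothetical and are not taken from OpenSRE's implementation, which may use a different retrieval method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    """A past investigation (hypothetical record shape)."""
    summary: str
    tags: frozenset
    resolution: str


def jaccard(a, b):
    """Overlap between two tag sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recall(memory, alert_tags, min_score=0.3):
    """Return (score, episode) pairs at or above min_score, most similar first."""
    scored = [(jaccard(ep.tags, alert_tags), ep) for ep in memory]
    hits = [pair for pair in scored if pair[0] >= min_score]
    return sorted(hits, key=lambda pair: pair[0], reverse=True)


# Made-up episodes standing in for stored investigations.
memory = [
    Episode(
        "checkout latency spike",
        frozenset({"checkout", "latency", "postgres"}),
        "raised the connection pool limit",
    ),
    Episode(
        "auth pods crash-looping",
        frozenset({"auth", "kubernetes", "oom"}),
        "increased the memory limit",
    ),
]

# A new alert shares two of four distinct tags with the first episode.
matches = recall(memory, frozenset({"checkout", "latency", "redis"}))
```

Here only the checkout episode is recalled, with a score of 0.5, so its resolution can be offered as a starting hypothesis for the new incident.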
Knowledge Graph
OpenSRE maintains a live graph of your service topology. It knows which services depend on which, what changed recently, and how failures propagate through your system.
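How failures propagate through a dependency graph can be illustrated with a breadth-first walk over reversed edges. This is a sketch under stated assumptions: the `DEPENDS_ON` topology and the `impacted_by` helper are hypothetical, and a plain in-memory dictionary stands in for the graph rather than Neo4j or any OpenSRE API.

```python
from collections import defaultdict, deque

# Hypothetical topology: each service maps to the services it calls.
DEPENDS_ON = {
    "web": ["checkout", "auth"],
    "checkout": ["payments", "postgres"],
    "payments": ["postgres"],
    "auth": ["redis"],
}


def impacted_by(failing, depends_on):
    """Return services that directly or transitively depend on `failing`, nearest first."""
    # Reverse the edges so we can walk from a dependency to its callers.
    dependents = defaultdict(list)
    for service, deps in depends_on.items():
        for dep in deps:
            dependents[dep].append(service)

    seen = {failing}
    order = []
    queue = deque([failing])
    while queue:
        current = queue.popleft()
        for caller in dependents[current]:
            if caller not in seen:
                seen.add(caller)
                order.append(caller)
                queue.append(caller)
    return order


print(impacted_by("postgres", DEPENDS_ON))  # ['checkout', 'payments', 'web']
```

Breadth-first order lists direct callers before indirect ones, which matches how an on-call engineer would usually check the blast radius.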
46 Investigation Skills
OpenSRE has a growing library of investigation skills, from checking Kubernetes pod status to analyzing Prometheus metrics to reading Sentry error traces, and it selects among them based on the incident context.
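Selecting skills from incident context can be sketched as matching alert signals against a skill registry. Everything here is hypothetical: the `Skill` record, the skill names, the signal keywords, and the overlap-count ranking are assumptions for illustration, not OpenSRE's actual registry or selection logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    """A named investigation step and the signals that make it relevant (hypothetical)."""
    name: str
    signals: frozenset


# Hypothetical registry loosely modeled on the three examples in the text.
SKILLS = [
    Skill("check_pod_status", frozenset({"kubernetes", "pod", "crashloop"})),
    Skill("analyze_prometheus_metrics", frozenset({"latency", "error_rate", "saturation"})),
    Skill("read_sentry_traces", frozenset({"exception", "error_rate"})),
]


def select_skills(context, skills=SKILLS):
    """Return names of skills whose signals overlap the context, best match first."""
    scored = [(len(skill.signals & context), skill.name) for skill in skills]
    matched = [(score, name) for score, name in scored if score > 0]
    # Highest overlap first; ties are broken alphabetically for stable output.
    matched.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in matched]


selected = select_skills({"error_rate", "exception", "checkout"})
print(selected)  # ['read_sentry_traces', 'analyze_prometheus_metrics']
```

Skills with no matching signal are skipped entirely, so an alert about errors never triggers the pod-status check in this sketch.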
Open Source
OpenSRE is fully open source under the Apache 2.0 license. Self-host it on your own infrastructure, and your data stays with you.