Introduction

OpenSRE is an open-source AI SRE platform that investigates production incidents autonomously. When an alert fires, OpenSRE's AI agents gather context from your observability stack, reason about root causes, and produce a detailed incident report — the way an experienced SRE would, but faster and around the clock.

Who is OpenSRE for?

OpenSRE is built for teams that are tired of manual, repetitive incident investigation:

  • Platform engineers and SREs who spend too much time on routine investigations
  • On-call engineers who need help at 3 AM, when fatigue is highest
  • Engineering managers who want to reduce MTTR and reliance on tribal knowledge
  • DevOps teams building their observability practice

Key Capabilities

Autonomous Incident Investigation

When an alert fires, OpenSRE's planner agent breaks the investigation into parallel subtasks. Multiple investigation subagents execute simultaneously, each querying a different data source, such as Prometheus metrics, Kubernetes pod status, application logs, or distributed traces. A synthesizer agent then combines the findings, and a writeup agent produces a structured report.
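The fan-out and fan-in flow described above can be sketched in a few lines of Python. This is an illustration only, not OpenSRE's implementation: the names `plan`, `investigate`, `synthesize`, and `run_investigation` are hypothetical, the subagents are stubs that return canned text, and plain `asyncio` stands in for the real orchestration layer.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Finding:
    source: str
    summary: str


def plan(alert: dict) -> list[str]:
    # Hypothetical planner: one subtask per data source named in the text.
    return ["prometheus", "kubernetes", "logs", "traces"]


async def investigate(source: str, alert: dict) -> Finding:
    # Stand-in for a subagent; a real one would query the data source here.
    await asyncio.sleep(0)
    return Finding(source, f"no anomaly found for {alert['service']} in {source}")


def synthesize(findings: list[Finding]) -> str:
    # Stand-in for the synthesizer and writeup steps: merge findings into one report.
    lines = [f"- {f.source}: {f.summary}" for f in findings]
    return "Incident report\n" + "\n".join(lines)


async def run_investigation(alert: dict) -> str:
    subtasks = plan(alert)
    # gather() runs the subagents concurrently and preserves subtask order.
    findings = await asyncio.gather(*(investigate(s, alert) for s in subtasks))
    return synthesize(list(findings))


if __name__ == "__main__":
    print(asyncio.run(run_investigation({"service": "checkout"})))
```

Because `asyncio.gather` returns results in the order the subtasks were submitted, the report stays stable even when subagents finish at different times.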

Episodic Memory System

OpenSRE remembers past investigations. After every incident, it extracts key metadata (root cause, affected services, alert type, and severity) and stores it in its episodic memory. When a similar incident occurs, OpenSRE retrieves relevant past episodes and uses them to guide the new investigation. This mirrors how a senior SRE builds intuition over years of on-call experience.
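As a rough illustration of store-and-recall over incident metadata, the sketch below keeps episodes in a list and ranks them with a simple similarity score. The `Episode` fields follow the metadata listed above. The `EpisodicMemory` class and the scoring rule (service overlap plus a bonus for a matching alert type) are assumptions made for this example, not OpenSRE's actual storage or retrieval logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    root_cause: str
    affected_services: frozenset[str]
    alert_type: str
    severity: str


def similarity(a: Episode, b: Episode) -> float:
    # Assumed scoring: Jaccard overlap of affected services, plus a bonus
    # for a matching alert type. Not OpenSRE's actual ranking.
    union = a.affected_services | b.affected_services
    overlap = len(a.affected_services & b.affected_services) / len(union) if union else 0.0
    return overlap + (1.0 if a.alert_type == b.alert_type else 0.0)


class EpisodicMemory:
    def __init__(self) -> None:
        self._episodes: list[Episode] = []

    def store(self, episode: Episode) -> None:
        self._episodes.append(episode)

    def recall(self, current: Episode, k: int = 3) -> list[Episode]:
        # Return the k most similar past episodes, best match first.
        ranked = sorted(self._episodes, key=lambda e: similarity(e, current), reverse=True)
        return ranked[:k]


memory = EpisodicMemory()
memory.store(Episode("connection pool exhausted", frozenset({"checkout", "payments"}), "HighLatency", "critical"))
memory.store(Episode("expired TLS certificate", frozenset({"gateway"}), "HighErrorRate", "critical"))

# A new incident with no known root cause yet.
current = Episode("", frozenset({"checkout"}), "HighLatency", "warning")
best = memory.recall(current, k=1)[0]
```

Here `best` is the connection pool episode, since it shares both a service and the alert type with the new incident. A production system would more likely use a database or embedding search than an in-memory list.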

Knowledge Graph

OpenSRE maintains a live graph of your service topology in Neo4j. It knows which services depend on which, tracks recent deployments, and can perform blast radius analysis: given a failing component, it determines which services are affected. This context is automatically provided to investigation agents.
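Blast radius analysis amounts to walking the dependency graph backwards from the failing component. The sketch below shows the idea with a plain Python dictionary and a breadth-first search. It is an in-memory stand-in rather than a Neo4j query, and the `DEPENDS_ON` topology and the `blast_radius` function are hypothetical.

```python
from collections import deque

# Hypothetical topology: each service maps to the services it depends on.
DEPENDS_ON = {
    "web": ["checkout", "search"],
    "checkout": ["payments", "postgres"],
    "search": ["elasticsearch"],
    "payments": ["postgres"],
}


def blast_radius(failing: str, depends_on: dict[str, list[str]]) -> set[str]:
    # Invert the edges so we can walk from a dependency to its dependents.
    dependents: dict[str, list[str]] = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    # Breadth-first search over dependents; the set doubles as the visited list.
    affected: set[str] = set()
    queue = deque([failing])
    while queue:
        node = queue.popleft()
        for parent in dependents.get(node, []):
            if parent not in affected:
                affected.add(parent)
                queue.append(parent)
    return affected


print(sorted(blast_radius("postgres", DEPENDS_ON)))  # ['checkout', 'payments', 'web']
```

In this example a `postgres` failure reaches `checkout` and `payments` directly and `web` transitively, while `search` is unaffected.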

46 Investigation Skills

OpenSRE comes with 46 built-in investigation skills: checking Kubernetes pod health, querying Prometheus for anomalies, analyzing Grafana dashboards, reading Datadog traces, scanning Sentry errors, and more. Skills are loaded on demand based on the incident context.

Integrations

Works with: Prometheus, Grafana, Datadog, Elastic/ELK, Splunk, Jaeger, New Relic, Sentry, PagerDuty, Slack, GitHub, Confluence, and Kubernetes.

Architecture at a Glance

OpenSRE uses a graph-based agent orchestration system built on LangGraph. Alerts enter via Slack (through the Slack bot) or directly through the web console. The sre-agent processes investigations and streams results via Server-Sent Events.

Slack  →  slack-bot  →  sre-agent (LangGraph)
Web UI ────────────→         │
                         ┌───┴───┐
                         │   │   │
                      Memory Skills KG
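For readers unfamiliar with Server-Sent Events, the sketch below shows what a stream of investigation results could look like on the wire. The framing (an `event:` line, a `data:` line, and a blank line) is standard SSE. The event names `finding` and `done`, the payload shape, and the function names are assumptions for illustration, not the sre-agent's actual API.

```python
import json
from typing import Iterable, Iterator


def sse_event(event: str, data: dict) -> str:
    # One SSE frame: an "event:" line, a "data:" line, then a blank line.
    # json.dumps emits no raw newlines by default, so one data line suffices.
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


def stream_investigation(steps: Iterable[tuple[str, dict]]) -> Iterator[str]:
    # Yield one frame per investigation step as it becomes available,
    # then a final (hypothetical) "done" event so clients can close the stream.
    for event, payload in steps:
        yield sse_event(event, payload)
    yield sse_event("done", {})


for frame in stream_investigation([("finding", {"source": "prometheus"})]):
    print(frame, end="")
```

A web framework would send these frames in a response with the `text/event-stream` content type, and a browser client could consume them with `EventSource`.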

For a deep dive into the architecture, see Architecture.

Open Source

OpenSRE is released under the Apache 2.0 license. Self-host it on your own infrastructure. Your data, your control.
