AI-Powered Incident Investigation: How It Works
Traditional incident response is reactive. An alert fires, a human gets paged, and the investigation begins from scratch. OpenSRE changes this by deploying AI agents that investigate incidents the moment they're detected.
The Investigation Pipeline
When an alert reaches OpenSRE, here's what happens:
1. Context Gathering
The system pulls in all available context: the alert payload, recent deployments, service health metrics, and any related incidents from its episodic memory. It already knows the topology of your system through its knowledge graph.
2. Planning
A planner agent analyzes the context and decides which investigation paths to pursue. It selects from 46 available investigation skills based on the alert type, affected services, and what's worked in similar situations before.
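As a rough illustration of the planning step, the sketch below filters a skill registry by alert type and ranks the matches by how often they have helped before. Everything in it is an assumption made for illustration: the `Skill` fields, the skill names, the success rates, and the `plan` function are hypothetical and not OpenSRE's actual planner or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    """Hypothetical registry entry for one investigation skill."""
    name: str
    alert_types: frozenset[str]   # alert types this skill applies to
    past_success_rate: float      # illustrative value, not real data


def plan(skills: list[Skill], alert_type: str, max_paths: int = 2) -> list[str]:
    """Pick the applicable skills with the best historical track record."""
    applicable = [s for s in skills if alert_type in s.alert_types]
    applicable.sort(key=lambda s: s.past_success_rate, reverse=True)
    return [s.name for s in applicable[:max_paths]]


# A made-up registry of four skills (the real system has 46).
SKILLS = [
    Skill("check_pod_status", frozenset({"crashloop", "high_latency"}), 0.7),
    Skill("query_error_rate", frozenset({"high_latency", "error_spike"}), 0.9),
    Skill("inspect_disk_usage", frozenset({"disk_full"}), 0.8),
    Skill("read_app_logs", frozenset({"high_latency", "crashloop", "error_spike"}), 0.5),
]

selected = plan(SKILLS, "high_latency")
print(selected)
```

A real planner would also weigh the affected services and the retrieved past episodes; this sketch keeps only the alert type to stay short.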
3. Parallel Investigation
Multiple investigation subagents fan out to gather evidence simultaneously. One might check Kubernetes pod status while another queries Prometheus metrics and a third reads application logs. Because the checks run concurrently, the investigation takes roughly as long as the slowest check rather than the sum of all of them.
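The fan-out pattern can be sketched with Python's `asyncio`. The three coroutines below are hypothetical stand-ins for subagents: they sleep briefly instead of calling Kubernetes, Prometheus, or a log store, and their findings are placeholders. How OpenSRE actually schedules its subagents is not specified here, so treat this as one possible shape.

```python
import asyncio


# Hypothetical subagents. Each simulates I/O latency and returns a placeholder.
async def check_pods(service: str) -> dict:
    await asyncio.sleep(0.01)
    return {"source": "kubernetes", "finding": f"placeholder pod status for {service}"}


async def query_metrics(service: str) -> dict:
    await asyncio.sleep(0.01)
    return {"source": "prometheus", "finding": f"placeholder metrics for {service}"}


async def read_logs(service: str) -> dict:
    await asyncio.sleep(0.01)
    return {"source": "logs", "finding": f"placeholder log excerpt for {service}"}


async def investigate(service: str) -> list[dict]:
    # gather() runs the coroutines concurrently and returns results in the
    # order the coroutines were passed. With return_exceptions=True, a failing
    # subagent yields an exception object instead of cancelling the others.
    results = await asyncio.gather(
        check_pods(service),
        query_metrics(service),
        read_logs(service),
        return_exceptions=True,
    )
    # Keep only the successful findings.
    return [r for r in results if not isinstance(r, Exception)]


findings = asyncio.run(investigate("checkout"))
for f in findings:
    print(f["source"], "->", f["finding"])
```

Dropping failed subagents rather than aborting is a design choice made for this sketch: partial evidence is usually more useful to the synthesis step than none.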
4. Synthesis
A synthesizer agent collects all findings, correlates the evidence, and identifies the most likely root cause. It considers multiple hypotheses and weighs the evidence for each.
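One simple way to picture "weighing the evidence" is additive scoring: each finding adds to or subtracts from the score of a hypothesis, and the hypotheses are then ranked. This is a deliberately naive stand-in, not OpenSRE's synthesis method; the hypothesis names and weights below are invented for illustration.

```python
from collections import defaultdict


def rank_hypotheses(evidence: list[tuple[str, float]]) -> list[tuple[str, float]]:
    """Sum signed evidence weights per hypothesis and sort best-first.

    Positive weights support a hypothesis; negative weights count against it.
    """
    scores: dict[str, float] = defaultdict(float)
    for hypothesis, weight in evidence:
        scores[hypothesis] += weight
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Illustrative (hypothesis, weight) pairs, one per finding.
evidence = [
    ("bad deploy", 0.6),            # errors began right after a rollout
    ("database saturation", 0.3),   # slow queries seen in logs
    ("bad deploy", 0.5),            # only the new pods are failing
    ("database saturation", -0.2),  # database CPU looks normal
]

ranked = rank_hypotheses(evidence)
most_likely, score = ranked[0]
print(most_likely)
```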
5. Report Generation
The final output is a structured incident report: a timeline of events, root cause analysis, blast radius assessment, and recommended remediation steps.
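The four parts of the report map naturally onto a small data structure. The field names and the JSON serialization below are assumptions for illustration; the actual report schema is not given in this article, and the sample values are placeholders.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class TimelineEvent:
    timestamp: str      # ISO 8601 string, kept as text for easy serialization
    description: str


@dataclass
class IncidentReport:
    """Hypothetical report shape mirroring the four parts listed above."""
    timeline: list[TimelineEvent]
    root_cause: str
    blast_radius: list[str]        # affected services
    remediation_steps: list[str]


report = IncidentReport(
    timeline=[
        TimelineEvent("2024-01-01T10:00:00Z", "placeholder: deploy started"),
        TimelineEvent("2024-01-01T10:05:00Z", "placeholder: alert fired"),
    ],
    root_cause="placeholder root cause",
    blast_radius=["checkout", "payments"],
    remediation_steps=["placeholder: roll back the deploy"],
)

# asdict() converts nested dataclasses recursively, so the result is JSON-ready.
report_json = json.dumps(asdict(report), indent=2)
print(report_json)
```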
Why Memory Matters
The key differentiator is episodic memory. After each investigation, OpenSRE stores the full episode: what was investigated, what was found, and what the resolution was. When a similar incident occurs, the system retrieves the relevant episodes and applies what it learned from them.
This is how experienced SREs operate. They don't start from zero each time. They recognize patterns, remember past incidents, and know which diagnostic steps are most likely to be productive. OpenSRE codifies this process.
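To make the retrieval idea concrete, here is a minimal sketch that stores episodes with descriptive tags and finds similar ones by Jaccard overlap. The article does not say how OpenSRE measures similarity, so Jaccard is a simple stand-in, and the `Episode` fields, tags, and resolutions are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    """Hypothetical stored episode: descriptive tags plus what fixed it."""
    tags: frozenset[str]
    resolution: str


def jaccard(a: frozenset[str], b: frozenset[str]) -> float:
    """Size of the intersection divided by size of the union (0.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve(memory: list[Episode], query_tags: frozenset[str], k: int = 2) -> list[Episode]:
    """Return up to k past episodes with non-zero overlap, most similar first."""
    scored = [(jaccard(ep.tags, query_tags), ep) for ep in memory]
    scored = [(s, ep) for s, ep in scored if s > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [ep for _, ep in scored[:k]]


# Made-up memory of three past incidents.
memory = [
    Episode(frozenset({"checkout", "high-latency", "post-deploy"}), "rolled back deploy"),
    Episode(frozenset({"search", "disk-full"}), "expanded volume"),
    Episode(frozenset({"checkout", "oom-kill"}), "raised memory limit"),
]

matches = retrieve(memory, frozenset({"checkout", "high-latency"}))
print([ep.resolution for ep in matches])
```

A production system would more likely compare embeddings of the full episode text, but the control flow (score, filter, rank, take the top k) would look much the same.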
Built for Your Stack
OpenSRE integrates with the tools you already use: Prometheus, Grafana, Datadog, Elastic, Splunk, PagerDuty, Slack, Kubernetes, and more. It meets your infrastructure where it is.