Architecture

OpenSRE is built on three core systems working together: a LangGraph-orchestrated agent pipeline, an episodic memory system, and a Neo4j knowledge graph. Understanding how these interact explains how OpenSRE investigates incidents.

System Overview

Slack  →  slack-bot (Bolt/Socket Mode)  →  sre-agent (LangGraph)
Web UI ──────────────────────────────→         │
                                           ┌────┴────┐
                                           │    │    │
                                        Memory Skills  KG
                                           │         │
                                       PostgreSQL  Neo4j
config-service ← used by web_ui, slack-bot, sre-agent

There are two entry points: Slack (via slack-bot) and the web console (via web_ui). Both receive results streamed from sre-agent over Server-Sent Events.
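To make the streaming step concrete, here is a minimal sketch of how a client could read `data:` payloads from a Server-Sent Events stream. The parsing follows the standard SSE wire format; the JSON payload shape (`node`, `status`) is an assumption for illustration, not the documented sre-agent schema.

```python
import json
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each complete event in an SSE line stream.

    Only `data:` fields are handled; `event:`, `id:` and `retry:` are ignored.
    """
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
        elif line.startswith(":"):
            # Comment line, commonly used as a keep-alive.
            continue
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # The SSE format strips a single optional leading space.
            buffer.append(value[1:] if value.startswith(" ") else value)
    # An event not terminated by a blank line is dropped, as the format requires.


# Hypothetical stream contents; the real event schema may differ.
stream = [
    ": keep-alive\n",
    'data: {"node": "planner", "status": "started"}\n',
    "\n",
    'data: {"node": "writeup", "status": "done"}\n',
    "\n",
]
events = [json.loads(payload) for payload in iter_sse_data(stream)]
```

In a real client, `lines` would come from an open HTTP response rather than a list.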

LangGraph Orchestration

The investigation pipeline is a directed graph with these nodes:

| Node | Role |
|------|------|
| init_context | Parses the alert, loads episodic memory context |
| planner | Breaks the investigation into parallel subtasks |
| subagent_executor | Executes one investigation subtask |
| synthesizer | Combines findings from all subagents |
| writeup | Produces the final incident report |
| memory_store | Stores the episode in episodic memory |

The key architectural decision is the Send() fan-out: the planner emits one Send("subagent_executor", task) per subtask, and the resulting subagent_executor runs execute in parallel. Each subagent has access to 46 investigation skills and runs its subtask independently. Because subtasks run concurrently, the investigation phase takes roughly as long as its slowest subtask rather than the sum of all of them.

Data flow:

Alert → init_context → planner → [Send() fan-out]
    ↓
subagent_executor × N (parallel)
    ↓
synthesizer → writeup → memory_store
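The fan-out and fan-in shape of this flow can be illustrated without LangGraph. The sketch below swaps Send() for a standard-library thread pool, so it shows the pattern rather than OpenSRE's actual graph code. The function names mirror the node table, but their bodies and the task fields are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def planner(alert: Dict[str, str]) -> List[Dict[str, str]]:
    # Placeholder: a real planner would decide the subtasks dynamically.
    return [{"alert": alert["name"], "check": c} for c in ("metrics", "logs", "k8s")]


def subagent_executor(task: Dict[str, str]) -> Dict[str, str]:
    # Placeholder finding; a real subagent would load skills and run checks.
    return {"check": task["check"], "finding": f"no anomaly in {task['check']}"}


def synthesizer(findings: List[Dict[str, str]]) -> str:
    # Sort so the combined result does not depend on completion order.
    ordered = sorted(findings, key=lambda f: f["check"])
    return "; ".join(f["finding"] for f in ordered)


def investigate(alert: Dict[str, str]) -> str:
    tasks = planner(alert)
    # Fan-out: one worker per subtask, all running concurrently.
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        findings = list(pool.map(subagent_executor, tasks))
    # Fan-in: a single step combines every subagent's result.
    return synthesizer(findings)


summary = investigate({"name": "HighErrorRate"})
```

The synthesizer only runs once every subtask has returned, which is the same barrier the graph places between subagent_executor and synthesizer.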

Episodic Memory System

After every investigation, OpenSRE stores the episode in its episodic memory. The episodic memory lifecycle:

1. Investigation completes: the writeup node produces a structured report
2. LLM extraction: metadata is extracted (summary, root cause, alert_type, affected services, severity, resolution status)
3. Storage: the episode is stored in PostgreSQL via the config-service API
4. Retrieval: before the next investigation, init_context queries episodic memory for similar past episodes using weighted scoring: alert_type (0.5), service (0.3), resolved status (0.2)
5. Context injection: relevant past episodes are injected into the planner's context
6. Strategy generation: when 2+ episodes share an alert_type, OpenSRE auto-generates reusable investigation strategies
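As a rough illustration of the retrieval and strategy-generation steps, the sketch below applies the three documented weights and the 2+ episode threshold. How OpenSRE actually combines the weights is not specified here, so the additive exact-match scoring, the `Episode` fields, and the helper names are all assumptions.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List

# Weights from the retrieval step; the additive combination is an assumption.
WEIGHTS = {"alert_type": 0.5, "service": 0.3, "resolved": 0.2}


@dataclass
class Episode:
    # Hypothetical subset of the extracted metadata.
    alert_type: str
    services: List[str]
    resolved: bool
    root_cause: str


def similarity(episode: Episode, alert_type: str, service: str) -> float:
    """Score a past episode against the incoming alert."""
    score = 0.0
    if episode.alert_type == alert_type:
        score += WEIGHTS["alert_type"]
    if service in episode.services:
        score += WEIGHTS["service"]
    if episode.resolved:
        score += WEIGHTS["resolved"]
    return score


def top_episodes(episodes: List[Episode], alert_type: str, service: str, k: int = 3) -> List[Episode]:
    """Return the k most similar episodes, best first."""
    ranked = sorted(episodes, key=lambda e: similarity(e, alert_type, service), reverse=True)
    return ranked[:k]


def alert_types_ready_for_strategy(episodes: List[Episode], min_episodes: int = 2) -> List[str]:
    """Alert types with enough episodes to justify a reusable strategy."""
    counts = Counter(e.alert_type for e in episodes)
    return sorted(t for t, n in counts.items() if n >= min_episodes)


history = [
    Episode("HighErrorRate", ["payments-service"], True, "connection pool exhausted"),
    Episode("HighErrorRate", ["checkout"], False, "unknown"),
    Episode("DiskPressure", ["payments-service"], True, "log volume full"),
]
best = top_episodes(history, "HighErrorRate", "payments-service", k=1)[0]
```

Under these assumptions, a resolved episode with the same alert type and service scores highest, so the planner sees the most actionable precedent first.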

This loop is how OpenSRE improves over time. The first investigation of a payments-service outage starts from scratch. By the tenth, the planner can draw on patterns, root causes, and strategies from past episodes.

Knowledge Graph

OpenSRE maintains a live service topology graph in Neo4j:

Nodes: services, deployments, infrastructure components, teams
Edges: depends-on, owned-by, calls, deployed-to

During investigation, agents can query the graph:

Blast radius analysis: given a failing service, what services depend on it?
Dependency traversal: what does this service depend on that might have caused this?
Ownership lookup: which team owns this service?
Recent change detection: what was deployed recently near this service?
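Blast radius analysis amounts to walking depends-on edges in reverse. The sketch below does this over an in-memory dictionary instead of Neo4j, purely to show the traversal; the topology and the function name are hypothetical.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical topology: each service maps to the services it depends on.
DEPENDS_ON: Dict[str, List[str]] = {
    "checkout": ["payments-service", "cart"],
    "payments-service": ["postgres"],
    "cart": ["postgres"],
    "postgres": [],
}


def blast_radius(failing: str, depends_on: Dict[str, List[str]]) -> Set[str]:
    """Return every service that directly or transitively depends on `failing`."""
    # Invert the edges so we can walk from a dependency to its dependents.
    dependents: Dict[str, Set[str]] = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    affected: Set[str] = set()
    queue = deque([failing])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, ()):
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    # Exclude the failing service itself if a dependency cycle leads back to it.
    affected.discard(failing)
    return affected


impacted = sorted(blast_radius("postgres", DEPENDS_ON))
```

Dependency traversal is the same walk in the forward direction, following `depends_on` directly instead of the inverted map.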

Skills System

The 46 investigation skills are loaded on demand. When an agent needs to check Kubernetes pod status, it calls load_skill("k8s-debug"), which loads the skill's context and tools, and then calls run_script to execute specific checks. Because only the skills in use occupy the context window, the agent never has to carry all 46 at once.
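The progressive loading described above can be sketched as a lazy registry. `load_skill` and `run_script` are the names used in this document, but their real signatures are not shown here, so the arguments, the `SkillSession` class, and the `pod-status` script are assumptions.

```python
from typing import Callable, Dict, List


def _load_k8s_debug() -> Dict[str, Callable[..., str]]:
    # Hypothetical skill body: maps script names to callables.
    return {"pod-status": lambda namespace: f"checked pods in {namespace}"}


# Hypothetical catalog: skill name -> loader invoked only when first needed.
SKILL_CATALOG: Dict[str, Callable[[], Dict[str, Callable[..., str]]]] = {
    "k8s-debug": _load_k8s_debug,
}


class SkillSession:
    """Tracks which skills one agent has loaded into its working context."""

    def __init__(self, catalog: Dict[str, Callable[[], Dict[str, Callable[..., str]]]]) -> None:
        self._catalog = catalog
        self._loaded: Dict[str, Dict[str, Callable[..., str]]] = {}

    @property
    def loaded(self) -> List[str]:
        return sorted(self._loaded)

    def load_skill(self, name: str) -> List[str]:
        """Load a skill on first use and return the scripts it exposes."""
        if name not in self._loaded:
            self._loaded[name] = self._catalog[name]()
        return sorted(self._loaded[name])

    def run_script(self, skill: str, script: str, **kwargs: str) -> str:
        """Run one script from a skill that has already been loaded."""
        if skill not in self._loaded:
            raise RuntimeError(f"skill {skill!r} must be loaded before use")
        return self._loaded[skill][script](**kwargs)


session = SkillSession(SKILL_CATALOG)
scripts = session.load_skill("k8s-debug")
result = session.run_script("k8s-debug", "pod-status", namespace="prod")
```

Nothing is loaded until the first `load_skill` call, which is the property that keeps unused skills out of the agent's context.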

Skills are organized by domain:

Kubernetes: pod status, deployment health, resource usage
Metrics: Prometheus queries, Grafana dashboards, Datadog metrics
Logs: Elastic/ELK log analysis, Splunk search
Traces: Jaeger, Datadog APM, Sentry errors
Infrastructure: DNS, networking, cloud provider health

Service Ports

| Service | Host Port | Description |
|---------|-----------|-------------|
| PostgreSQL | 5433 | Primary database |
| config-service | 8081 | Configuration API |
| Neo4j HTTP | 7475 | Neo4j browser |
| Neo4j Bolt | 7688 | Neo4j driver connection |
| LiteLLM | 4001 | LLM proxy |
| sre-agent | 8001 | Investigation agent API |
| web-ui | 3002 | Admin console |
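For scripts that need to reach these services from the host, the table can be kept as a small lookup. This is a convenience sketch only: the `localhost` host, the URL schemes, and the `endpoint` helper are assumptions, not part of OpenSRE.

```python
# Host ports copied from the table above.
HOST_PORTS = {
    "PostgreSQL": 5433,
    "config-service": 8081,
    "Neo4j HTTP": 7475,
    "Neo4j Bolt": 7688,
    "LiteLLM": 4001,
    "sre-agent": 8001,
    "web-ui": 3002,
}


def endpoint(service: str, host: str = "localhost", scheme: str = "http") -> str:
    """Build a base URL for a service (host and scheme are assumed defaults)."""
    return f"{scheme}://{host}:{HOST_PORTS[service]}"


agent_url = endpoint("sre-agent")
bolt_url = endpoint("Neo4j Bolt", scheme="bolt")
```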