What is Episodic Memory in SRE?
Episodic memory is a system that allows AI SRE agents to remember and learn from past investigations. Instead of starting each incident investigation from scratch, an AI agent with episodic memory can recall: "Last time this alert fired on this service, the root cause was X, and we resolved it by doing Y." This accumulated institutional knowledge is what makes AI SRE agents get better over time.
The Problem: Stateless AI Isn't Enough
Most AI tools are stateless. They have knowledge from their training data, but they don't remember your specific systems, your specific incidents, or what's worked for you in the past.
Imagine a senior SRE on their first day at a new company. They know SRE practices well, but they don't know your systems. Now imagine that same SRE after two years on call. They've seen hundreds of incidents, they recognize patterns, and they know the quirks of each service. Episodic memory is how we give that accumulated expertise to an AI agent.
How Episodic Memory Works in OpenSRE
Step 1: Investigation Completes
After every incident investigation, the writeup node produces a structured report: what happened, what evidence was found, what the root cause was, what was done to resolve it.
Step 2: Metadata Extraction
An LLM extracts structured metadata from the investigation outcome:
- Summary: 2-3 sentence description of the incident
- Root cause: The identified root cause (e.g., "connection pool exhaustion due to sudden traffic spike")
- Alert type: Category of alert (e.g., high_error_rate, pod_crashloop, latency_spike)
- Affected services: Which services were involved
- Severity: critical / high / medium / low
- Resolution status: Was it resolved? How?
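The fields above map naturally onto a small typed record. The following is a minimal sketch, not OpenSRE's actual schema: the names `EpisodeMetadata`, `Severity`, and `parse_metadata`, and the JSON key names, are assumptions made for illustration. It shows one way to validate the LLM's JSON output before it is stored.

```python
import json
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Severity(str, Enum):
    CRITICAL = "critical"
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


@dataclass
class EpisodeMetadata:
    """Hypothetical container for the metadata fields listed above."""
    summary: str
    root_cause: Optional[str]       # None when the cause is still unknown
    alert_type: str                 # e.g. "high_error_rate"
    affected_services: list
    severity: Severity
    resolved: bool
    resolution: Optional[str] = None  # how it was resolved, if it was


def parse_metadata(raw_json: str) -> EpisodeMetadata:
    """Parse and validate the JSON text returned by the extraction step."""
    data = json.loads(raw_json)
    return EpisodeMetadata(
        summary=data["summary"],
        root_cause=data.get("root_cause"),
        alert_type=data["alert_type"],
        affected_services=list(data["affected_services"]),
        severity=Severity(data["severity"]),  # ValueError on an unknown level
        resolved=bool(data["resolved"]),
        resolution=data.get("resolution"),
    )
```

Rejecting an unknown severity at parse time keeps malformed LLM output from reaching storage, where it would otherwise distort later filtering.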
Step 3: Episode Storage
The episode is stored in PostgreSQL via OpenSRE's config-service. All investigations are stored, not just resolved ones. An unresolved incident whose root cause is still recorded as "we don't know yet" remains valuable context for future investigations.
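To make the "store everything" rule concrete, here is a rough sketch of an episode table that accepts unresolved incidents. It uses Python's built-in `sqlite3` as a stand-in so the example runs anywhere; OpenSRE itself uses PostgreSQL, and the table layout, column names, and the `store_episode` helper are all assumptions rather than the config-service's real schema.

```python
import json
import sqlite3

# Hypothetical schema: root_cause is nullable so that unresolved
# investigations can be stored alongside resolved ones.
SCHEMA = """
CREATE TABLE IF NOT EXISTS episodes (
    id                INTEGER PRIMARY KEY,
    alert_type        TEXT NOT NULL,
    affected_services TEXT NOT NULL,   -- JSON-encoded list of service names
    severity          TEXT NOT NULL,
    summary           TEXT NOT NULL,
    root_cause        TEXT,            -- NULL while the cause is unknown
    resolved          INTEGER NOT NULL -- 0 or 1
)
"""


def store_episode(conn: sqlite3.Connection, episode: dict) -> int:
    """Insert one episode, resolved or not, and return its row id."""
    cur = conn.execute(
        "INSERT INTO episodes "
        "(alert_type, affected_services, severity, summary, root_cause, resolved) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (
            episode["alert_type"],
            json.dumps(episode["affected_services"]),
            episode["severity"],
            episode["summary"],
            episode.get("root_cause"),
            1 if episode["resolved"] else 0,
        ),
    )
    conn.commit()
    return cur.lastrowid
```

Leaving `root_cause` nullable, rather than requiring a value, is what lets an open investigation be written at the same point in the pipeline as a closed one.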
Step 4: Similarity Retrieval
Before the next investigation begins, OpenSRE queries episodic memory for similar past episodes. Similarity is computed using weighted scoring:
| Factor | Weight | Rationale |
|--------|--------|-----------|
| Alert type match | 0.5 | Same category of alert = most relevant |
| Service overlap | 0.3 | Same services involved = highly relevant |
| Resolution status | 0.2 | Resolved episodes have proven remediation |
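The weighted scoring can be sketched in a few lines. The weights come from the table; everything else is an assumption for illustration. In particular, the source does not say how service overlap is measured, so this sketch uses Jaccard overlap (shared services divided by all services involved), and the names `similarity` and `top_episodes` are hypothetical.

```python
# Weights taken from the table above.
W_ALERT_TYPE = 0.5
W_SERVICE_OVERLAP = 0.3
W_RESOLVED = 0.2


def similarity(alert_type: str, services: list, episode: dict) -> float:
    """Score a past episode against an incoming alert, in the range 0..1."""
    alert_score = 1.0 if episode["alert_type"] == alert_type else 0.0

    # Assumption: overlap is the Jaccard index of the two service sets.
    current, past = set(services), set(episode["affected_services"])
    union = current | past
    overlap_score = len(current & past) / len(union) if union else 0.0

    resolved_score = 1.0 if episode["resolved"] else 0.0

    return (
        W_ALERT_TYPE * alert_score
        + W_SERVICE_OVERLAP * overlap_score
        + W_RESOLVED * resolved_score
    )


def top_episodes(alert_type: str, services: list, episodes: list, k: int = 3) -> list:
    """Return the k highest-scoring past episodes, best first."""
    ranked = sorted(
        episodes,
        key=lambda ep: similarity(alert_type, services, ep),
        reverse=True,
    )
    return ranked[:k]
```

Under these assumptions, a resolved episode with the same alert type and the same services scores 1.0, while an unresolved episode on unrelated services with a different alert type scores 0.0.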
Step 5: Context Injection
The top matching episodes are injected into the investigation planner's context. The planner sees concrete, grounded context from your actual environment:
"Similar past episode: On 2026-02-14, payments-service had a high_error_rate alert. Root cause: database connection pool exhaustion during traffic spike after a marketing campaign. Resolved by increasing pool size from 10 to 50 connections and adding circuit breaker. Duration: 23 minutes."
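One way to produce context like the quoted example is a plain formatting function over the stored fields. This is a sketch only: the field names (`date`, `resolution`, `duration_minutes`), the helpers `format_episode` and `build_planner_context`, and the exact wording of the prompt text are assumptions modelled on the example above, not OpenSRE's actual prompt template.

```python
def format_episode(ep: dict) -> str:
    """Render one stored episode as a short natural-language note."""
    services = ", ".join(ep["affected_services"])
    parts = [
        f"Similar past episode: On {ep['date']}, {services} had a "
        f"{ep['alert_type']} alert."
    ]
    # Unresolved episodes are still included, with the gap stated plainly.
    if ep.get("root_cause"):
        parts.append(f"Root cause: {ep['root_cause']}.")
    else:
        parts.append("Root cause: not yet identified.")
    if ep.get("resolution"):
        parts.append(f"Resolved by {ep['resolution']}.")
    if ep.get("duration_minutes") is not None:
        parts.append(f"Duration: {ep['duration_minutes']} minutes.")
    return " ".join(parts)


def build_planner_context(ranked_episodes: list, max_episodes: int = 3) -> str:
    """Join the top-ranked episodes into one block for the planner prompt."""
    return "\n".join(format_episode(ep) for ep in ranked_episodes[:max_episodes])
```

Capping the number of injected episodes keeps the planner's context focused on the strongest matches instead of growing with the size of the history.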
Step 6: Strategy Generation
When 2 or more similar episodes accumulate, OpenSRE automatically generates a reusable investigation strategy. This strategy captures the common investigation path: start with connection pool metrics, check for recent deployments or traffic spikes, query Datadog for database latency, etc.
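The trigger condition can be illustrated with a simple grouping pass. This sketch covers only the "2 or more similar episodes" check, and it assumes that "similar" means sharing an alert type; the source does not define the grouping rule, and the names `MIN_EPISODES_FOR_STRATEGY` and `groups_ready_for_strategy` are hypothetical. Turning a group into an actual strategy is out of scope here.

```python
from collections import defaultdict

# Threshold from the text: a strategy is generated once 2 or more
# similar episodes exist.
MIN_EPISODES_FOR_STRATEGY = 2


def groups_ready_for_strategy(episodes: list) -> dict:
    """Group episodes by alert type and keep the groups that meet the threshold.

    Assumption: episodes count as "similar" when they share an alert type.
    """
    groups = defaultdict(list)
    for ep in episodes:
        groups[ep["alert_type"]].append(ep)
    return {
        alert_type: eps
        for alert_type, eps in groups.items()
        if len(eps) >= MIN_EPISODES_FOR_STRATEGY
    }
```

Each returned group would then be handed to whatever step distills the shared investigation path into a reusable strategy.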
The Difference From Fine-Tuning
Episodic memory is not fine-tuning. Fine-tuning bakes knowledge into the model weights — it's expensive, requires ML expertise, and goes stale as your systems evolve.
Episodic memory is a retrieval system. Episodes are stored in a database, queried at investigation time, and injected as context. No retraining is needed: every completed investigation adds a new episode, so the memory stays current automatically.
What Gets Better Over Time
The compound effect of episodic memory:
| Investigations | What changes |
|----------------|--------------|
| 1-5 | Baseline performance, building history |
| 5-10 | Relevant past episodes surface for recurring alert types |
| 10+ | Auto-generated strategies for common patterns |
| 20+ | High-confidence root cause predictions for known patterns |
Viewing Your Episodic Memory
The OpenSRE web console includes an episodic memory browser. View all stored episodes, filter by service or severity, see which strategies have been generated, and review the context being provided to active investigations.