SRE & AI Glossary

Definitions of key terms in site reliability engineering and AI-powered operations.

AI SRE

An AI SRE (AI Site Reliability Engineer) is an autonomous software agent that performs incident investigation and response tasks that would otherwise require a human SRE. It gathers context from monitoring tools, reasons about root causes, and produces structured incident reports.

In OpenSRE: OpenSRE is an open-source AI SRE platform. Its agents investigate production incidents using 46 investigation skills across Kubernetes, Prometheus, Grafana, Datadog, and other tools.

Alert Fatigue

Alert fatigue is the desensitization of on-call engineers to alerts due to high volume, frequent false positives, or low-signal notifications. It leads to missed critical alerts, slow response times, and burnout.

In OpenSRE: OpenSRE addresses alert fatigue by automatically triaging and investigating alerts. This reduces the cognitive load on on-call engineers and means every alert is investigated rather than ignored.

Blast Radius Analysis

Blast radius analysis is the process of determining which services, systems, or users are affected by a failure or change. In microservices architectures, a single service failure can cascade to many dependent services.

In OpenSRE: OpenSRE performs blast radius analysis using its Neo4j knowledge graph. When a service fails, it traverses the dependency graph to identify all affected downstream services and checks their health as part of the investigation.
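
The traversal described above can be sketched as a breadth-first search over a dependency map. This is a minimal illustration, not OpenSRE's implementation: it uses an in-memory dictionary in place of Neo4j, and the service names, the `DEPENDENTS` map, and the `blast_radius` function are all hypothetical.

```python
from collections import deque

# Hypothetical topology: each key maps to the services that depend on it.
DEPENDENTS = {
    "postgres": ["orders", "users"],
    "orders": ["checkout"],
    "users": ["checkout", "profile"],
    "checkout": ["web"],
    "profile": ["web"],
    "web": [],
}


def blast_radius(failed, dependents):
    """Return every service that transitively depends on `failed`, sorted by name."""
    seen = set()
    queue = deque([failed])
    while queue:
        service = queue.popleft()
        for dependent in dependents.get(service, []):
            # Skip services already visited, and the failed service itself
            # in case the graph contains a cycle back to it.
            if dependent not in seen and dependent != failed:
                seen.add(dependent)
                queue.append(dependent)
    return sorted(seen)


print(blast_radius("orders", DEPENDENTS))  # ['checkout', 'web']
```

In this sketch a failure in `postgres` reaches every other service, while a failure in `orders` reaches only `checkout` and `web`.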

Episodic Memory

Episodic memory in AI systems is a mechanism for storing and retrieving records of specific past events. Unlike semantic memory (general knowledge), episodic memory records what happened, when, and in what context, which lets a system learn from specific experiences.

In OpenSRE: OpenSRE's episodic memory stores every past investigation as an episode, including root cause, affected services, and resolution details. Before each new investigation, it retrieves similar past episodes to guide the analysis.
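
As a rough illustration of store-then-retrieve, the sketch below keeps episodes in a list and ranks them by word overlap (Jaccard similarity) with the new alert text. The `Episode` fields mirror the details listed above, but the class names, the `summary` field, and the similarity measure are assumptions made for this example; how OpenSRE actually matches episodes is not specified here.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    summary: str            # hypothetical free-text description of the incident
    root_cause: str
    affected_services: list
    resolution: str


def jaccard(a, b):
    """Word-overlap similarity between two strings, from 0.0 to 1.0."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


class EpisodicMemory:
    def __init__(self):
        self._episodes = []

    def store(self, episode):
        self._episodes.append(episode)

    def retrieve(self, alert_text, k=3):
        """Return up to k stored episodes, most similar to alert_text first."""
        ranked = sorted(
            self._episodes,
            key=lambda e: jaccard(alert_text, e.summary),
            reverse=True,
        )
        return ranked[:k]


memory = EpisodicMemory()
memory.store(Episode("checkout latency spike after deploy",
                     "connection pool exhausted", ["checkout"], "rolled back"))
memory.store(Episode("orders pod crashloop out of memory",
                     "memory limit too low", ["orders"], "raised limit"))

best = memory.retrieve("checkout latency spike", k=1)[0]
print(best.root_cause)  # connection pool exhausted
```

A production system would more likely compare embeddings than raw words; word overlap is used here only to keep the example dependency-free.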

Incident Investigation

Incident investigation is the process of determining the root cause, scope, and timeline of a production incident. It involves gathering evidence from monitoring tools, forming hypotheses, testing them, and documenting findings.

In OpenSRE: Incident investigation is OpenSRE's primary function. Its AI agents conduct the full investigation cycle autonomously — from initial context gathering through root cause identification to report generation.

Investigation Skills

In the context of AI SRE agents, investigation skills are modular capabilities that allow an agent to query specific tools or data sources. Each skill encapsulates the tools and context needed to investigate one domain.

In OpenSRE: OpenSRE has 46 built-in investigation skills covering Kubernetes, Prometheus, Grafana, Datadog, Elastic, Splunk, Jaeger, Sentry, and more. Skills are loaded on demand during investigations.
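
One way to picture modular, on-demand skills is a registry that stores a factory per skill and builds the skill only when it is first requested. The sketch below is hypothetical throughout: `SkillRegistry`, the factory functions, and the string results are invented for illustration and do not reflect OpenSRE's skill API.

```python
class SkillRegistry:
    """Maps skill names to factories and instantiates each skill on first use."""

    def __init__(self):
        self._factories = {}
        self._loaded = {}

    def register(self, name, factory):
        self._factories[name] = factory

    def get(self, name):
        # Build the skill lazily, then cache it for later calls.
        if name not in self._loaded:
            self._loaded[name] = self._factories[name]()
        return self._loaded[name]

    def loaded(self):
        return sorted(self._loaded)


def make_kubernetes_skill():
    # A real skill would wrap a Kubernetes client; this stub only echoes the query.
    return lambda query: f"[kubernetes] would run: {query}"


def make_prometheus_skill():
    return lambda query: f"[prometheus] would run: {query}"


registry = SkillRegistry()
registry.register("kubernetes", make_kubernetes_skill)
registry.register("prometheus", make_prometheus_skill)

result = registry.get("kubernetes")("list pods in namespace payments")
print(result)             # [kubernetes] would run: list pods in namespace payments
print(registry.loaded())  # ['kubernetes']
```

Only the Kubernetes skill is instantiated here, because the Prometheus skill was registered but never requested.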

Knowledge Graph

A knowledge graph is a database that stores entities (nodes) and their relationships (edges) in graph form. In infrastructure contexts, it typically represents services, their dependencies, ownership, and infrastructure components.

In OpenSRE: OpenSRE maintains a Neo4j-powered knowledge graph of your service topology. It enables blast radius analysis, dependency traversal, and ownership lookup during incident investigation.

LangGraph

LangGraph is an open-source framework for building stateful, multi-agent AI applications. It represents agent workflows as directed graphs, where nodes are processing steps and edges define the flow of information between them.

In OpenSRE: OpenSRE uses LangGraph to orchestrate its investigation pipeline: planner → parallel subagent executors → synthesizer → writeup → episodic memory storage.
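
The shape of that pipeline can be sketched in plain Python. The example below deliberately does not use LangGraph: each stage is an ordinary function, the parallel executors are approximated with a thread pool, and every function name and return value is a placeholder rather than OpenSRE code.

```python
from concurrent.futures import ThreadPoolExecutor


def planner(alert):
    # Hypothetical plan: one investigation task per data source.
    return [f"check {source} for {alert}" for source in ("metrics", "logs", "traces")]


def executor(task):
    # Stand-in for a subagent; a real one would query a monitoring tool.
    return f"finding from '{task}'"


def synthesizer(findings):
    return {"root_cause": "undetermined (stub)", "evidence": findings}


def writeup(synthesis):
    lines = [f"Root cause: {synthesis['root_cause']}"]
    lines += [f"- {item}" for item in synthesis["evidence"]]
    return "\n".join(lines)


def run_pipeline(alert, memory):
    tasks = planner(alert)
    # Run the executors concurrently; pool.map returns results in task order.
    with ThreadPoolExecutor() as pool:
        findings = list(pool.map(executor, tasks))
    report = writeup(synthesizer(findings))
    # Final stage: store the finished investigation as an episode.
    memory.append({"alert": alert, "report": report})
    return report


memory = []
report = run_pipeline("high latency on checkout", memory)
print(report)
```

In LangGraph, each of these functions would typically become a node and the hand-offs between them would become edges over a shared state object.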

MTTR (Mean Time to Resolution)

Mean Time to Resolution (MTTR) is the average time from when an incident is detected to when it is fully resolved. It is a key metric for measuring incident response effectiveness and SRE team performance.

In OpenSRE: Reducing MTTR is OpenSRE's primary value proposition. By automating the investigation phase (typically 60-80% of resolution time), OpenSRE significantly reduces MTTR for production incidents.
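
As a worked example of the definition, MTTR is the sum of the detection-to-resolution durations divided by the number of incidents. The timestamps below are made up for illustration, and `mttr` is a hypothetical helper.

```python
from datetime import datetime, timedelta


def mttr(incidents):
    """Mean of (resolved - detected) over a list of (detected, resolved) pairs."""
    if not incidents:
        raise ValueError("MTTR is undefined for an empty incident list")
    durations = [resolved - detected for detected, resolved in incidents]
    return sum(durations, timedelta()) / len(durations)


incidents = [
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 30)),  # 30 minutes
    (datetime(2024, 1, 2, 14, 0), datetime(2024, 1, 2, 15, 30)),  # 90 minutes
]

print(mttr(incidents))  # 1:00:00
```

Two incidents lasting 30 and 90 minutes give an MTTR of 60 minutes.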

Observability

Observability is the ability to understand the internal state of a system from its external outputs. In software engineering, observability is typically achieved through three pillars: logs, metrics, and traces.

In OpenSRE: OpenSRE integrates with your entire observability stack — Prometheus for metrics, Elastic/Splunk for logs, Jaeger/Datadog for traces — querying all three pillars during incident investigation.

On-Call Automation

On-call automation refers to software systems that automatically handle incident response tasks that would otherwise require a human engineer to be paged and manually intervene. It ranges from simple runbook execution to full AI-powered investigation.

In OpenSRE: OpenSRE automates the investigation phase of on-call response. When an alert fires, OpenSRE investigates immediately without requiring an engineer to start the process, reducing time-to-investigation from minutes to seconds.

Root Cause Analysis

Root cause analysis (RCA) is the process of identifying the fundamental reason a problem occurred, rather than just treating its symptoms. In incident management, RCA aims to understand why an incident happened to prevent recurrence.

In OpenSRE: OpenSRE performs automated root cause analysis as part of every investigation, synthesizing evidence from multiple data sources to identify the most probable root cause and the evidence that supports it.

Runbook Automation

Runbook automation is the process of codifying and automatically executing operational procedures (runbooks) in response to specific events or conditions. It transforms manual, step-by-step procedures into automated workflows.

In OpenSRE: OpenSRE complements runbook automation. While runbook automation handles known, well-defined scenarios, OpenSRE's AI agents handle novel incidents that don't match existing runbooks.
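
The division of labor described above can be sketched as a dispatcher: alerts with a matching runbook are handled by it, and everything else falls through to investigation. The alert names, the `RUNBOOKS` table, and both handler functions are hypothetical and stand in for real automation.

```python
def restart_service(alert):
    # Stand-in for a codified runbook step.
    return f"restarted {alert['service']}"


def investigate(alert):
    # Stand-in for handing the alert to an AI investigation.
    return f"investigating {alert['name']} on {alert['service']}"


# Known, well-defined scenarios mapped to their runbooks.
RUNBOOKS = {"ServiceDown": restart_service}


def handle_alert(alert, runbooks, fallback):
    """Run the matching runbook if one exists, otherwise use the fallback."""
    runbook = runbooks.get(alert["name"])
    if runbook is not None:
        return ("runbook", runbook(alert))
    return ("investigation", fallback(alert))


known = handle_alert({"name": "ServiceDown", "service": "orders"}, RUNBOOKS, investigate)
novel = handle_alert({"name": "LatencyAnomaly", "service": "checkout"}, RUNBOOKS, investigate)
print(known)  # ('runbook', 'restarted orders')
print(novel)  # ('investigation', 'investigating LatencyAnomaly on checkout')
```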

Service Topology

Service topology describes the structure and relationships of services in a distributed system — which services communicate with which, what dependencies exist, and how they're organized. Understanding topology is critical for blast radius analysis and root cause investigation.

In OpenSRE: OpenSRE maps your service topology in a Neo4j knowledge graph, updated continuously with new deployments and dependency changes. This topology context is automatically provided to investigation agents.

SRE Agent

An SRE agent is an autonomous AI system that performs site reliability engineering tasks — monitoring infrastructure health, investigating incidents, executing operational procedures, and maintaining system reliability — with minimal human intervention.

In OpenSRE: The sre-agent is OpenSRE's core component: a LangGraph-orchestrated AI agent that receives alerts, coordinates investigation subagents, and produces incident reports. It's the brain of the OpenSRE platform.