Frequently Asked Questions
Everything you need to know about OpenSRE.
For SREs & Platform Engineers
What is OpenSRE?
OpenSRE is an open-source AI SRE platform that investigates production incidents autonomously. It uses LangGraph to orchestrate multiple AI agents that gather context from your observability stack, reason about root causes, and produce structured incident reports. Unlike stateless AI tools, OpenSRE has episodic memory that learns from every past investigation, and a Neo4j knowledge graph that maps your service topology.
How does OpenSRE investigate incidents?
When an alert fires, OpenSRE's planner agent breaks the investigation into parallel subtasks — checking Kubernetes health, querying Prometheus metrics, reading application logs, analyzing distributed traces. Multiple subagent executors run these subtasks simultaneously. A synthesizer combines the findings and produces a structured report with root cause, evidence, and remediation suggestions.
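As a rough sketch, the fan-out/fan-in shape of that pipeline looks like this (subtask names and structure are illustrative, not OpenSRE's actual code):

```python
import asyncio

# Illustrative sketch of planner fan-out; not OpenSRE's actual implementation.
async def run_subtask(name: str) -> dict:
    # A real executor would call an investigation skill here
    # (kubectl, PromQL queries, log searches, trace analysis...).
    await asyncio.sleep(0)  # stand-in for I/O-bound tool calls
    return {"subtask": name, "findings": f"results for {name}"}

async def investigate(alert: str) -> dict:
    subtasks = ["k8s-health", "prometheus-metrics", "app-logs", "traces"]
    # Fan out: all subtasks run concurrently, like OpenSRE's subagent executors.
    results = await asyncio.gather(*(run_subtask(s) for s in subtasks))
    # Fan in: a synthesizer step would turn these findings into a
    # structured report with root cause, evidence, and remediation.
    return {"alert": alert, "findings": list(results)}

report = asyncio.run(investigate("HighErrorRate"))
```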
What investigation skills does OpenSRE have?
OpenSRE has 46 built-in investigation skills covering Kubernetes, Prometheus, Grafana, Datadog, Elastic/ELK, Splunk, Jaeger, Sentry, PagerDuty, Slack, GitHub, and more. Skills are loaded on-demand — agents request the skill they need, preventing context window bloat. You can also add custom skills for tools specific to your stack.
How does episodic memory work in OpenSRE?
After every investigation, OpenSRE extracts metadata — root cause, alert type, affected services, severity, resolution status — and stores it as an episode. Before each new investigation, it retrieves similar past episodes using weighted similarity scoring (alert type, service overlap, resolution status). This context is injected into the investigation planner, so OpenSRE learns from every incident.
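A minimal sketch of what weighted similarity scoring could look like (the weights and field names here are illustrative assumptions, not OpenSRE's actual values):

```python
# Weights are illustrative; OpenSRE's actual scoring may differ.
WEIGHTS = {"alert_type": 0.5, "service_overlap": 0.3, "resolved": 0.2}

def episode_similarity(current: dict, past: dict) -> float:
    """Score a past episode's relevance to the current alert (0.0 to 1.0)."""
    score = 0.0
    if current["alert_type"] == past["alert_type"]:
        score += WEIGHTS["alert_type"]
    cur, old = set(current["services"]), set(past["services"])
    if cur | old:
        # Jaccard overlap of affected services.
        score += WEIGHTS["service_overlap"] * len(cur & old) / len(cur | old)
    if past.get("resolved"):
        # Resolved episodes carry proven remediations, so they rank higher.
        score += WEIGHTS["resolved"]
    return score
```

The top-scoring episodes would then be injected into the planner's context before the new investigation begins.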
What integrations are supported?
OpenSRE integrates with Kubernetes, Prometheus, Grafana, Datadog, New Relic, Elastic/ELK, Splunk, Jaeger, Sentry, PagerDuty, OpsGenie, Slack, GitHub, and Confluence. For LLM providers, it uses LiteLLM as a proxy, supporting Anthropic, OpenAI, OpenRouter, and any compatible provider including self-hosted models.
Can I add custom investigation skills?
Yes. Skills are defined as directories in `.claude/skills/` with a `SKILL.md` describing the skill's tools and context, plus executable scripts. Any tool your engineers use can become an investigation skill — internal monitoring tools, proprietary deployment systems, custom alerting pipelines.
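As a sketch, a custom skill for an internal deployment tool might be laid out like this (directory and file names are illustrative):

```
.claude/skills/internal-deploys/
    SKILL.md          # describes the skill's tools, context, and when to use it
    check_rollout.sh  # executable script the agent can invoke
```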
How does OpenSRE connect to my observability stack?
OpenSRE connects to your stack via integration configuration in `litellm_config.yaml` and the config-service. Each integration has a URL and credentials. During investigations, agents query these integrations directly — Prometheus for metrics, Elasticsearch for logs, Jaeger for traces, Kubernetes API for pod health.
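An illustrative shape for such an integration entry (field names are assumptions; consult the config-service documentation for the exact schema):

```yaml
# Illustrative only -- not the exact OpenSRE schema.
integrations:
  prometheus:
    url: http://prometheus.monitoring:9090
    # credentials typically come from environment variables or secrets
  elasticsearch:
    url: http://elasticsearch.logging:9200
```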
For Engineering Managers
How does OpenSRE reduce MTTR?
OpenSRE reduces MTTR by automating the investigation phase: the portion of incident resolution time (often estimated at 60-80%) spent gathering context, forming hypotheses, and querying tools. It starts investigating the moment an alert fires, runs multiple investigation threads in parallel, and presents the on-call engineer with a fully investigated situation rather than a blank dashboard.
How is OpenSRE different from PagerDuty AI or Rootly AI?
OpenSRE is open-source (Apache 2.0) and self-hosted, while PagerDuty and Rootly are commercial SaaS products. OpenSRE's key differentiators are episodic memory (it learns from every incident) and a Neo4j knowledge graph (it maps your service topology for blast radius analysis), features that, as of this writing, neither commercial tool offers. OpenSRE also supports any LLM provider via LiteLLM, not just one fixed model.
Is OpenSRE production-ready?
OpenSRE is actively used in production environments. It runs as a set of Docker Compose services (or Kubernetes via Helm chart) and has been tested with real production incident investigation scenarios. The core agent pipeline, episodic memory, and knowledge graph components are stable. As with any open-source platform, you're responsible for your own deployment and operations.
What's the cost of running OpenSRE?
OpenSRE itself is free and open-source. The main cost is infrastructure (running PostgreSQL, Neo4j, and the agent services) and LLM API usage. Infrastructure costs are typically $50-200/month on a small cloud instance. LLM costs depend on investigation volume and which model you use — Claude Haiku or similar efficient models keep costs low.
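A back-of-envelope estimate, where every number is an illustrative assumption rather than a measured OpenSRE cost:

```python
# All figures below are illustrative assumptions, not measured costs.
investigations_per_month = 100
tokens_per_investigation = 200_000   # summed across all agent turns
cost_per_million_tokens = 1.0        # e.g. an efficient model like Claude Haiku

llm_cost = (investigations_per_month * tokens_per_investigation
            / 1_000_000 * cost_per_million_tokens)
infra_cost = 100                     # midpoint of the $50-200/month range above

total = llm_cost + infra_cost        # monthly total in USD
```

With these assumptions the LLM bill is a small fraction of the total; your own volume and model choice will dominate the real number.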
Can OpenSRE replace our on-call rotation?
OpenSRE augments on-call engineers rather than replacing them. It handles the investigation phase autonomously: gathering context, forming hypotheses, identifying probable root causes. The on-call engineer validates the findings and implements fixes. For low-severity, well-understood incidents with clear runbooks, you can configure automated remediation, but human oversight remains required for production changes.
For DevOps Generalists
How do I set up OpenSRE?
Clone the repository, create a `.env` file with your `OPENROUTER_API_KEY`, and run `make dev`. This starts all services including PostgreSQL, Neo4j, the sre-agent, and the web console. The web console is available at http://localhost:3002. Full setup guide at /docs/quick-start.
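For reference, the `.env` file can be as small as one line (the value below is a placeholder, not a real key):

```
OPENROUTER_API_KEY=sk-or-your-key-here
```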
What infrastructure does OpenSRE need?
OpenSRE requires Docker and Docker Compose for local development. In production, it runs on Kubernetes via a Helm chart. Core dependencies are PostgreSQL (for episodic memory and config), Neo4j (for the knowledge graph), and a LiteLLM instance (for LLM routing). A small setup runs comfortably on 4 vCPUs and 8GB RAM.
Does OpenSRE work with Kubernetes?
Yes — Kubernetes investigation is one of OpenSRE's primary use cases. The k8s-debug skill checks pod health, restart counts, and OOMKilled events. The k8s-deployments skill monitors rollout status and replica counts. OpenSRE also maps your Kubernetes services in the knowledge graph for dependency tracking and blast radius analysis.
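The kind of check the k8s-debug skill performs can be sketched in a few lines. The data shape below mirrors a simplified slice of the Kubernetes pod status API, and the restart threshold is an illustrative assumption:

```python
def flag_unhealthy_pods(pods: list[dict]) -> list[tuple[str, str]]:
    """Return (pod, container) pairs that were OOMKilled or restart-looping.

    `pods` mimics a simplified slice of the Kubernetes pod status API;
    the restart threshold is illustrative, not OpenSRE's actual value.
    """
    flagged = []
    for pod in pods:
        for cs in pod.get("container_statuses", []):
            terminated = cs.get("last_state", {}).get("terminated", {})
            if (terminated.get("reason") == "OOMKilled"
                    or cs.get("restart_count", 0) >= 5):
                flagged.append((pod["name"], cs["name"]))
    return flagged
```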
What LLM providers does OpenSRE support?
OpenSRE supports any LLM provider via LiteLLM. Out of the box, it's configured for OpenRouter, which gives access to Claude (Anthropic), GPT-4 (OpenAI), Llama 3 (Meta), and hundreds of other models. You can switch to direct Anthropic or OpenAI APIs, or run a local model via Ollama — just update `litellm_config.yaml`.
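A sketch of what switching providers in `litellm_config.yaml` might look like, using LiteLLM's standard `model_list` format (the aliases and model choices are illustrative):

```yaml
model_list:
  - model_name: investigator            # alias the agents reference
    litellm_params:
      model: openrouter/anthropic/claude-3.5-haiku
      api_key: os.environ/OPENROUTER_API_KEY
  - model_name: investigator-local      # or swap in a local model via Ollama
    litellm_params:
      model: ollama/llama3
      api_base: http://localhost:11434
```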
Can I run OpenSRE without cloud dependencies?
Yes. OpenSRE can run fully air-gapped with local LLMs (via Ollama + LiteLLM), local PostgreSQL, and local Neo4j. All services are containerized. The only external dependencies are your existing monitoring tools (Prometheus, Grafana, etc.) — which you're already running.
General
Is OpenSRE truly open source?
Yes. OpenSRE is released under the Apache 2.0 license, one of the most permissive open-source licenses. You can use it for commercial purposes, modify it, and distribute it. The full source code is on GitHub at https://github.com/swapnildahiphale/OpenSRE.
What license does OpenSRE use?
OpenSRE uses the Apache License 2.0. This is a permissive open-source license that allows you to use, modify, and distribute the software for any purpose, including commercial use, without restriction. You can self-host it in your production environment with no licensing fees.
How can I contribute to OpenSRE?
Contributions are welcome on GitHub at https://github.com/swapnildahiphale/OpenSRE. You can contribute new investigation skills, improve existing integrations, fix bugs, improve documentation, or share feedback via GitHub Issues. The project welcomes skill contributions especially — if you've integrated OpenSRE with a tool we don't support yet, a pull request is the best way to contribute.