Frequently Asked Questions
Everything you need to know about OpenSRE.
For SREs & Platform Engineers
What is OpenSRE?
OpenSRE is an open-source AI SRE platform that investigates production incidents autonomously. It uses LangGraph to orchestrate multiple AI agents that gather context from your observability stack, reason about root causes, and produce structured incident reports. Unlike stateless AI tools, OpenSRE has episodic memory that learns from every past investigation, and a Neo4j knowledge graph that maps your service topology.
How does OpenSRE investigate incidents?
When an alert fires, OpenSRE's planner agent breaks the investigation into parallel subtasks — checking Kubernetes health, querying Prometheus metrics, reading application logs, analyzing distributed traces. Multiple subagent executors run these subtasks simultaneously. A synthesizer combines the findings and produces a structured report with root cause, evidence, and remediation suggestions.
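As a rough sketch, the fan-out/fan-in shape of that pipeline looks like this (subtask names and structure are illustrative, not OpenSRE's actual code):

```python
import asyncio

# Illustrative sketch of planner fan-out; not OpenSRE's actual implementation.
async def run_subtask(name: str) -> dict:
    # A real executor would call an investigation skill here
    # (kubectl, PromQL queries, log searches, trace analysis...).
    await asyncio.sleep(0)  # stand-in for I/O-bound tool calls
    return {"subtask": name, "findings": f"results for {name}"}

async def investigate(alert: str) -> dict:
    subtasks = ["k8s-health", "prometheus-metrics", "app-logs", "traces"]
    # Fan out: all subtasks run concurrently, like OpenSRE's subagent executors.
    results = await asyncio.gather(*(run_subtask(s) for s in subtasks))
    # Fan in: a synthesizer step would turn these findings into a
    # structured report with root cause, evidence, and remediation.
    return {"alert": alert, "findings": list(results)}

report = asyncio.run(investigate("HighErrorRate"))
```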
What investigation skills does OpenSRE have?
OpenSRE has 46 built-in investigation skills covering Kubernetes, Prometheus, Grafana, Datadog, Elastic/ELK, Splunk, Jaeger, Sentry, PagerDuty, Slack, GitHub, and more. Skills are loaded on-demand — agents request the skill they need, preventing context window bloat. You can also add custom skills for tools specific to your stack.
How does episodic memory work in OpenSRE?
After every investigation, OpenSRE extracts metadata — root cause, alert type, affected services, severity, resolution status — and stores it as an episode. Before each new investigation, it retrieves similar past episodes using weighted similarity scoring (alert type, service overlap, resolution status). This context is injected into the investigation planner, so OpenSRE learns from every incident.
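A minimal sketch of what weighted similarity scoring could look like (the weights and field names here are illustrative assumptions, not OpenSRE's actual values):

```python
# Weights are illustrative; OpenSRE's actual scoring may differ.
WEIGHTS = {"alert_type": 0.5, "service_overlap": 0.3, "resolved": 0.2}

def episode_similarity(current: dict, past: dict) -> float:
    """Score a past episode's relevance to the current alert (0.0 to 1.0)."""
    score = 0.0
    if current["alert_type"] == past["alert_type"]:
        score += WEIGHTS["alert_type"]
    cur, old = set(current["services"]), set(past["services"])
    if cur | old:
        # Jaccard overlap of affected services.
        score += WEIGHTS["service_overlap"] * len(cur & old) / len(cur | old)
    if past.get("resolved"):
        # Resolved episodes carry proven remediations, so they rank higher.
        score += WEIGHTS["resolved"]
    return score
```

The top-scoring episodes would then be injected into the planner's context before the new investigation begins.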
What integrations are supported?
OpenSRE integrates with Kubernetes, Prometheus, Grafana, Datadog, New Relic, Elastic/ELK, Splunk, Jaeger, Sentry, PagerDuty, OpsGenie, Slack, GitHub, and Confluence. For LLM providers, it uses LiteLLM as a proxy, supporting Anthropic, OpenAI, OpenRouter, and any compatible provider including self-hosted models.
Can I add custom investigation skills?
Yes. Skills are defined as directories in `.claude/skills/` with a `SKILL.md` describing the skill's tools and context, plus executable scripts. Any tool your engineers use can become an investigation skill — internal monitoring tools, proprietary deployment systems, custom alerting pipelines.
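As a sketch, a custom skill for an internal deployment tool might be laid out like this (directory and file names are illustrative):

```
.claude/skills/internal-deploys/
    SKILL.md          # describes the skill's tools, context, and when to use it
    check_rollout.sh  # executable script the agent can invoke
```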
How does OpenSRE connect to my observability stack?
OpenSRE connects to your stack via integration configuration in `litellm_config.yaml` and the config-service. Each integration has a URL and credentials. During investigations, agents query these integrations directly — Prometheus for metrics, Elasticsearch for logs, Jaeger for traces, Kubernetes API for pod health.
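An illustrative shape for such an integration entry (field names are assumptions; consult the config-service documentation for the exact schema):

```yaml
# Illustrative only -- not the exact OpenSRE schema.
integrations:
  prometheus:
    url: http://prometheus.monitoring:9090
    # credentials typically come from environment variables or secrets
  elasticsearch:
    url: http://elasticsearch.logging:9200
```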
For Engineering Managers
How does OpenSRE reduce MTTR?
OpenSRE reduces MTTR by automating the investigation phase: the portion of incident resolution time (often estimated at 60-80%) spent gathering context, forming hypotheses, and querying tools. It starts investigating the moment an alert fires, runs multiple investigation threads in parallel, and presents the on-call engineer with a fully investigated situation rather than a blank dashboard.
How is OpenSRE different from PagerDuty AI or Rootly AI?
OpenSRE is open-source (Apache 2.0) and self-hosted, while PagerDuty and Rootly are commercial SaaS products. OpenSRE's key differentiators are episodic memory (it learns from every incident) and a Neo4j knowledge graph (it maps your service topology for blast radius analysis), features that, as of this writing, neither commercial tool offers. OpenSRE also supports any LLM provider via LiteLLM, not just one fixed model.
Is OpenSRE production-ready?
OpenSRE is actively used in production environments. It runs as a set of Docker Compose services (or Kubernetes via Helm chart) and has been tested with real production incident investigation scenarios. The core agent pipeline, episodic memory, and knowledge graph components are stable. As with any open-source platform, you're responsible for your own deployment and operations.
What's the cost of running OpenSRE?
OpenSRE itself is free and open-source. The main cost is infrastructure (running PostgreSQL, Neo4j, and the agent services) and LLM API usage. Infrastructure costs are typically $50-200/month on a small cloud instance. LLM costs depend on investigation volume and which model you use — Claude Haiku or similar efficient models keep costs low.
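A back-of-envelope estimate, where every number is an illustrative assumption rather than a measured OpenSRE cost:

```python
# All figures below are illustrative assumptions, not measured costs.
investigations_per_month = 100
tokens_per_investigation = 200_000   # summed across all agent turns
cost_per_million_tokens = 1.0        # e.g. an efficient model like Claude Haiku

llm_cost = (investigations_per_month * tokens_per_investigation
            / 1_000_000 * cost_per_million_tokens)
infra_cost = 100                     # midpoint of the $50-200/month range above

total = llm_cost + infra_cost        # monthly total in USD
```

With these assumptions the LLM bill is a small fraction of the total; your own volume and model choice will dominate the real number.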
Can OpenSRE replace our on-call rotation?
OpenSRE augments on-call engineers rather than replacing them. It handles the investigation phase autonomously: gathering context, forming hypotheses, identifying probable root causes. The on-call engineer validates the findings and implements fixes. For low-severity, well-understood incidents with clear runbooks, you can configure automated remediation, but human oversight remains required for production changes.
For DevOps Generalists
How do I set up OpenSRE?
Clone the repository, create a `.env` file with your `OPENROUTER_API_KEY`, and run `make dev`. This starts all services including PostgreSQL, Neo4j, the sre-agent, and the web console. The web console is available at http://localhost:3002. Full setup guide at /docs/quick-start.
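For reference, the `.env` file can be as small as one line (the value below is a placeholder, not a real key):

```
OPENROUTER_API_KEY=sk-or-your-key-here
```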
What infrastructure does OpenSRE need?
OpenSRE requires Docker and Docker Compose for local development. In production, it runs on Kubernetes via a Helm chart. Core dependencies are PostgreSQL (for episodic memory and config), Neo4j (for the knowledge graph), and a LiteLLM instance (for LLM routing). A small setup runs comfortably on 4 vCPUs and 8GB RAM.
Does OpenSRE work with Kubernetes?
Yes — Kubernetes investigation is one of OpenSRE's primary use cases. The k8s-debug skill checks pod health, restart counts, and OOMKilled events. The k8s-deployments skill monitors rollout status and replica counts. OpenSRE also maps your Kubernetes services in the knowledge graph for dependency tracking and blast radius analysis.
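The kind of check the k8s-debug skill performs can be sketched in a few lines. The data shape below mirrors a simplified slice of the Kubernetes pod status API, and the restart threshold is an illustrative assumption:

```python
def flag_unhealthy_pods(pods: list[dict]) -> list[tuple[str, str]]:
    """Return (pod, container) pairs that were OOMKilled or restart-looping.

    `pods` mimics a simplified slice of the Kubernetes pod status API;
    the restart threshold is illustrative, not OpenSRE's actual value.
    """
    flagged = []
    for pod in pods:
        for cs in pod.get("container_statuses", []):
            terminated = cs.get("last_state", {}).get("terminated", {})
            if (terminated.get("reason") == "OOMKilled"
                    or cs.get("restart_count", 0) >= 5):
                flagged.append((pod["name"], cs["name"]))
    return flagged
```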
What LLM providers does OpenSRE support?
OpenSRE supports any LLM provider via LiteLLM. Out of the box, it's configured for OpenRouter, which gives access to Claude (Anthropic), GPT-4 (OpenAI), Llama 3 (Meta), and hundreds of other models. You can switch to direct Anthropic or OpenAI APIs, or run a local model via Ollama — just update `litellm_config.yaml`.
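A sketch of what switching providers in `litellm_config.yaml` might look like, using LiteLLM's standard `model_list` format (the aliases and model choices are illustrative):

```yaml
model_list:
  - model_name: investigator            # alias the agents reference
    litellm_params:
      model: openrouter/anthropic/claude-3.5-haiku
      api_key: os.environ/OPENROUTER_API_KEY
  - model_name: investigator-local      # or swap in a local model via Ollama
    litellm_params:
      model: ollama/llama3
      api_base: http://localhost:11434
```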
Can I run OpenSRE without cloud dependencies?
Yes. OpenSRE can run fully air-gapped with local LLMs (via Ollama + LiteLLM), local PostgreSQL, and local Neo4j. All services are containerized. The only external dependencies are your existing monitoring tools (Prometheus, Grafana, etc.) — which you're already running.
General
Is OpenSRE truly open source?
Yes. OpenSRE is released under the Apache 2.0 license, one of the most permissive open-source licenses. You can use it for commercial purposes, modify it, and distribute it. The full source code is on GitHub at https://github.com/swapnildahiphale/OpenSRE.
What license does OpenSRE use?
OpenSRE uses the Apache License 2.0. This is a permissive open-source license that allows you to use, modify, and distribute the software for any purpose, including commercial use, without restriction. You can self-host it in your production environment with no licensing fees.
How can I contribute to OpenSRE?
Contributions are welcome on GitHub at https://github.com/swapnildahiphale/OpenSRE. You can contribute new investigation skills, improve existing integrations, fix bugs, improve documentation, or share feedback via GitHub Issues. The project welcomes skill contributions especially — if you've integrated OpenSRE with a tool we don't support yet, a pull request is the best way to contribute.