OpenSRE's investigation skills are modular capabilities that let AI agents query specific parts of your infrastructure. Each skill bundles the tools and context needed to investigate one domain, such as checking Kubernetes pod health, querying Prometheus metrics, or reading Sentry error traces.
When an investigation subagent needs to check Kubernetes pod status, it calls `load_skill("k8s-debug")`. This loads the skill's tools and domain context into the agent's working context. The agent then calls `run_script` to execute specific checks within that skill.
This progressive loading approach keeps the agent's context window manageable: each skill is loaded on demand, only when an investigation needs it, rather than all at once.
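The load-then-run flow can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not OpenSRE's implementation: the `AgentContext` class, the method signatures, the `skills_root` parameter, and the choice to run scripts as Python files are all hypothetical. It only shows the idea that a skill's description enters the context on first request, and that scripts run against a skill that has already been loaded.

```python
import subprocess
import sys
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class AgentContext:
    """Hypothetical stand-in for an agent's working context."""

    skills_root: Path
    loaded: dict = field(default_factory=dict)  # skill name -> SKILL.md text

    def load_skill(self, name: str) -> str:
        """Read a skill's SKILL.md into the context on first request only."""
        if name not in self.loaded:
            skill_md = self.skills_root / name / "SKILL.md"
            self.loaded[name] = skill_md.read_text(encoding="utf-8")
        return self.loaded[name]

    def run_script(self, skill: str, script: str, *args: str) -> str:
        """Run one script from an already-loaded skill and return its stdout."""
        if skill not in self.loaded:
            raise RuntimeError(f"skill {skill!r} must be loaded before running scripts")
        path = self.skills_root / skill / "scripts" / script
        # Assumption: scripts are Python files; a real runner may support
        # other interpreters or executables.
        result = subprocess.run(
            [sys.executable, str(path), *args],
            capture_output=True,
            text=True,
            check=True,
        )
        return result.stdout
```

In this sketch, a skill that is never requested never occupies space in `loaded`, which is the property the on-demand approach is meant to provide.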
| Skill | What it investigates |
|-------|---------------------|
| k8s-debug | Pod status, restart counts, OOMKilled events |
| k8s-deployments | Deployment health, rollout status, replica counts |
| k8s-nodes | Node resource pressure, disk pressure, readiness |
| k8s-resources | CPU/memory requests vs limits, resource quotas |
| k8s-events | Cluster events filtered by namespace and severity |
| Skill | What it investigates |
|-------|---------------------|
| prometheus | PromQL queries, alert history, metric anomalies |
| grafana | Dashboard panels, alert states, annotations |
| datadog-metrics | Metric queries, monitor states, service health |
| new-relic | APM metrics, error rates, throughput |
| Skill | What it investigates |
|-------|---------------------|
| elastic-logs | Elasticsearch log queries, error pattern analysis |
| splunk | Splunk search queries, alert history |
| cloudwatch-logs | AWS CloudWatch log groups and insights |
| Skill | What it investigates |
|-------|---------------------|
| jaeger | Distributed traces, span analysis, service dependencies |
| datadog-apm | APM traces, flamegraphs, service dependencies |
| sentry | Error events, stack traces, release health |
| Skill | What it investigates |
|-------|---------------------|
| pagerduty | Recent incidents, alert history, on-call schedule |
| opsgenie | Alert timeline, team escalations |
| Skill | What it investigates |
|-------|---------------------|
| slack | Recent messages in incident channels, runbook links |
| github | Recent commits, pull requests, deployment markers |
| confluence | Runbooks, service documentation, post-mortems |
| Skill | What it investigates |
|-------|---------------------|
| dns | DNS resolution, TTLs, recent changes |
| networking | Connectivity checks, latency, packet loss |
You can add custom investigation skills for your specific stack. Skills are defined as directories in `.claude/skills/` following the skill format:

```
.claude/skills/
  my-custom-skill/
    SKILL.md      # Skill description and tools
    scripts/      # Executable scripts
```

The `SKILL.md` file describes what the skill does, what tools it provides, and how to use them. The `scripts/` directory contains the executable scripts that the skill invokes.
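To check that a custom skill follows this layout, a small helper along these lines may be useful. It is a sketch under assumptions rather than part of OpenSRE: the `discover_skills` function is hypothetical, and it treats any subdirectory containing a `SKILL.md` as a skill and any regular file under `scripts/` as a script.

```python
from pathlib import Path


def discover_skills(skills_root: Path) -> dict[str, list[str]]:
    """Map each skill directory that has a SKILL.md to its sorted script names."""
    skills: dict[str, list[str]] = {}
    if not skills_root.is_dir():
        return skills
    for entry in sorted(skills_root.iterdir()):
        # Only directories that contain a SKILL.md count as skills here.
        if not entry.is_dir() or not (entry / "SKILL.md").is_file():
            continue
        scripts_dir = entry / "scripts"
        if scripts_dir.is_dir():
            scripts = sorted(p.name for p in scripts_dir.iterdir() if p.is_file())
        else:
            scripts = []  # a skill without scripts is still listed
        skills[entry.name] = scripts
    return skills
```

Calling `discover_skills(Path(".claude/skills"))` from the repository root would then list each skill this sketch recognizes, which makes a missing `SKILL.md` or an empty `scripts/` directory easy to spot.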
In multi-team deployments, you can enable or disable specific skills per agent via the configuration system:
```json
{
  "agents": {
    "my-agent-id": {
      "skills": {
        "k8s-debug": true,
        "splunk": false
      }
    }
  }
}
```
Disabled skill directories are removed from the agent's working context at session start.
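The effect of this configuration can be sketched as a filter over the available skills. The function names below are hypothetical, and the sketch assumes that a skill not mentioned in the configuration stays enabled; the actual default may differ.

```python
import json


def disabled_skills(config: dict, agent_id: str) -> set[str]:
    """Return the skill names explicitly set to false for the given agent."""
    skills = config.get("agents", {}).get(agent_id, {}).get("skills", {})
    return {name for name, enabled in skills.items() if enabled is False}


def visible_skills(available: list[str], config: dict, agent_id: str) -> list[str]:
    """Filter the available skills down to those the agent should see."""
    # Assumption: skills absent from the config remain enabled.
    hidden = disabled_skills(config, agent_id)
    return [name for name in available if name not in hidden]


config = json.loads("""
{
  "agents": {
    "my-agent-id": {
      "skills": {"k8s-debug": true, "splunk": false}
    }
  }
}
""")

print(visible_skills(["k8s-debug", "splunk", "prometheus"], config, "my-agent-id"))
# prints ['k8s-debug', 'prometheus']
```

Under these assumptions, `splunk` is hidden from `my-agent-id`, while an agent with no entry in the configuration sees every available skill.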