Investigation Skills

OpenSRE's investigation skills are modular capabilities that let AI agents query specific parts of your infrastructure. Each skill encapsulates the tools and context needed to investigate one domain — from checking Kubernetes pod health to querying Prometheus metrics to reading Sentry error traces.

How Skills Work

When an investigation subagent needs to check Kubernetes pod status, it calls load_skill("k8s-debug"). This loads the skill's tools and domain context into the agent's working context. The agent then calls run_script to execute specific checks within that skill.

This progressive loading keeps the agent's context window manageable: each skill is loaded on demand, when the agent actually needs it, rather than all skills being loaded up front.
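The on-demand behaviour described above can be sketched in a few lines of Python. This is an illustration only, not OpenSRE's implementation: the names load_skill and run_script come from the text, but the AgentContext class, both method signatures, the caching, and the choice to run scripts as Python subprocesses are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from pathlib import Path
import subprocess
import sys


@dataclass
class AgentContext:
    """Hypothetical stand-in for an agent's working context."""

    skills_root: Path
    # skill name -> SKILL.md text; starts empty, nothing is loaded up front
    loaded: dict[str, str] = field(default_factory=dict)

    def load_skill(self, name: str) -> str:
        """Read a skill's SKILL.md the first time it is requested, then cache it."""
        if name not in self.loaded:
            self.loaded[name] = (self.skills_root / name / "SKILL.md").read_text()
        return self.loaded[name]

    def run_script(self, skill: str, script: str, *args: str) -> str:
        """Run a Python script from a loaded skill's scripts/ directory (assumed layout)."""
        if skill not in self.loaded:
            raise RuntimeError(f"skill {skill!r} must be loaded before running scripts")
        path = self.skills_root / skill / "scripts" / script
        result = subprocess.run(
            [sys.executable, str(path), *args],
            capture_output=True,
            text=True,
            check=True,
        )
        return result.stdout
```

In this sketch the context holds nothing until load_skill is called, which is the property the progressive loading approach relies on: an investigation that only touches Kubernetes never pays the context cost of the logging or tracing skills.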

Skill Categories

Kubernetes

| Skill | What it investigates |
|-------|---------------------|
| k8s-debug | Pod status, restart counts, OOMKilled events |
| k8s-deployments | Deployment health, rollout status, replica counts |
| k8s-nodes | Node resource pressure, disk pressure, readiness |
| k8s-resources | CPU/memory requests vs limits, resource quotas |
| k8s-events | Cluster events filtered by namespace and severity |

Metrics and Monitoring

| Skill | What it investigates |
|-------|---------------------|
| prometheus | PromQL queries, alert history, metric anomalies |
| grafana | Dashboard panels, alert states, annotations |
| datadog-metrics | Metric queries, monitor states, service health |
| new-relic | APM metrics, error rates, throughput |

Logs

| Skill | What it investigates |
|-------|---------------------|
| elastic-logs | Elasticsearch log queries, error pattern analysis |
| splunk | Splunk search queries, alert history |
| cloudwatch-logs | AWS CloudWatch log groups and insights |

Distributed Tracing and APM

| Skill | What it investigates |
|-------|---------------------|
| jaeger | Distributed traces, span analysis, service dependencies |
| datadog-apm | APM traces, flamegraphs, service dependencies |
| sentry | Error events, stack traces, release health |

Alerting and Incident Management

| Skill | What it investigates |
|-------|---------------------|
| pagerduty | Recent incidents, alert history, on-call schedule |
| opsgenie | Alert timeline, team escalations |

Communication and Documentation

| Skill | What it investigates |
|-------|---------------------|
| slack | Recent messages in incident channels, runbook links |
| github | Recent commits, pull requests, deployment markers |
| confluence | Runbooks, service documentation, post-mortems |

Infrastructure

| Skill | What it investigates |
|-------|---------------------|
| dns | DNS resolution, TTLs, recent changes |
| networking | Connectivity checks, latency, packet loss |

Adding Custom Skills

You can add custom investigation skills for your specific stack. Skills are defined as directories in .claude/skills/ following the skill format:

.claude/skills/
  my-custom-skill/
    SKILL.md       # Skill description and tools
    scripts/       # Executable scripts

The SKILL.md file describes what the skill does, what tools it provides, and how to use them. The scripts/ directory contains the executable scripts that the skill invokes.
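As a starting point, the layout above can be generated with a short Python helper. This is a sketch under stated assumptions: the scaffold_skill function, the SKILL.md template text, and the check.sh placeholder script are all hypothetical. The required contents of SKILL.md are not specified here, so treat the template as a stub to replace.

```python
from pathlib import Path

# Placeholder SKILL.md body; the real skill format may require other fields.
SKILL_TEMPLATE = """\
# {name}

{description}

Scripts:
- scripts/check.sh: placeholder check (hypothetical)
"""


def scaffold_skill(root: Path, name: str, description: str) -> Path:
    """Create <root>/.claude/skills/<name>/ containing SKILL.md and scripts/."""
    skill_dir = root / ".claude" / "skills" / name
    scripts_dir = skill_dir / "scripts"
    scripts_dir.mkdir(parents=True, exist_ok=True)

    (skill_dir / "SKILL.md").write_text(
        SKILL_TEMPLATE.format(name=name, description=description)
    )

    # Placeholder script; replace with a real check for your stack.
    check = scripts_dir / "check.sh"
    check.write_text('#!/bin/sh\necho "TODO: implement check"\n')
    check.chmod(0o755)  # mark the script as executable
    return skill_dir
```

Calling scaffold_skill(Path("."), "my-custom-skill", "Checks the widget queue.") from the repository root would produce the directory tree shown above, with one stub script inside scripts/.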

Skills Filtering

In multi-team deployments, you can enable or disable specific skills per agent via the configuration system:

{
  "agents": {
    "my-agent-id": {
      "skills": {
        "k8s-debug": true,
        "splunk": false
      }
    }
  }
}

Disabled skill directories are removed from the agent's working context at session start.
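
To make the effect of this configuration concrete, the following Python sketch resolves which skills remain available to an agent. It is an illustration rather than OpenSRE's actual filtering code: the enabled_skills function is hypothetical, and it assumes that a skill not listed under an agent's "skills" map stays enabled, which the text above does not state.

```python
import json


def enabled_skills(config: dict, agent_id: str, available: list[str]) -> list[str]:
    """Return the skills from `available` that stay enabled for `agent_id`.

    Assumption: a skill not mentioned in the agent's "skills" map stays enabled.
    """
    flags = config.get("agents", {}).get(agent_id, {}).get("skills", {})
    return [name for name in available if flags.get(name, True)]


config = json.loads("""
{
  "agents": {
    "my-agent-id": {
      "skills": {"k8s-debug": true, "splunk": false}
    }
  }
}
""")

available = ["k8s-debug", "prometheus", "splunk"]
print(enabled_skills(config, "my-agent-id", available))
# prints ['k8s-debug', 'prometheus']
```

Under that default-enabled assumption, only splunk is filtered out for my-agent-id, and an agent with no entry in the configuration keeps every skill.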