Health assessments

Health assessments provide a structured, rapid overview of your infrastructure’s state. Instead of asking open-ended questions, a health assessment systematically checks key areas and produces a categorized report.

> /health # 1-hour lookback (default)
> /health 15m # Last 15 minutes
> /health 4h # Last 4 hours
> /health 24h # Last 24 hours
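The lookback argument is a number followed by a unit suffix. This page does not describe how the tool parses it, so the following is only a minimal sketch, assuming the `m`, `h`, and `d` suffixes that appear in the examples on this page. `parse_lookback` is a hypothetical helper, not part of the tool.

```python
import re
from datetime import timedelta
from typing import Optional

# Assumption: only the suffixes seen in this page's examples (15m, 4h, 24h, 7d).
_UNITS = {"m": "minutes", "h": "hours", "d": "days"}


def parse_lookback(arg: Optional[str],
                   default: timedelta = timedelta(hours=1)) -> timedelta:
    """Convert a lookback string such as '15m' or '4h' into a timedelta.

    A missing or empty argument falls back to `default` (1 hour here,
    matching the documented /health default).
    """
    if arg is None or not arg.strip():
        return default
    match = re.fullmatch(r"(\d+)([mhd])", arg.strip())
    if match is None:
        raise ValueError(f"unrecognized lookback window: {arg!r}")
    value, unit = int(match.group(1)), match.group(2)
    return timedelta(**{_UNITS[unit]: value})
```

For example, `parse_lookback("15m")` returns `timedelta(minutes=15)`, and `parse_lookback(None)` returns the 1-hour default.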

The health assessment runs through six phases:

  1. Orientation — Check topology/service map tables, recent incidents
  2. Metrics sweep — Error rates, latency (p50/p95/p99), saturation (CPU, memory, disk, connection pools)
  3. Logs check — Sample ERROR and FATAL log entries
  4. Traces check — High-duration spans, error status by service
  5. Recent changes — Deployments, config changes, feature flags
  6. Cost signals — High-cardinality metrics, log volume spikes
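To make the ordering concrete, the six phases can be modeled as an ordered enumeration with one check per phase. This is an illustrative sketch only: `Phase`, `run_phases`, and the callable-per-phase shape are assumptions made for the example and say nothing about how the assessment is implemented internally.

```python
from enum import Enum
from typing import Callable, Dict, List


class Phase(Enum):
    """The six documented phases, in the order they run."""
    ORIENTATION = 1
    METRICS_SWEEP = 2
    LOGS_CHECK = 3
    TRACES_CHECK = 4
    RECENT_CHANGES = 5
    COST_SIGNALS = 6


def run_phases(
    checks: Dict[Phase, Callable[[], List[str]]]
) -> Dict[Phase, List[str]]:
    """Run each phase's check in the documented order.

    `checks` maps a phase to a hypothetical function that returns raw notes.
    Phases without a registered check produce an empty list.
    """
    results: Dict[Phase, List[str]] = {}
    for phase in Phase:  # Enum iteration follows definition order
        check = checks.get(phase)
        results[phase] = check() if check is not None else []
    return results
```

Keeping the phases in a single ordered type means later steps, such as the review that follows data gathering, can rely on every phase being present in the results even when a phase found nothing.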

After the data-gathering phases, a senior review checks the findings for consistency, connects related findings, selects the top 3 action items, and gives a 72-hour outlook.

The assessment produces a structured report with categorized findings:

  • Good — What’s healthy, with supporting numbers
  • Bad — Active problems: what’s affected, severity, duration, evidence
  • Ugly — Concerning but not broken: trends, elevated utilization, elevated error rates
  • Watch Out — Risks and recommendations: capacity, monitoring gaps, single points of failure
  • Cost & Efficiency — High cardinality, log hotspots, large tables
  • Summary Table — Service-by-service status overview

Each finding includes evidence — the specific metric, query result, or log pattern that supports it.
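If you post-process a report, it can help to model a finding as a small record that always carries its evidence. The sketch below is one possible shape, not the tool's actual output schema: the `Finding` class, its field names, and `group_by_category` are assumptions for illustration. The category names are taken from the list above; the summary table is left out because it is a per-service overview rather than a finding category.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List

# Finding categories as listed on this page, in report order.
CATEGORIES = ["Good", "Bad", "Ugly", "Watch Out", "Cost & Efficiency"]


@dataclass(frozen=True)
class Finding:
    category: str   # one of CATEGORIES
    summary: str    # what was observed
    evidence: str   # the metric, query result, or log pattern behind it

    def __post_init__(self) -> None:
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category!r}")
        if not self.evidence.strip():
            raise ValueError("every finding needs supporting evidence")


def group_by_category(findings: Iterable[Finding]) -> Dict[str, List[Finding]]:
    """Group findings under each category, preserving report order."""
    grouped: Dict[str, List[Finding]] = {name: [] for name in CATEGORIES}
    for finding in findings:
        grouped[finding.category].append(finding)
    return grouped
```

Rejecting a finding with empty evidence at construction time mirrors the rule stated above: a claim without a supporting metric, query result, or log pattern does not belong in the report.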

For a dedicated cost analysis, use the /cost command:

> /cost # 24-hour cost analysis (default)
> /cost 7d # 7-day cost analysis

This produces a ranked list of cost drivers separated into quick wins and engineering projects, with ROI estimates.
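This page does not say how the ROI estimates are calculated or where the line between a quick win and an engineering project falls. As a hedged sketch of the general idea, the example below assumes ROI is estimated monthly savings divided by estimated effort, and that anything at or below a chosen effort threshold counts as a quick win. `CostDriver`, `rank_cost_drivers`, and the 2-day threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class CostDriver:
    name: str
    monthly_savings: float  # assumed estimate, in your billing currency
    effort_days: float      # assumed estimate of the work required

    def __post_init__(self) -> None:
        if self.effort_days <= 0:
            raise ValueError("effort_days must be positive")

    @property
    def roi(self) -> float:
        """Assumed ROI definition: savings per day of effort."""
        return self.monthly_savings / self.effort_days


def rank_cost_drivers(
    drivers: Iterable[CostDriver],
    quick_win_max_days: float = 2.0,  # hypothetical threshold
) -> Tuple[List[CostDriver], List[CostDriver]]:
    """Return (quick_wins, engineering_projects), each sorted by ROI, highest first."""
    ranked = sorted(drivers, key=lambda d: d.roi, reverse=True)
    quick_wins = [d for d in ranked if d.effort_days <= quick_win_max_days]
    projects = [d for d in ranked if d.effort_days > quick_win_max_days]
    return quick_wins, projects
```

Sorting once and then splitting keeps both lists in ROI order, so the first entry of each list is the most attractive item in its group.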