Health assessments

Health assessments provide a structured, rapid overview of your infrastructure’s state. Instead of relying on open-ended questions, a health assessment systematically checks key areas and produces a categorized report.

Running a health assessment

> /health          # 1-hour lookback (default)
> /health 15m      # Last 15 minutes
> /health 4h       # Last 4 hours
> /health 24h      # Last 24 hours
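
The lookback argument is a number followed by a unit. As a rough illustration of how such strings map to a time window, here is a minimal Python sketch. The function name `parse_lookback` and the unit table are assumptions made for this example, not the tool's actual implementation; only the `m`, `h`, and `d` suffixes appear in the documented commands, so other units are rejected.

```python
import re
from datetime import timedelta

# Hypothetical unit table: only m, h, and d appear in the documented examples.
_UNITS = {"m": "minutes", "h": "hours", "d": "days"}


def parse_lookback(arg: str = "1h") -> timedelta:
    """Convert a lookback string such as '15m' or '24h' into a timedelta.

    The '1h' default mirrors the documented default for /health.
    """
    match = re.fullmatch(r"(\d+)([mhd])", arg.strip())
    if match is None:
        raise ValueError(f"unrecognized lookback: {arg!r}")
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})


print(parse_lookback("15m"))  # 0:15:00
print(parse_lookback("4h"))   # 4:00:00
```

Rejecting unknown input with an error, rather than silently falling back to the default, keeps a typo such as `/health 15` from producing a report over an unintended window.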

What it checks

The health assessment runs through six phases:

Orientation — Check topology/service map tables, recent incidents
Metrics sweep — Error rates, latency (p50/p95/p99), saturation (CPU, memory, disk, connection pools)
Logs check — Sample ERROR and FATAL log entries
Traces check — High-duration spans, error status by service
Recent changes — Deployments, config changes, feature flags
Cost signals — High-cardinality metrics, log volume spikes

After the data-gathering phases, a senior review checks the findings for consistency, connects related findings, picks the top 3 action items, and gives a 72-hour outlook.

Report format

The assessment produces a structured report with categorized findings:

Good — What’s healthy, with supporting numbers
Bad — Active problems: what’s affected, severity, duration, evidence
Ugly — Concerning but not broken: trends, elevated utilization, error elevation
Watch Out — Risks and recommendations: capacity, monitoring gaps, single points of failure
Cost & Efficiency — High cardinality, log hotspots, large tables
Summary Table — Service-by-service status overview

Each finding includes evidence — the specific metric, query result, or log pattern that supports it.
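
If you post-process reports in your own tooling, the categories above can be modeled as a small data structure. The sketch below is illustrative only: the `Finding` fields, the `group_findings` helper, and the sample values are hypothetical, and the tool's real report schema may differ.

```python
from collections import defaultdict
from dataclasses import dataclass

# Category names come from the report format above, in report order.
CATEGORIES = ["Good", "Bad", "Ugly", "Watch Out", "Cost & Efficiency"]


@dataclass
class Finding:
    category: str  # one of CATEGORIES
    service: str   # hypothetical field: the affected service
    summary: str   # short description of the finding
    evidence: str  # metric, query result, or log pattern backing it


def group_findings(findings):
    """Group findings by category, preserving the report's category order."""
    grouped = defaultdict(list)
    for finding in findings:
        if finding.category not in CATEGORIES:
            raise ValueError(f"unknown category: {finding.category!r}")
        grouped[finding.category].append(finding)
    # Emit only categories that have findings, in report order.
    return {c: grouped[c] for c in CATEGORIES if c in grouped}


# Made-up sample data, for illustration only.
sample = [
    Finding("Bad", "checkout", "Elevated 5xx rate", "error rate above baseline"),
    Finding("Good", "search", "Latency within target", "p99 under threshold"),
]
for category, items in group_findings(sample).items():
    print(category, [f.service for f in items])
```

Keeping `evidence` as a required field reflects the rule that every finding must be backed by a specific metric, query result, or log pattern.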

Cost analysis

For a dedicated cost analysis, use the /cost command:

> /cost           # 24-hour cost analysis (default)
> /cost 7d        # 7-day cost analysis

This produces a ranked list of cost drivers separated into quick wins and engineering projects, with ROI estimates.
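
To make the shape of that output concrete, here is a minimal Python sketch of ranking cost drivers and splitting them into quick wins and engineering projects. Everything in it is an assumption for illustration: the `CostDriver` fields, the ROI definition (estimated monthly savings per day of effort), the 2-day quick-win threshold, and the sample numbers. How /cost actually computes ROI is not documented here.

```python
from dataclasses import dataclass


@dataclass
class CostDriver:
    name: str
    monthly_savings: float  # hypothetical: estimated savings if addressed
    effort_days: float      # hypothetical: estimated engineering effort

    def __post_init__(self):
        if self.effort_days <= 0:
            raise ValueError("effort_days must be positive")

    @property
    def roi(self) -> float:
        # Assumed ROI definition: savings per day of effort.
        return self.monthly_savings / self.effort_days


def split_drivers(drivers, quick_win_max_days=2.0):
    """Rank drivers by ROI (highest first) and split them by effort."""
    ranked = sorted(drivers, key=lambda d: d.roi, reverse=True)
    quick_wins = [d for d in ranked if d.effort_days <= quick_win_max_days]
    projects = [d for d in ranked if d.effort_days > quick_win_max_days]
    return quick_wins, projects


# Made-up sample data, for illustration only.
drivers = [
    CostDriver("debug log volume", monthly_savings=900, effort_days=1),
    CostDriver("high-cardinality label", monthly_savings=4000, effort_days=20),
    CostDriver("unused table retention", monthly_savings=300, effort_days=2),
]
quick_wins, projects = split_drivers(drivers)
print([d.name for d in quick_wins])
print([d.name for d in projects])
```

Ranking before splitting means each group stays ordered by ROI, so the first item in either list is the best candidate of its kind.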