Health assessments
Health assessments provide a structured, rapid overview of your infrastructure’s state. Instead of asking open-ended questions, a health assessment systematically checks key areas and produces a categorized report.
Running a health assessment
> /health        # 1-hour lookback (default)
> /health 15m    # Last 15 minutes
> /health 4h     # Last 4 hours
> /health 24h    # Last 24 hours
What it checks
The health assessment runs through six phases:
- Orientation — Check topology/service map tables, recent incidents
- Metrics sweep — Error rates, latency (p50/p95/p99), saturation (CPU, memory, disk, connection pools)
- Logs check — Sample ERROR and FATAL log entries
- Traces check — High-duration spans, error status by service
- Recent changes — Deployments, config changes, feature flags
- Cost signals — High-cardinality metrics, log volume spikes
After the data-gathering phases, a senior review checks the findings for consistency, connects related findings, selects the top 3 action items, and gives a 72-hour outlook.
Report format
The assessment produces a structured report with categorized findings:
- Good — What’s healthy, with supporting numbers
- Bad — Active problems: what’s affected, severity, duration, evidence
- Ugly — Concerning but not broken: trends, elevated utilization, error elevation
- Watch Out — Risks and recommendations: capacity, monitoring gaps, single points of failure
- Cost & Efficiency — High cardinality, log hotspots, large tables
- Summary Table — Service-by-service status overview
Each finding includes evidence — the specific metric, query result, or log pattern that supports it.
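The report structure described above can be modeled as a small data type. The following is a minimal sketch, not the tool's actual schema: the `Category` enum, the `Finding` class, its field names, and the `group_by_category` helper are all hypothetical, and the sample values are invented for illustration. It mirrors the finding categories listed above (the Summary Table is an overview rather than a category, so it is left out) and rejects any finding that lacks evidence.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    # Names mirror the report sections; the enum itself is hypothetical.
    GOOD = "Good"
    BAD = "Bad"
    UGLY = "Ugly"
    WATCH_OUT = "Watch Out"
    COST = "Cost & Efficiency"


@dataclass(frozen=True)
class Finding:
    category: Category
    service: str
    summary: str
    evidence: str  # the metric, query result, or log pattern behind the finding

    def __post_init__(self):
        # Every finding must be backed by evidence.
        if not self.evidence.strip():
            raise ValueError("every finding must include evidence")


def group_by_category(findings):
    """Return a dict mapping each Category to its findings, in report order."""
    grouped = {category: [] for category in Category}
    for finding in findings:
        grouped[finding.category].append(finding)
    return grouped


# Illustrative values only, not real assessment output.
findings = [
    Finding(Category.GOOD, "checkout", "p99 latency stable", "p99 = 180 ms over 1h"),
    Finding(Category.BAD, "search", "elevated 5xx rate", "5xx = 4.2% for 25 min"),
]
report = group_by_category(findings)
```

Grouping into a dict keyed by the enum keeps the sections in the same order as the report, including sections that have no findings.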
Cost analysis
For a dedicated cost analysis, use the /cost command:
> /cost       # 24-hour cost analysis (default)
> /cost 7d    # 7-day cost analysis
This produces a ranked list of cost drivers, separated into quick wins and engineering projects, with ROI estimates.
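Both /health and /cost take an optional lookback window such as 15m, 4h, 24h, or 7d. As a rough illustration of how such a window string could be interpreted, here is a minimal sketch. It assumes only the m, h, and d suffixes seen in the examples on this page; the `parse_lookback` function is hypothetical and is not the tool's actual parser, which may accept other formats.

```python
import re
from datetime import timedelta

# Assumed units: only m, h, and d appear in the documented examples.
_UNITS = {"m": "minutes", "h": "hours", "d": "days"}
_PATTERN = re.compile(r"(\d+)([mhd])")


def parse_lookback(text, default):
    """Convert a window such as '15m', '4h', or '7d' into a timedelta.

    Returns `default` when no window is given.
    """
    if text is None or text.strip() == "":
        return default
    match = _PATTERN.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unrecognized lookback window: {text!r}")
    amount, unit = int(match.group(1)), match.group(2)
    if amount == 0:
        raise ValueError("lookback window must be greater than zero")
    return timedelta(**{_UNITS[unit]: amount})


# Defaults taken from the command examples: 1 hour for /health, 24 hours for /cost.
HEALTH_DEFAULT = timedelta(hours=1)
COST_DEFAULT = timedelta(hours=24)

health_window = parse_lookback("15m", HEALTH_DEFAULT)
cost_window = parse_lookback(None, COST_DEFAULT)
```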