Running Investigations

Master the art of AI-powered production operations with NeuBird MCP. This guide covers both alert-based and manual investigations.
Quick Reference
```mermaid
graph LR
  A[List Uninvestigated] --> B[Investigate Alert]
  B --> C[Monitor Progress]
  C --> D[Get RCA]
  D --> E[Implement Actions]
  E --> F[Follow-up Questions]
  F --> G[Close Incident]
```
Finding Alerts to Investigate
List Uninvestigated Alerts
Ask Claude:

Show me uninvestigated alerts from the last 24 hours

Uses neubird_list_sessions with only_uninvestigated=true.
Filtering options:
Show me uninvestigated P1 alerts from the last 7 days
Find uninvestigated database alerts
Show me alerts from api-service in the last hour

Getting Links to Investigations
After listing investigations, you can request a direct link to view them in the NeuBird web interface:
By list number:
Link for #2

By session ID:

Link for session ID abc-123-def-456

This provides a URL to open the investigation directly in your browser, useful for:
- Sharing investigations with team members
- Viewing investigations in the full web interface
- Bookmarking specific investigations
- Including in incident reports or documentation
Understanding Alert Information
Each alert shows:
| Field | Description |
|---|---|
| Alert ID | Unique identifier for investigation |
| Title | Incident description |
| Severity | P1 (critical) to P4 (low) |
| Timestamp | When alert fired |
| Source | Monitoring tool (CloudWatch, Datadog, etc.) |
Important: The alert_id (shown as incident_info.id) is what you use to start an investigation.
Starting an Investigation
Basic Investigation

Investigate alert ID: /subscriptions/.../alerts/cpu-spike-123

Or simply:
Investigate the first alert

Claude will:

- Extract the alert_id
- Call neubird_investigate_alert
- Monitor progress using neubird_get_investigation_status
- Retrieve RCA when complete
Manual Investigations
Create investigations from custom prompts without needing an existing alert.
When to Use Manual Investigations
- Proactive analysis - Investigate potential issues before they become alerts
- Historical research - Analyze past incidents not captured as alerts
- What-if scenarios - Test hypothetical situations
- Training - Create example investigations for documentation
- Testing - Validate system behavior without waiting for real alerts
Creating a Manual Investigation
Good prompt (specific):

Investigate high latency in payment-service between 2pm-3pm EST on Jan 15, 2025.
Users reported checkout taking 30+ seconds instead of the usual 2 seconds.

Bad prompt (too vague):
Something went wrong with the payment service

Claude will use neubird_create_manual_investigation:

✓ Created manual investigation
Session UUID: abc-123-def-456
Status: Running

Investigation will complete in 2-5 minutes.
Use neubird_get_rca to retrieve results.

Prompt Quality Guidelines
✅ Include these details:
- Service/Resource names - Which components are involved?
- Time frame - When did this occur or should be analyzed?
- Symptoms - What behavior are you seeing?
- Context - Any additional relevant information?
❌ Avoid:
- Vague descriptions (“something broke”)
- No timeframe (“recently”, “earlier”)
- Missing service names (“the API”, “the database”)
Example prompts:
Investigate memory leak in user-api pods in production namespace.
Started around 8am UTC today, memory usage climbing from 500MB to 2GB.

Analyze database connection pool exhaustion in postgres-primary.
Between 1pm-2pm EST yesterday. Connection count hit max of 100.

Check for cascading failures in microservices during deployment.
January 10, 2025 at 3:15pm UTC. Multiple services returned 503 errors.

Agent Behavior
If your prompt is too vague, Claude will ask clarifying questions:
You: Investigate the payment issue
Claude: I need more details to create an effective investigation:
- Which payment service? (payment-api, payment-processor, etc.)
- When did this occur? (specific date/time or time range)
- What symptoms did you observe? (errors, slow response, failures?)
- Any other context? (deployment, traffic spike, etc.)

Workflow
Complete manual investigation workflow:
1. Create investigation Investigate API latency spike in checkout-service on Jan 15 between 2-3pm EST
2. Wait for completion (2-5 minutes) Check status with: What's the status of this investigation?
3. Get RCA Show me the root cause analysis
4. Ask follow-ups What caused the latency spike? Were there any related infrastructure changes?

Parameters
Section titled “Parameters”The tool accepts:
| Parameter | Type | Default | Description |
|---|---|---|---|
| prompt | string | required | Investigation description (min 10 chars) |
| project_uuid | string | optional | Uses default project if not specified |
Using Default Project
If you’ve set a default project with neubird_set_default_project, you don’t need to specify project_uuid. Manual investigations will automatically use your default project.
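For illustration, a call to neubird_create_manual_investigation might carry parameters shaped like the sketch below. This is an assumption about the wire format (your MCP client handles the actual encoding), and the UUID is a placeholder:

```json
{
  "tool": "neubird_create_manual_investigation",
  "arguments": {
    "prompt": "Investigate high latency in payment-service between 2pm-3pm EST on Jan 15, 2025.",
    "project_uuid": "abc-123-def-456"
  }
}
```

Omit project_uuid entirely to fall back to your default project.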
Comparison with Alert-Based Investigations
| Feature | Alert-Based | Manual |
|---|---|---|
| Trigger | Existing alert | Custom prompt |
| Use Case | React to alerts | Proactive analysis |
| Alert ID | Required | Not needed |
| Prompt | Auto-generated | You provide |
| Flexibility | Limited to alert data | Any scenario |
Understanding the RCA
RCA Structure

Every RCA includes:
1. Incident Summary
- What happened
- When it happened
- Current status
2. Timeline
- Chronological event sequence
- Key state changes
- System behaviors
3. Root Cause
- Why it happened
- Contributing factors
- Technical details
4. Corrective Actions
- Immediate fixes (often auto-executed)
- Manual actions needed
- Ready-to-execute bash scripts
5. Time Savings
- Manual investigation time estimate
- Actual NeuBird time
- Time saved
Asking Follow-Up Questions
Common Follow-Ups

Understanding the cause:
Why did this happen?
Has this happened before?
What changed recently?

Investigation details:

What data sources were checked?
Show me the logs that were analyzed
What queries were run?

Prevention:

How can we prevent this?
What monitoring should we add?
What tests would catch this?

Similar incidents:

Have we seen similar issues?
Is this related to recent deployments?
Are other services affected?

Follow-Up Workflow
# After getting RCA
Show me the chain of thought for this investigation

Uses neubird_get_chain_of_thought to show reasoning steps.
What data sources were consulted?

Uses neubird_get_investigation_sources.

What queries were executed?

Uses neubird_get_investigation_queries.

What are suggested follow-up questions?

Uses neubird_get_follow_up_suggestions.
Implementing Corrective Actions
One of the most powerful features of NeuBird MCP is that you can ask your AI coding agent to directly implement the recommended fixes. Your agent has full context of both your codebase and NeuBird’s analysis.
The Power of MCP Integration
Because NeuBird runs as an MCP server, your coding agent (Claude Code, Claude Desktop, Cursor, etc.) can:
- Get the RCA with detailed corrective actions
- Understand your codebase through file access
- Implement fixes directly in your code
- Execute commands through its shell access
- Verify the fix by checking logs and metrics
This creates a seamless incident response workflow where investigation and remediation happen in the same conversation.
Workflow: From Investigation to Implementation
Step 1: Get the RCA

Show me the RCA for this investigation

Step 2: Ask your agent to implement the fix

Please implement the corrective actions from this RCA.
Start with the highest priority items.

Step 3: Your agent will:
- Read the relevant code files
- Make the necessary changes
- Run tests to verify
- Execute deployment commands (if appropriate)
- Confirm the changes
Example: Implementing a Database Index
Investigation finds: Missing index causing slow queries
RCA provides:
CREATE INDEX CONCURRENTLY idx_orders_user_id ON orders(user_id);

You ask your agent:

Please add this database index. Use a migration file following our project's migration pattern.

Your agent will:
- Check existing migration files to understand the pattern
- Create a new migration file (e.g., 20250124_add_orders_user_id_index.sql)
- Add the CREATE INDEX statement with proper syntax
- Add a corresponding DROP INDEX for rollback
- Update the migration tracking
- Suggest running the migration command
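Such a migration might look like the sketch below. The filename and the up/down comment convention are assumptions about your project's migration tooling, and CONCURRENTLY is PostgreSQL-specific:

```sql
-- 20250124_add_orders_user_id_index.sql (hypothetical filename)

-- Up: build the index without blocking writes to the table
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_orders_user_id ON orders (user_id);

-- Down (rollback):
-- DROP INDEX CONCURRENTLY IF EXISTS idx_orders_user_id;
```

Note that PostgreSQL cannot run CREATE INDEX CONCURRENTLY inside a transaction block, so the migration runner must execute it outside one.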
Example: Fixing a Configuration Issue
Investigation finds: Connection pool too small for traffic
RCA provides:
kubectl set env deployment/api-service MAX_CONNECTIONS=200

You ask your agent:

Please update the MAX_CONNECTIONS configuration to 200.
Update both the deployment manifest and our docs.

Your agent will:
- Find the deployment YAML file
- Update the environment variable
- Update documentation to reflect the change
- Suggest applying with kubectl apply -f deployment.yaml
- Offer to add monitoring for connection pool usage
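The manifest change itself might look like this fragment; the deployment and container names are illustrative, not taken from your cluster:

```yaml
# deployment.yaml (fragment; names are illustrative)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service
spec:
  template:
    spec:
      containers:
        - name: api-service
          env:
            - name: MAX_CONNECTIONS
              value: "200"  # raised per the RCA recommendation
```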
Example: Implementing a Code Fix
Investigation finds: Memory leak from unclosed database connections
RCA explains: Connections not being released in error paths
You ask your agent:
Please fix the connection leak in the payment service.
Make sure all code paths properly close connections.

Your agent will:
- Read the payment service code
- Identify all database connection usage
- Add proper try/finally blocks or context managers
- Ensure connections are closed in error paths
- Add tests to verify the fix
- Run existing tests to ensure no regressions
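As a sketch of the pattern the agent would apply (using sqlite3 as a stand-in driver and an illustrative function name, not the payment service's real code), a context manager guarantees the connection is released on every code path:

```python
import sqlite3
from contextlib import closing

def fetch_payment(db_path: str, payment_id: int):
    # closing() calls conn.close() when the block exits,
    # including on exceptions raised mid-query -- this is
    # what plugs the leak in error paths.
    with closing(sqlite3.connect(db_path)) as conn:
        cur = conn.execute(
            "SELECT id, amount FROM payments WHERE id = ?", (payment_id,)
        )
        return cur.fetchone()
```

The same shape works with try/finally in drivers that lack context-manager support.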
Preventive Measures
After fixing the immediate issue, implement preventive measures:

Please implement the preventive measures from the RCA:
1. Add monitoring for connection pool usage
2. Add alerts for connection pool >80% full
3. Update our runbook with this scenario

Best Practices for Agent-Driven Remediation
✅ Do:
- Review the RCA thoroughly before asking for implementation
- Be specific about your project’s patterns and conventions
- Ask the agent to run tests after making changes
- Request documentation updates alongside code changes
- Have the agent explain what it’s doing
- Use staging environments for testing changes
❌ Don’t:
- Blindly apply changes to production without review
- Skip testing steps
- Implement changes you don’t understand
- Ignore the agent’s warnings or questions
- Rush through validation steps
Manual Review Required
Some actions require human judgment and shouldn’t be fully automated:
Deployment decisions:
- Rolling back to previous versions
- Scaling production resources
- Database schema changes
- Infrastructure modifications
For these, ask your agent to:
Please prepare the rollback command but don't execute it yet.
Show me what will happen and wait for my approval.

Bash Scripts in RCA
RCAs include ready-to-run commands for immediate fixes:
```bash
# Example: Add database index
psql -h prod-db.rds.amazonaws.com -U admin -d production
CREATE INDEX CONCURRENTLY idx_orders_user_id ON orders(user_id);
```

You can execute these directly or ask your agent to integrate them into your workflow:
Please add this as a migration following our standard process.

Validation Checklist
Before executing any fix, verify:
- Commands are safe (no destructive operations)
- Correct environment (prod vs staging)
- Proper permissions and access
- Backup/rollback plan in place
- Tests pass after changes
- Changes follow project conventions
- Documentation is updated
Advanced Investigation Techniques
Filtering by Severity

Show me only P1 and P2 uninvestigated alerts

Filtering by Date Range

Show me uninvestigated alerts from January 1-15

Searching by Keyword

Find alerts related to "database timeout"

Uses the search_term parameter in neubird_list_sessions.
Compact Mode
For large result sets:

Show me all uninvestigated alerts in compact format

Returns minimal details for faster browsing.
Investigation Status Tracking
Check ongoing investigations:

Show me the status of session abc-123-def-456

Uses neubird_get_investigation_status.
Real-Time Progress Monitoring
NeuBird provides live streaming updates during investigations, giving you visibility into what’s happening at each step.
Understanding Progress Updates
Section titled “Understanding Progress Updates”When an investigation is running, neubird_get_investigation_status provides:
Current Progress:
- progress_percentage: 0-100 indicating completion based on completed steps
- current_step: Human-readable description of what’s happening now
- status: Investigation state (in_progress, completed, failed)
Investigation Breakdown:
- total_steps: Estimated number of investigation steps
- completed_steps: How many steps have finished
- unique_sources: All data sources consulted (logs, metrics, alarms)
Step Details: Each step includes:
- step_id: Unique identifier for detailed lookups
- step_number: Sequential step number
- description: What question is being answered
- category: Type of investigation step (see below)
- status: Step state (in_progress, completed, possible_question)
- sources_consulted: Which data sources contributed to this step
Investigation Step Categories
Each step is categorized by its purpose:
| Category | Purpose | Example |
|---|---|---|
| discovery | Initial fact-finding and data gathering | "What was the system.load.1 metric value?" |
| analysis | Examining patterns and correlations | "Analyzing correlation between CPU and memory" |
| diagnosis | Root cause identification | "Identifying why the process failed" |
| remediation | Solution recommendations | "Suggesting corrective actions" |
| validation | Verification of findings | "Confirming the root cause hypothesis" |
Monitoring Recommendations
Polling frequency:
- Check status every 10-15 seconds for updates
- More frequent polling won’t speed up investigation
- Less frequent means you might miss interim progress
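The cadence above can be sketched as a simple loop. Here get_status stands in for however your client calls neubird_get_investigation_status; it is injected as a parameter so the loop can be shown (and tested) in isolation:

```python
import time

def poll_until_complete(get_status, interval_s=15, timeout_s=600):
    """Poll every 10-15 seconds until the investigation finishes."""
    waited = 0
    while waited <= timeout_s:
        status = get_status()
        # Terminal states per this guide: completed or failed.
        if status.get("status") in ("completed", "failed"):
            return status
        time.sleep(interval_s)
        waited += interval_s
    raise TimeoutError("investigation did not finish within timeout")
```

Polling faster than this will not speed the investigation up; it only burns requests.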
What to watch:
Show me the investigation status
# Look for:
# - progress_percentage: How far along (0-100)
# - current_step: What's happening right now
# - unique_sources: Which data sources have been checked
# - completed_steps / total_steps: Step progress

Data Source Attribution
The unique_sources field shows exactly which monitoring systems contributed to the investigation:
Example sources:
```
unique_sources: [
  "log_datadog.datadog_logs",
  "monitor_datadog.monitor_events",
  "alarm_aws_poc_sandbox.alarm_history",
  "log_aws_poc_sandbox.log_aws_containerinsights_eks_htm_prod",
  "metric_aws_poc_sandbox.cloudwatch_metrics"
]
```

Why this matters:
- Verify comprehensive coverage - Ensure all relevant systems were checked
- Understand thoroughness - More sources = more thorough analysis
- Identify gaps - Missing expected sources? May indicate sync issues
- Debug investigations - See exactly what data was available
Progress Example
Starting an investigation:

```json
{
  "session_uuid": "abc123...",
  "status": "in_progress",
  "progress_percentage": 0,
  "current_step": "Initializing investigation...",
  "investigation_summary": {
    "total_steps": 5,
    "completed_steps": 0,
    "total_sources_consulted": 0,
    "unique_sources": []
  }
}
```

Mid-investigation:

```json
{
  "session_uuid": "abc123...",
  "status": "in_progress",
  "progress_percentage": 45,
  "current_step": "🔍 Discovery: Analyzing system.load.1 metrics across production hosts",
  "investigation_summary": {
    "total_steps": 5,
    "completed_steps": 2,
    "total_sources_consulted": 8,
    "unique_sources": [
      "log_datadog.datadog_logs",
      "monitor_datadog.monitor_events",
      "alarm_aws_poc_sandbox.alarm_history"
    ],
    "steps": [
      {
        "step_id": "69247c6b517e7056d602abd1",
        "step_number": 1,
        "description": "What was the system.load.1 metric value and which hosts experienced load average above 2.0?",
        "category": "discovery",
        "status": "completed",
        "sources_consulted": [
          "monitor_datadog.monitor_events",
          "log_datadog.datadog_logs"
        ]
      },
      {
        "step_id": "69247c6b517e7056d602abd2",
        "step_number": 2,
        "description": "Analyzing CPU and memory usage patterns during the incident",
        "category": "analysis",
        "status": "in_progress",
        "sources_consulted": [
          "alarm_aws_poc_sandbox.alarm_history"
        ]
      }
    ]
  }
}
```

Typical Investigation Flow
Phase 1: Discovery (0-30%)
- Gathering initial facts
- Identifying affected resources
- Collecting metrics and logs
- Timeline construction
Phase 2: Analysis (30-60%)
- Pattern detection
- Correlation analysis
- Comparing with baselines
- Examining related events
Phase 3: Diagnosis (60-85%)
- Root cause identification
- Validating hypotheses
- Determining contributing factors
- Impact assessment
Phase 4: Remediation (85-100%)
- Generating corrective actions
- Creating preventive measures
- Calculating time savings
- Finalizing RCA report
Common Progress Patterns
Fast start, then slower:

0% → 40% (first minute) → 60% (next 2 minutes) → 100%

Initial discovery is quick, detailed analysis takes longer.
Steady progress:
0% → 20% → 40% → 60% → 80% → 100% (evenly paced)

Indicates straightforward incident with clear data.

Stuck at percentage:

0% → 45% (stays here for 1-2 minutes) → 100%

Normal for complex data queries or correlation analysis.
Investigation Quality
Quality Scoring
Every investigation gets scored on:
- Accuracy (Root cause correctness, impact analysis)
- Completeness (Data coverage, remediation steps)
Check quality:
Show me the quality score for this investigation

Uses neubird_get_rca_score.
- Good scores: 85-100 (Excellent)
- Acceptable: 70-84 (Good)
- Needs improvement: <70 (add more instructions)
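The bands map directly onto score values; a minimal sketch with thresholds taken from this guide (the function name is illustrative, not part of the NeuBird API):

```python
def score_band(score: int) -> str:
    # Thresholds from the quality-score guidance above.
    if score >= 85:
        return "Excellent"
    if score >= 70:
        return "Good"
    return "Needs improvement"
```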
Improving Quality
Add context instructions:

Create a SYSTEM instruction about our architecture

Add investigation steps:

Create an RCA instruction for database issues

Provide feedback: If an RCA is incorrect, add an instruction to guide future investigations.
Common Investigation Patterns
Pattern 1: API Latency

Symptoms: Slow API responses, timeouts
Investigation approach:
- Check APM traces (Datadog, New Relic)
- Review database query performance
- Check external API dependencies
- Examine cache hit rates
- Review resource utilization
Common causes:
- Missing database indexes
- Inefficient queries
- External API slowdowns
- Cache misses
- Insufficient resources
Pattern 2: Memory Leaks
Symptoms: Increasing memory, eventual OOM crash
Investigation approach:
- Review memory growth timeline
- Check for unclosed connections
- Look for accumulating caches
- Examine heap dumps
- Review recent code changes
Common causes:
- Unclosed database connections
- Event listener leaks
- Growing in-memory caches
- Circular references
Pattern 3: Database Issues
Symptoms: Slow queries, connection timeouts
Investigation approach:
- Check slow query logs
- Review connection pool metrics
- Examine table locks
- Check index usage
- Review database resource metrics
Common causes:
- Missing indexes
- Lock contention
- Connection pool exhaustion
- Long-running transactions
- Inefficient queries
Pattern 4: Deployment Failures
Symptoms: Service crashes after deployment
Investigation approach:
- Compare deployment manifests
- Review container startup logs
- Check environment variables
- Examine health check failures
- Review resource limits
Common causes:
- Configuration errors
- Missing environment variables
- Insufficient resources
- Database migration failures
- Incompatible dependencies
Workflow Tips
Morning Review
# Check overnight incidents
Show me uninvestigated alerts from the last 12 hours

# Investigate critical ones
Investigate all P1 alerts

Incident Response
# During active incident
Show me alerts from the last 30 minutes

# Quick investigation
Investigate this alert and wait for results

# Get actionable fixes
Show me the corrective actions

Post-Mortem
# Review incident
Show me the complete RCA for session abc-123

# Understand what happened
Show me the timeline and chain of thought

# Prevent recurrence
What preventive measures are recommended?

Troubleshooting Investigations
Investigation Taking Too Long
Normal duration: 30-90 seconds
If longer:
- First investigation may take 5-10 minutes (syncing connections)
- Complex incidents with many data sources take longer
- Check connection sync status
Incomplete RCA
If RCA lacks details:
- Add more SYSTEM instructions with context
- Add RCA instructions with investigation steps
- Ensure connections are properly synced
- Check data source availability
Incorrect Root Cause
If RCA misidentifies the cause:
- Use follow-up questions to guide investigation
- Add instruction with correct approach
- Test instruction on past session
- Add to project if improved
Next Steps
- Use Instructions
Learn to create and test investigation instructions
- Manage Connections
Add and configure data sources
- Advanced Workflows
Power user techniques