Running Investigations

Master the art of AI-powered production operations with NeuBird MCP. This guide covers both alert-based and manual investigations.
Quick Reference
```mermaid
graph LR
  A[List Uninvestigated] --> B[Investigate Alert]
  B --> C[Monitor Progress]
  C --> D[Get RCA]
  D --> E[Implement Actions]
  E --> F[Follow-up Questions]
  F --> G[Close Incident]
```
Finding Alerts to Investigate
List Uninvestigated Alerts
Ask Claude:

Show me uninvestigated alerts from the last 24 hours

Uses neubird_list_sessions with only_uninvestigated=true.
Filtering options:
Show me uninvestigated P1 alerts from the last 7 days
Find uninvestigated database alerts
Show me alerts from api-service in the last hour

Getting Links to Investigations
After listing investigations, you can request a direct link to view them in the NeuBird web interface:
By list number:
Link for #2

By session ID:

Link for session ID abc-123-def-456

This provides a URL to open the investigation directly in your browser, useful for:
- Sharing investigations with team members
- Viewing investigations in the full web interface
- Bookmarking specific investigations
- Including in incident reports or documentation
Understanding Alert Information
Each alert shows:
| Field | Description |
|---|---|
| Alert ID | Unique identifier for investigation |
| Title | Incident description |
| Severity | P1 (critical) to P4 (low) |
| Timestamp | When alert fired |
| Source | Monitoring tool (CloudWatch, Datadog, etc.) |
Important: The alert_id (shown as incident_info.id) is what you use to start an investigation.
Starting an Investigation
Basic Investigation

Investigate alert ID: /subscriptions/.../alerts/cpu-spike-123

Or simply:
Investigate the first alert

Claude will:

- Extract the alert_id
- Call neubird_investigate_alert
- Monitor progress using neubird_get_investigation_status
- Retrieve RCA when complete
Manual Investigations
Create investigations from custom prompts without needing an existing alert.
When to Use Manual Investigations
- Proactive analysis - Investigate potential issues before they become alerts
- Historical research - Analyze past incidents not captured as alerts
- What-if scenarios - Test hypothetical situations
- Training - Create example investigations for documentation
- Testing - Validate system behavior without waiting for real alerts
Creating a Manual Investigation
Good prompt (specific):

Investigate high latency in payment-service between 2pm-3pm EST on Jan 15, 2025.
Users reported checkout taking 30+ seconds instead of the usual 2 seconds.

Bad prompt (too vague):
Something went wrong with the payment service

Claude will use neubird_create_manual_investigation:

✓ Created manual investigation
Session UUID: abc-123-def-456
Status: Running

Investigation will complete in 2-5 minutes.
Use neubird_get_rca to retrieve results.

Prompt Quality Guidelines
✅ Include these details:
- Service/Resource names - Which components are involved?
- Time frame - When did this occur or should be analyzed?
- Symptoms - What behavior are you seeing?
- Context - Any additional relevant information?
❌ Avoid:
- Vague descriptions (“something broke”)
- No timeframe (“recently”, “earlier”)
- Missing service names (“the API”, “the database”)
Example prompts:
Investigate memory leak in user-api pods in production namespace.
Started around 8am UTC today, memory usage climbing from 500MB to 2GB.

Analyze database connection pool exhaustion in postgres-primary.
Between 1pm-2pm EST yesterday. Connection count hit max of 100.

Check for cascading failures in microservices during deployment.
January 10, 2025 at 3:15pm UTC. Multiple services returned 503 errors.

Agent Behavior
If your prompt is too vague, Claude will ask clarifying questions:
You: Investigate the payment issue
Claude: I need more details to create an effective investigation:
- Which payment service? (payment-api, payment-processor, etc.)
- When did this occur? (specific date/time or time range)
- What symptoms did you observe? (errors, slow response, failures?)
- Any other context? (deployment, traffic spike, etc.)

Workflow
Complete manual investigation workflow:
1. Create investigation Investigate API latency spike in checkout-service on Jan 15 between 2-3pm EST
2. Wait for completion (2-5 minutes) Check status with: What's the status of this investigation?
3. Get RCA Show me the root cause analysis
4. Ask follow-ups What caused the latency spike? Were there any related infrastructure changes?

Parameters
Section titled “Parameters”The tool accepts:
| Parameter | Type | Default | Description |
|---|---|---|---|
| prompt | string | required | Investigation description (min 10 chars) |
| project_uuid | string | optional | Uses default project if not specified |
Using Default Project
If you’ve set a default project with neubird_set_default_project, you don’t need to specify project_uuid. Manual investigations will automatically use your default project.
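For illustration, a call to neubird_create_manual_investigation might carry parameters shaped like the sketch below. This is an assumption about the wire format (your MCP client handles the actual encoding), and the UUID is a placeholder:

```json
{
  "tool": "neubird_create_manual_investigation",
  "arguments": {
    "prompt": "Investigate high latency in payment-service between 2pm-3pm EST on Jan 15, 2025.",
    "project_uuid": "abc-123-def-456"
  }
}
```

Omit project_uuid entirely to fall back to your default project.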
Comparison with Alert-Based Investigations
| Feature | Alert-Based | Manual |
|---|---|---|
| Trigger | Existing alert | Custom prompt |
| Use Case | React to alerts | Proactive analysis |
| Alert ID | Required | Not needed |
| Prompt | Auto-generated | You provide |
| Flexibility | Limited to alert data | Any scenario |
Understanding the RCA
RCA Structure

Every RCA includes:
1. Incident Summary
- What happened
- When it happened
- Current status
2. Timeline
- Chronological event sequence
- Key state changes
- System behaviors
3. Root Cause
- Why it happened
- Contributing factors
- Technical details
4. Corrective Actions
- Immediate fixes (often auto-executed)
- Manual actions needed
- Ready-to-execute bash scripts
5. Time Savings
- Manual investigation time estimate
- Actual NeuBird time
- Time saved
Asking Follow-Up Questions
Common Follow-Ups

Understanding the cause:
Why did this happen?
Has this happened before?
What changed recently?

Investigation details:

What data sources were checked?
Show me the logs that were analyzed
What queries were run?

Prevention:

How can we prevent this?
What monitoring should we add?
What tests would catch this?

Similar incidents:

Have we seen similar issues?
Is this related to recent deployments?
Are other services affected?

Follow-Up Workflow
# After getting RCA
Show me the chain of thought for this investigation

Uses neubird_get_chain_of_thought to show reasoning steps.
What data sources were consulted?

Uses neubird_get_investigation_sources.

What queries were executed?

Uses neubird_get_investigation_queries.

What are suggested follow-up questions?

Uses neubird_get_follow_up_suggestions.
Implementing Corrective Actions
One of the most powerful features of NeuBird MCP is that you can ask your AI coding agent to directly implement the recommended fixes. Your agent has full context of both your codebase and NeuBird’s analysis.
The Power of MCP Integration
Because NeuBird runs as an MCP server, your coding agent (Claude Code, Claude Desktop, Cursor, etc.) can:
- Get the RCA with detailed corrective actions
- Understand your codebase through file access
- Implement fixes directly in your code
- Execute commands through its shell access
- Verify the fix by checking logs and metrics
This creates a seamless incident response workflow where investigation and remediation happen in the same conversation.
Workflow: From Investigation to Implementation
Step 1: Get the RCA

Show me the RCA for this investigation

Step 2: Ask your agent to implement the fix

Please implement the corrective actions from this RCA.
Start with the highest priority items.

Step 3: Your agent will:
- Read the relevant code files
- Make the necessary changes
- Run tests to verify
- Execute deployment commands (if appropriate)
- Confirm the changes
Example: Implementing a Database Index
Investigation finds: Missing index causing slow queries
RCA provides:
CREATE INDEX CONCURRENTLY idx_orders_user_id ON orders(user_id);

You ask your agent:

Please add this database index. Use a migration file following our project's migration pattern.

Your agent will:
- Check existing migration files to understand the pattern
- Create a new migration file (e.g., 20250124_add_orders_user_id_index.sql)
- Add the CREATE INDEX statement with proper syntax
- Add a corresponding DROP INDEX for rollback
- Update the migration tracking
- Suggest running the migration command
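Such a migration might look like the sketch below. The filename and the up/down comment convention are assumptions about your project's migration tooling, and CONCURRENTLY is PostgreSQL-specific:

```sql
-- 20250124_add_orders_user_id_index.sql (hypothetical filename)

-- Up: build the index without blocking writes to the table
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_orders_user_id ON orders (user_id);

-- Down (rollback):
-- DROP INDEX CONCURRENTLY IF EXISTS idx_orders_user_id;
```

Note that PostgreSQL cannot run CREATE INDEX CONCURRENTLY inside a transaction block, so the migration runner must execute it outside one.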
Example: Fixing a Configuration Issue
Investigation finds: Connection pool too small for traffic
RCA provides:
kubectl set env deployment/api-service MAX_CONNECTIONS=200

You ask your agent:

Please update the MAX_CONNECTIONS configuration to 200.
Update both the deployment manifest and our docs.

Your agent will:
- Find the deployment YAML file
- Update the environment variable
- Update documentation to reflect the change
- Suggest applying with kubectl apply -f deployment.yaml
- Offer to add monitoring for connection pool usage
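The manifest change itself might look like this fragment; the deployment and container names are illustrative, not taken from your cluster:

```yaml
# deployment.yaml (fragment; names are illustrative)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service
spec:
  template:
    spec:
      containers:
        - name: api-service
          env:
            - name: MAX_CONNECTIONS
              value: "200"  # raised per the RCA recommendation
```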
Example: Implementing a Code Fix
Investigation finds: Memory leak from unclosed database connections
RCA explains: Connections not being released in error paths
You ask your agent:
Please fix the connection leak in the payment service.
Make sure all code paths properly close connections.

Your agent will:
- Read the payment service code
- Identify all database connection usage
- Add proper try/finally blocks or context managers
- Ensure connections are closed in error paths
- Add tests to verify the fix
- Run existing tests to ensure no regressions
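As a sketch of the pattern the agent would apply (using sqlite3 as a stand-in driver and an illustrative function name, not the payment service's real code), a context manager guarantees the connection is released on every code path:

```python
import sqlite3
from contextlib import closing

def fetch_payment(db_path: str, payment_id: int):
    # closing() calls conn.close() when the block exits,
    # including on exceptions raised mid-query -- this is
    # what plugs the leak in error paths.
    with closing(sqlite3.connect(db_path)) as conn:
        cur = conn.execute(
            "SELECT id, amount FROM payments WHERE id = ?", (payment_id,)
        )
        return cur.fetchone()
```

The same shape works with try/finally in drivers that lack context-manager support.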
Preventive Measures
After fixing the immediate issue, implement preventive measures:

Please implement the preventive measures from the RCA:
1. Add monitoring for connection pool usage
2. Add alerts for connection pool >80% full
3. Update our runbook with this scenario

Best Practices for Agent-Driven Remediation
✅ Do:
- Review the RCA thoroughly before asking for implementation
- Be specific about your project’s patterns and conventions
- Ask the agent to run tests after making changes
- Request documentation updates alongside code changes
- Have the agent explain what it’s doing
- Use staging environments for testing changes
❌ Don’t:
- Blindly apply changes to production without review
- Skip testing steps
- Implement changes you don’t understand
- Ignore the agent’s warnings or questions
- Rush through validation steps
Manual Review Required
Some actions require human judgment and shouldn’t be fully automated:
Deployment decisions:
- Rolling back to previous versions
- Scaling production resources
- Database schema changes
- Infrastructure modifications
For these, ask your agent to:
Please prepare the rollback command but don't execute it yet.
Show me what will happen and wait for my approval.

Bash Scripts in RCA
RCAs include ready-to-run commands for immediate fixes:
```bash
# Example: Add database index
psql -h prod-db.rds.amazonaws.com -U admin -d production
CREATE INDEX CONCURRENTLY idx_orders_user_id ON orders(user_id);
```

You can execute these directly or ask your agent to integrate them into your workflow:
Please add this as a migration following our standard process.

Validation Checklist
Before executing any fix, verify:
- Commands are safe (no destructive operations)
- Correct environment (prod vs staging)
- Proper permissions and access
- Backup/rollback plan in place
- Tests pass after changes
- Changes follow project conventions
- Documentation is updated
Advanced Investigation Techniques
Filtering by Severity

Show me only P1 and P2 uninvestigated alerts

Filtering by Date Range

Show me uninvestigated alerts from January 1-15

Searching by Keyword

Find alerts related to "database timeout"

Uses the search_term parameter in neubird_list_sessions.
Compact Mode
For large result sets:

Show me all uninvestigated alerts in compact format

Returns minimal details for faster browsing.
Investigation Status Tracking
Check ongoing investigations:

Show me the status of session abc-123-def-456

Uses neubird_get_investigation_status.
Real-Time Progress Monitoring
NeuBird provides live streaming updates during investigations, giving you visibility into what’s happening at each step.
Understanding Progress Updates
Section titled “Understanding Progress Updates”When an investigation is running, neubird_get_investigation_status provides:
Current Progress:
- progress_percentage: 0-100 indicating completion based on completed steps
- current_step: Human-readable description of what’s happening now
- status: Investigation state (in_progress, completed, failed)
Investigation Breakdown:
- total_steps: Estimated number of investigation steps
- completed_steps: How many steps have finished
- unique_sources: All data sources consulted (logs, metrics, alarms)
Step Details: Each step includes:
- step_id: Unique identifier for detailed lookups
- step_number: Sequential step number
- description: What question is being answered
- category: Type of investigation step (see below)
- status: Step state (in_progress, completed, possible_question)
- sources_consulted: Which data sources contributed to this step
Investigation Step Categories
Each step is categorized by its purpose:
| Category | Purpose | Example |
|---|---|---|
| discovery | Initial fact-finding and data gathering | "What was the system.load.1 metric value?" |
| analysis | Examining patterns and correlations | "Analyzing correlation between CPU and memory" |
| diagnosis | Root cause identification | "Identifying why the process failed" |
| remediation | Solution recommendations | "Suggesting corrective actions" |
| validation | Verification of findings | "Confirming the root cause hypothesis" |
Monitoring Recommendations
Polling frequency:
- Check status every 10-15 seconds for updates
- More frequent polling won’t speed up investigation
- Less frequent means you might miss interim progress
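The cadence above can be sketched as a simple loop. Here get_status stands in for however your client calls neubird_get_investigation_status; it is injected as a parameter so the loop can be shown (and tested) in isolation:

```python
import time

def poll_until_complete(get_status, interval_s=15, timeout_s=600):
    """Poll every 10-15 seconds until the investigation finishes."""
    waited = 0
    while waited <= timeout_s:
        status = get_status()
        # Terminal states per this guide: completed or failed.
        if status.get("status") in ("completed", "failed"):
            return status
        time.sleep(interval_s)
        waited += interval_s
    raise TimeoutError("investigation did not finish within timeout")
```

Polling faster than this will not speed the investigation up; it only burns requests.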
What to watch:
Show me the investigation status
# Look for:
# - progress_percentage: How far along (0-100)
# - current_step: What's happening right now
# - unique_sources: Which data sources have been checked
# - completed_steps / total_steps: Step progress

Data Source Attribution
The unique_sources field shows exactly which monitoring systems contributed to the investigation:
Example sources:
```
unique_sources: [
  "log_datadog.datadog_logs",
  "monitor_datadog.monitor_events",
  "alarm_aws_poc_sandbox.alarm_history",
  "log_aws_poc_sandbox.log_aws_containerinsights_eks_htm_prod",
  "metric_aws_poc_sandbox.cloudwatch_metrics"
]
```

Why this matters:
- Verify comprehensive coverage - Ensure all relevant systems were checked
- Understand thoroughness - More sources = more thorough analysis
- Identify gaps - Missing expected sources? May indicate sync issues
- Debug investigations - See exactly what data was available
Progress Example
Starting an investigation:

```json
{
  "session_uuid": "abc123...",
  "status": "in_progress",
  "progress_percentage": 0,
  "current_step": "Initializing investigation...",
  "investigation_summary": {
    "total_steps": 5,
    "completed_steps": 0,
    "total_sources_consulted": 0,
    "unique_sources": []
  }
}
```

Mid-investigation:

```json
{
  "session_uuid": "abc123...",
  "status": "in_progress",
  "progress_percentage": 45,
  "current_step": "🔍 Discovery: Analyzing system.load.1 metrics across production hosts",
  "investigation_summary": {
    "total_steps": 5,
    "completed_steps": 2,
    "total_sources_consulted": 8,
    "unique_sources": [
      "log_datadog.datadog_logs",
      "monitor_datadog.monitor_events",
      "alarm_aws_poc_sandbox.alarm_history"
    ],
    "steps": [
      {
        "step_id": "69247c6b517e7056d602abd1",
        "step_number": 1,
        "description": "What was the system.load.1 metric value and which hosts experienced load average above 2.0?",
        "category": "discovery",
        "status": "completed",
        "sources_consulted": [
          "monitor_datadog.monitor_events",
          "log_datadog.datadog_logs"
        ]
      },
      {
        "step_id": "69247c6b517e7056d602abd2",
        "step_number": 2,
        "description": "Analyzing CPU and memory usage patterns during the incident",
        "category": "analysis",
        "status": "in_progress",
        "sources_consulted": [
          "alarm_aws_poc_sandbox.alarm_history"
        ]
      }
    ]
  }
}
```

Typical Investigation Flow
Phase 1: Discovery (0-30%)
- Gathering initial facts
- Identifying affected resources
- Collecting metrics and logs
- Timeline construction
Phase 2: Analysis (30-60%)
- Pattern detection
- Correlation analysis
- Comparing with baselines
- Examining related events
Phase 3: Diagnosis (60-85%)
- Root cause identification
- Validating hypotheses
- Determining contributing factors
- Impact assessment
Phase 4: Remediation (85-100%)
- Generating corrective actions
- Creating preventive measures
- Calculating time savings
- Finalizing RCA report
Common Progress Patterns
Fast start, then slower:

0% → 40% (first minute) → 60% (next 2 minutes) → 100%

Initial discovery is quick, detailed analysis takes longer.
Steady progress:
0% → 20% → 40% → 60% → 80% → 100% (evenly paced)

Indicates straightforward incident with clear data.

Stuck at percentage:

0% → 45% (stays here for 1-2 minutes) → 100%

Normal for complex data queries or correlation analysis.
Investigation Quality
Quality Scoring
Every investigation gets scored on:
- Accuracy (Root cause correctness, impact analysis)
- Completeness (Data coverage, remediation steps)
Check quality:
Show me the quality score for this investigation

Uses neubird_get_rca_score.
- Good scores: 85-100 (Excellent)
- Acceptable: 70-84 (Good)
- Needs improvement: <70 (add more instructions)
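The bands map directly onto score values; a minimal sketch with thresholds taken from this guide (the function name is illustrative, not part of the NeuBird API):

```python
def score_band(score: int) -> str:
    # Thresholds from the quality-score guidance above.
    if score >= 85:
        return "Excellent"
    if score >= 70:
        return "Good"
    return "Needs improvement"
```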
Improving Quality
Add context instructions:

Create a SYSTEM instruction about our architecture

Add investigation steps:

Create an RCA instruction for database issues

Provide feedback: If an RCA is incorrect, add an instruction to guide future investigations.
Common Investigation Patterns
Pattern 1: API Latency

Symptoms: Slow API responses, timeouts
Investigation approach:
- Check APM traces (Datadog, New Relic)
- Review database query performance
- Check external API dependencies
- Examine cache hit rates
- Review resource utilization
Common causes:
- Missing database indexes
- Inefficient queries
- External API slowdowns
- Cache misses
- Insufficient resources
Pattern 2: Memory Leaks
Symptoms: Increasing memory, eventual OOM crash
Investigation approach:
- Review memory growth timeline
- Check for unclosed connections
- Look for accumulating caches
- Examine heap dumps
- Review recent code changes
Common causes:
- Unclosed database connections
- Event listener leaks
- Growing in-memory caches
- Circular references
Pattern 3: Database Issues
Symptoms: Slow queries, connection timeouts
Investigation approach:
- Check slow query logs
- Review connection pool metrics
- Examine table locks
- Check index usage
- Review database resource metrics
Common causes:
- Missing indexes
- Lock contention
- Connection pool exhaustion
- Long-running transactions
- Inefficient queries
Pattern 4: Deployment Failures
Symptoms: Service crashes after deployment
Investigation approach:
- Compare deployment manifests
- Review container startup logs
- Check environment variables
- Examine health check failures
- Review resource limits
Common causes:
- Configuration errors
- Missing environment variables
- Insufficient resources
- Database migration failures
- Incompatible dependencies
Workflow Tips
Morning Review
# Check overnight incidents
Show me uninvestigated alerts from the last 12 hours

# Investigate critical ones
Investigate all P1 alerts

Incident Response
# During active incident
Show me alerts from the last 30 minutes

# Quick investigation
Investigate this alert and wait for results

# Get actionable fixes
Show me the corrective actions

Post-Mortem
# Review incident
Show me the complete RCA for session abc-123

# Understand what happened
Show me the timeline and chain of thought

# Prevent recurrence
What preventive measures are recommended?

Troubleshooting Investigations
Investigation Taking Too Long
Normal duration: 30-90 seconds
If longer:
- First investigation may take 5-10 minutes (syncing connections)
- Complex incidents with many data sources take longer
- Check connection sync status
Incomplete RCA
If RCA lacks details:
- Add more SYSTEM instructions with context
- Add RCA instructions with investigation steps
- Ensure connections are properly synced
- Check data source availability
Incorrect Root Cause
If RCA misidentifies the cause:
- Use follow-up questions to guide investigation
- Add instruction with correct approach
- Test instruction on past session
- Add to project if improved
Next Steps
- Use Instructions
Learn to create and test investigation instructions
- Manage Connections
Add and configure data sources
- Advanced Workflows
Power user techniques