
Using Instructions

Learn to create, test, and manage investigation instructions to guide NeuBird’s behavior.

Instructions are guidelines you provide to NeuBird that customize how it investigates incidents. They help NeuBird understand your infrastructure, follow your investigation patterns, filter out noise, and focus on what matters for your organization.

Think of instructions as teaching NeuBird about your specific environment and how you want it to work.

NeuBird supports four types of instructions, each serving a specific purpose:

FILTER Instructions

Purpose: Reduce noise by filtering out low-value alerts

When to use:

  • You’re getting too many minor alerts
  • Certain alert sources are unreliable
  • You want to focus on critical incidents only

Example:

Only investigate incidents with:
- Severity P1 or P2
- Affecting production environment
- Not from load testing systems
- Not during scheduled maintenance windows

Impact: Alerts matching these criteria won’t be investigated, reducing noise and focusing attention on important issues.
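The same criteria can also be sketched as a simple predicate, which is useful if you pre-filter alerts before they reach an investigation queue. This is an illustrative Python sketch, not NeuBird's API; the field names (`severity`, `environment`, `source`, `in_maintenance_window`) are assumptions about your alert schema.

```python
# Hypothetical alert-filter predicate mirroring the FILTER instruction above.
# Field names are illustrative; adapt them to your alert schema.
def should_investigate(alert: dict) -> bool:
    return (
        alert.get("severity") in {"P1", "P2"}          # P1/P2 only
        and alert.get("environment") == "production"   # production only
        and alert.get("source") != "load-testing"      # skip load tests
        and not alert.get("in_maintenance_window", False)
    )
```

A P3 alert, a staging alert, or anything flagged as occurring during a maintenance window would be dropped before investigation.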

SYSTEM Instructions

Purpose: Provide architectural context and infrastructure details

When to use:

  • NeuBird needs to understand your architecture
  • Investigations lack context about your systems
  • You want better root cause analysis
  • You need service-specific guidance

Example:

Our infrastructure:
- Microservices architecture on AWS EKS
- api-service: Handles REST API requests
- payment-service: Processes payments via Stripe
- user-service: Authentication and user management
- notification-service: Email and SMS notifications
- worker-service: Background job processing
- Databases:
  - PostgreSQL RDS (primary + 2 read replicas)
  - Redis ElastiCache for session storage and caching
- Monitoring:
  - Datadog APM for distributed tracing
  - CloudWatch for AWS metrics
  - PagerDuty for alerting
- Traffic patterns:
  - Peak: 1000 req/sec between 2-6pm EST
  - Normal: 200 req/sec outside peak hours
- Scheduled deployments: Tuesdays 10am EST

Impact: NeuBird uses this context to make better inferences about root causes and dependencies.

GROUPING Instructions

Purpose: Group related alerts together to avoid duplicate investigations

When to use:

  • Multiple alerts fire for the same underlying issue
  • Cascading failures create alert storms
  • You want to investigate grouped incidents once

Example:

Group incidents when:
- Same service and error type within 15 minutes
- Same root cause indicator (deployment, database outage)
- Cascading failures from single upstream service
- Auto-scaling events triggering multiple alerts

Impact: Related alerts are grouped into a single investigation, reducing duplicate work.
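The first rule above (same service and error type within 15 minutes) amounts to a windowed key match. Here is a minimal Python sketch of that idea, assuming a simple alert dict shape; it illustrates the concept, not NeuBird's internals.

```python
# Illustrative time-window grouping (not NeuBird's implementation).
# Alerts with the same service and error type within 15 minutes of the
# previous alert in a group are merged into that group.
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=15)

def group_alerts(alerts):
    """alerts: dicts with 'service', 'error_type', 'timestamp' (datetime)."""
    groups = []
    for alert in sorted(alerts, key=lambda a: a["timestamp"]):
        for group in groups:
            last = group[-1]
            if (last["service"] == alert["service"]
                    and last["error_type"] == alert["error_type"]
                    and alert["timestamp"] - last["timestamp"] <= WINDOW):
                group.append(alert)
                break
        else:
            groups.append([alert])  # no match: start a new group
    return groups
```

Three timeout alerts from the same service at 12:00, 12:10, and 13:00 would yield two groups: the first two fall inside one 15-minute window, the third starts a new one.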

RCA Instructions

Purpose: Define which sections RCA reports include, the language to use, and how findings are presented, standardizing report structure across all investigations

When to use:

  • You need standardized RCA report formatting for your organization
  • Reports must be readable for both technical and non-technical stakeholders
  • You want consistent documentation structure across all incidents
  • Compliance or organizational standards require specific report sections
  • You need RCAs formatted for specific audiences (leadership, regulators, etc.)

Example - Standardized RCA Report Format:

Prompt:

Create an RCA instruction for my Test Production project:
"All RCA reports must include these sections:
1. Executive Summary - A 2-3 sentence overview suitable for leadership
2. Impact Assessment - Include affected services, user impact percentage, and estimated revenue impact if applicable
3. Timeline - Use UTC timestamps and include detection time, first response, and resolution time
4. Root Cause - Explain in both technical and non-technical terms
5. Corrective Actions - Separate into 'Immediate' (within 24h) and 'Long-term' (within 30 days)
6. Prevention Measures - Include specific monitoring thresholds to add
Format the report in markdown with clear headers. Keep technical jargon to a minimum in the Executive Summary section."

What this instruction does:

  • Ensures consistent report structure across all investigations
  • Makes RCAs readable for both technical and non-technical stakeholders
  • Provides clear action items with timelines
  • Includes forward-looking prevention measures

Impact: All RCA reports follow your organization’s documentation standards and include the sections required by your stakeholders.
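Under the instruction above, a generated report skeleton might look like the following. This is an illustrative template, not literal NeuBird output; the placeholder values are hypothetical.

```markdown
# RCA: <incident title>

## Executive Summary
Two to three sentences, minimal jargon, suitable for leadership.

## Impact Assessment
- Affected services: ...
- User impact: ...% of active users
- Estimated revenue impact (if applicable): ...

## Timeline (UTC)
- Detected: ...
- First response: ...
- Resolved: ...

## Root Cause
Technical explanation, followed by a plain-language summary.

## Corrective Actions
- Immediate (within 24h): ...
- Long-term (within 30 days): ...

## Prevention Measures
- Monitoring thresholds to add: ...
```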

Problem: Bad instructions affect ALL future investigations.

Solution: Test instructions on past investigations before adding to your project.

Testing ensures your instruction:

  • Actually improves the investigation quality
  • Doesn’t introduce false positives or noise
  • Works with your actual data and alert patterns
  • Produces actionable recommendations
Testing workflow:

graph LR
    A[Write Instruction] --> B[Validate]
    B --> C[Apply to Test Session]
    C --> D[Rerun Investigation]
    D --> E[Compare RCAs]
    E -->|Better| F[Add to Project]
    E -->|Worse| G[Refine & Retry]

1. Pick a past investigation as a test case
2. Validate your instruction
3. Apply the instruction to that session
4. Rerun the investigation
5. Compare the new RCA with the original

Only add to project if the new RCA is better!

Find a past investigation to test against:

Show me investigations from the last 7 days

Pick one that represents the type of incident your instruction targets.

Example:

I want to test an instruction for database issues.
Show me database-related investigations.

Save the session_uuid from the results.

Draft the instruction content:

Example - API Latency Investigation:

For API latency or timeout incidents:
1. Check database query performance in slow query logs
2. Review connection pool metrics (active, idle, waiting)
3. Examine API endpoint traces in Datadog APM
4. Check for downstream service latency
5. Verify cache hit rates in Redis
6. Look for recent deployments or configuration changes
7. Analyze request rate patterns and traffic spikes
8. Provide specific optimization recommendations with commands

Validate this RCA instruction:
"For API latency or timeout incidents:
1. Check database query performance in slow query logs
2. Review connection pool metrics (active, idle, waiting)
3. Examine API endpoint traces in Datadog APM
4. Check for downstream service latency
5. Verify cache hit rates in Redis
6. Look for recent deployments or configuration changes
7. Analyze request rate patterns and traffic spikes
8. Provide specific optimization recommendations with commands"

Uses neubird_validate_instruction:

✓ Instruction validated successfully
Generated name: "API Latency Investigation Methodology"
Type: RCA
Refined content: [AI-improved version]
The instruction is ready to test.

Apply this instruction to session abc-123-def-456

Uses neubird_apply_session_instruction:

✓ Instruction applied to test session
The instruction has been added as a session-specific
override. It will ONLY affect this session when rerun.
Next step: Rerun the investigation to see the impact.

Rerun session abc-123-def-456 with the new instruction

Uses neubird_rerun_session:

🔍 Rerunning investigation...
⏳ Applying new instruction... (5s)
⏳ Re-analyzing data... (30s)
⏳ Generating new RCA... (15s)
✓ Investigation complete! (50s total)

Show me the RCA for session abc-123-def-456

Uses neubird_get_rca:

Root Cause Analysis (UPDATED)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
[NEW] Database Query Analysis:
Identified slow query in checkout endpoint:
Query: SELECT * FROM orders WHERE user_id = ?
Execution time: 850ms (vs 5ms baseline)
[NEW] Missing Index Identified:
CREATE INDEX CONCURRENTLY idx_orders_user_id
ON orders(user_id);
Estimated improvement: 850ms → 5ms (99.4% faster)
[NEW] Connection Pool Analysis:
- Current utilization: 95/100 connections
- Peak during incident: 100/100 (exhausted)
- Recommendation: Increase pool size to 150
[NEW] Cache Performance:
- Cache hit rate: 45% (baseline: 85%)
- Cache invalidation spike detected at incident start
- Related to recent deployment at 2:15pm
[NEW] Deployment Correlation:
Deployment of checkout-service v2.3.1 at 2:15pm
introduced N+1 query pattern in order lookup.
The new RCA includes specific actionable insights
that were missing in the original investigation!
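The N+1 pattern called out in the deployment correlation can be illustrated with a small, self-contained example. This uses an in-memory SQLite database and a hypothetical `orders` schema purely for demonstration; the incident's actual service code is not shown in this document.

```python
import sqlite3

# In-memory demo of the N+1 query pattern (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, user_id INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10), (2, 10), (3, 20)])

# Anti-pattern: one query per user -> N separate round trips.
def fetch_n_plus_one(user_ids):
    rows = []
    for uid in user_ids:
        rows += conn.execute(
            "SELECT id FROM orders WHERE user_id = ?", (uid,)
        ).fetchall()
    return rows

# Fix: a single query over all ids; pairs well with an index on user_id.
def fetch_batched(user_ids):
    marks = ",".join("?" * len(user_ids))
    return conn.execute(
        f"SELECT id FROM orders WHERE user_id IN ({marks})", user_ids
    ).fetchall()
```

Both functions return the same rows, but the batched version issues one query regardless of how many users are involved, which is what an index like `idx_orders_user_id` accelerates.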

Compare with original RCA:

| Aspect | Original RCA | New RCA |
| --- | --- | --- |
| Root cause identified | ✓ Generic | ✓ Specific |
| Specific queries shown | ✗ No | ✓ Yes |
| Index suggestion | ✗ Generic | ✓ Specific SQL |
| Performance estimate | ✗ No | ✓ Yes |
| Connection pool analysis | ✗ No | ✓ Yes |
| Cache analysis | ✗ No | ✓ Yes |
| Deployment correlation | ✗ No | ✓ Yes |

Verdict: New RCA is significantly better!

This instruction improved the RCA. Add it to my Production project.

Uses neubird_create_project_instruction:

✓ Created RCA instruction for Production project
Name: "API Latency Investigation Methodology"
Status: Active
All future API latency investigations will use this
instruction to provide detailed performance analysis.

✅ Do:

  • Be specific and actionable
  • Reference actual tools and systems you use
  • Include commands or queries when relevant
  • Focus on outcomes (what to find, not how)
  • Use clear, numbered steps for RCA instructions
  • Include context about your environment
  • Test before deploying

❌ Don’t:

  • Write vague instructions (“check performance”)
  • Reference tools you don’t have (“check EXPLAIN output” if not logged)
  • Create instructions for edge cases
  • Make instructions too long (keep under 300 words)
  • Add instructions without testing
  • Duplicate information across instructions

Test multiple scenarios:

1. Test on the incident type you're targeting
2. Test on a related but different incident type
3. Test on an unrelated incident type

This ensures your instruction helps the right cases and doesn’t hurt others.

Iterate based on results:

  • First version too broad? Make it more specific
  • Missing important checks? Add more steps
  • Too prescriptive? Make it more flexible
  • Not improving results? Refine or discard

Review regularly:

Show me all active instructions for Production project

Disable underperforming instructions:

Disable the instruction "Database Performance Checks"

Update instructions as your system evolves:

  • New services added? Update SYSTEM instructions
  • New monitoring tools? Update RCA instructions
  • Alert patterns changed? Update FILTER instructions

RCA instruction:
For incidents affecting the payment-service:
1. Check Stripe API response times and error rates
2. Review payment transaction logs for failures
3. Examine database connection pool for payment DB
4. Check for PCI compliance logging issues
5. Verify webhook delivery status
6. Look for rate limiting from Stripe
SYSTEM instruction:
Scheduled maintenance:
- Database backups: Daily 2-3am EST
- Deployment windows: Tuesday/Thursday 10am EST
- Cache warmup: After each deployment (5-10 min)
- Traffic patterns: Peak 2-6pm EST weekdays
Consider timing when analyzing incident causes.
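Timing checks like this are also easy to automate on your side. A minimal sketch, assuming the nightly backup window from the instruction above and timestamps already converted to EST:

```python
from datetime import datetime, time

# Assumed window taken from the SYSTEM instruction above (times in EST).
BACKUP_WINDOW = (time(2, 0), time(3, 0))

def during_backup_window(ts: datetime) -> bool:
    """True if ts (already in EST) falls inside the nightly backup window."""
    start, end = BACKUP_WINDOW
    return start <= ts.time() < end
```

An incident at 2:30am EST would flag as overlapping the backup window, suggesting backup load as a candidate cause; one at 10am would not.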
GROUPING instruction:
Group incidents when they occur within 15 minutes and:
- Multiple services report connection timeouts
- Database or Redis metrics show anomalies
- Gateway/load balancer shows elevated errors
- Upstream service degradation detected
Likely cascading failure from shared dependency.
FILTER instruction:
Do NOT investigate:
- Alerts from dev/staging environments
- Load test results from automated testing
- Synthetic monitoring checks (exclude alert tag: synthetic)
- Auto-scaling events (normal operational behavior)
- Alerts during maintenance windows (2-3am EST)

Problem: The instruction doesn't improve investigations.

Possible causes:

  1. Instruction too vague
  2. Data sources don’t support the checks
  3. Instruction targets wrong incident type
  4. System context missing

Solution: Refine and test again with more specific guidance.

Problem: Investigations got worse after adding the instruction.

Possible causes:

  1. Too prescriptive (limiting NeuBird’s analysis)
  2. Incorrect assumptions about infrastructure
  3. Conflicts with other instructions

Solution: Disable, refine, or remove the instruction.

Problem: The instruction doesn't seem to be applied.

Check:

  1. Instruction is active (not disabled)
  2. Instruction type matches use case
  3. No conflicting instructions
  4. Project has the instruction

Show me instruction details for [instruction-uuid]
Next steps:

  • Running Investigations — apply your instructions to real incidents
  • Managing Connections — connect data sources for investigations
  • Advanced Workflows — power user techniques