
Using Instructions

Learn to create, test, and manage investigation instructions to guide NeuBird’s behavior.

Instructions are guidelines you provide to NeuBird that customize how it investigates incidents. They help NeuBird understand your infrastructure, follow your investigation patterns, filter out noise, and focus on what matters for your organization.

Think of instructions as teaching NeuBird about your specific environment and how you want it to work.

NeuBird supports four types of instructions, each serving a specific purpose:

FILTER Instructions

Purpose: Reduce noise by filtering out low-value alerts

When to use:

  • You’re getting too many minor alerts
  • Certain alert sources are unreliable
  • You want to focus on critical incidents only

Example:

Only investigate incidents with:
- Severity P1 or P2
- Affecting production environment
- Not from load testing systems
- Not during scheduled maintenance windows

Impact: Alerts matching these criteria won’t be investigated, reducing noise and focusing attention on important issues.
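The same criteria can also be sketched as a simple predicate, which is useful if you pre-filter alerts before they reach an investigation queue. This is an illustrative Python sketch, not NeuBird's API; the field names (`severity`, `environment`, `source`, `in_maintenance_window`) are assumptions about your alert schema.

```python
# Hypothetical alert-filter predicate mirroring the FILTER instruction above.
# Field names are illustrative; adapt them to your alert schema.
def should_investigate(alert: dict) -> bool:
    return (
        alert.get("severity") in {"P1", "P2"}          # P1/P2 only
        and alert.get("environment") == "production"   # production only
        and alert.get("source") != "load-testing"      # skip load tests
        and not alert.get("in_maintenance_window", False)
    )
```

A P3 alert, a staging alert, or anything flagged as occurring during a maintenance window would be dropped before investigation.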

SYSTEM Instructions

Purpose: Provide architectural context and infrastructure details

When to use:

  • NeuBird needs to understand your architecture
  • Investigations lack context about your systems
  • You want better root cause analysis
  • You need service-specific guidance

Example:

Our infrastructure:
- Microservices architecture on AWS EKS
- api-service: Handles REST API requests
- payment-service: Processes payments via Stripe
- user-service: Authentication and user management
- notification-service: Email and SMS notifications
- worker-service: Background job processing
- Databases:
  - PostgreSQL RDS (primary + 2 read replicas)
  - Redis ElastiCache for session storage and caching
- Monitoring:
  - Datadog APM for distributed tracing
  - CloudWatch for AWS metrics
  - PagerDuty for alerting
- Traffic patterns:
  - Peak: 1000 req/sec between 2-6pm EST
  - Normal: 200 req/sec outside peak hours
- Scheduled deployments: Tuesdays 10am EST

Impact: NeuBird uses this context to make better inferences about root causes and dependencies.

GROUPING Instructions

Purpose: Group related alerts together to avoid duplicate investigations

When to use:

  • Multiple alerts fire for the same underlying issue
  • Cascading failures create alert storms
  • You want to investigate grouped incidents once

Example:

Group incidents when:
- Same service and error type within 15 minutes
- Same root cause indicator (deployment, database outage)
- Cascading failures from single upstream service
- Auto-scaling events triggering multiple alerts

Impact: Related alerts are grouped into a single investigation, reducing duplicate work.
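The first rule above (same service and error type within 15 minutes) amounts to a windowed key match. Here is a minimal Python sketch of that idea, assuming a simple alert dict shape; it illustrates the concept, not NeuBird's internals.

```python
# Illustrative time-window grouping (not NeuBird's implementation).
# Alerts with the same service and error type within 15 minutes of the
# previous alert in a group are merged into that group.
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=15)

def group_alerts(alerts):
    """alerts: dicts with 'service', 'error_type', 'timestamp' (datetime)."""
    groups = []
    for alert in sorted(alerts, key=lambda a: a["timestamp"]):
        for group in groups:
            last = group[-1]
            if (last["service"] == alert["service"]
                    and last["error_type"] == alert["error_type"]
                    and alert["timestamp"] - last["timestamp"] <= WINDOW):
                group.append(alert)
                break
        else:
            groups.append([alert])  # no match: start a new group
    return groups
```

Three timeout alerts from the same service at 12:00, 12:10, and 13:00 would yield two groups: the first two fall inside one 15-minute window, the third starts a new one.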

RCA Instructions

Purpose: Define which sections RCA reports include, the language to use, and how findings are presented, standardizing report structure across all investigations

When to use:

  • You need standardized RCA report formatting for your organization
  • Reports must be readable for both technical and non-technical stakeholders
  • You want consistent documentation structure across all incidents
  • Compliance or organizational standards require specific report sections
  • You need RCAs formatted for specific audiences (leadership, regulators, etc.)

Example - Standardized RCA Report Format:

Prompt:

Create an RCA instruction for my Test Production project:
"All RCA reports must include these sections:
1. Executive Summary - A 2-3 sentence overview suitable for leadership
2. Impact Assessment - Include affected services, user impact percentage, and estimated revenue impact if applicable
3. Timeline - Use UTC timestamps and include detection time, first response, and resolution time
4. Root Cause - Explain in both technical and non-technical terms
5. Corrective Actions - Separate into 'Immediate' (within 24h) and 'Long-term' (within 30 days)
6. Prevention Measures - Include specific monitoring thresholds to add
Format the report in markdown with clear headers. Keep technical jargon to a minimum in the Executive Summary section."

What this instruction does:

  • Ensures consistent report structure across all investigations
  • Makes RCAs readable for both technical and non-technical stakeholders
  • Provides clear action items with timelines
  • Includes forward-looking prevention measures

Impact: All RCA reports follow your organization’s documentation standards and include the sections required by your stakeholders.
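Under the instruction above, a generated report skeleton might look like the following. This is an illustrative template, not literal NeuBird output; the placeholder values are hypothetical.

```markdown
# RCA: <incident title>

## Executive Summary
Two to three sentences, minimal jargon, suitable for leadership.

## Impact Assessment
- Affected services: ...
- User impact: ...% of active users
- Estimated revenue impact (if applicable): ...

## Timeline (UTC)
- Detected: ...
- First response: ...
- Resolved: ...

## Root Cause
Technical explanation, followed by a plain-language summary.

## Corrective Actions
- Immediate (within 24h): ...
- Long-term (within 30 days): ...

## Prevention Measures
- Monitoring thresholds to add: ...
```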

Problem: Bad instructions affect ALL future investigations.

Solution: Test instructions on past investigations before adding to your project.

Testing ensures your instruction:

  • Actually improves the investigation quality
  • Doesn’t introduce false positives or noise
  • Works with your actual data and alert patterns
  • Produces actionable recommendations
Testing workflow:

graph LR
    A[Write Instruction] --> B[Validate]
    B --> C[Apply to Test Session]
    C --> D[Rerun Investigation]
    D --> E[Compare RCAs]
    E -->|Better| F[Add to Project]
    E -->|Worse| G[Refine & Retry]

1. Pick a past investigation as a test case
2. Validate your instruction
3. Apply the instruction to that session
4. Rerun the investigation
5. Compare the new RCA with the original

Only add to project if the new RCA is better!

Find a past investigation to test against:

Show me investigations from the last 7 days

Pick one that represents the type of incident your instruction targets.

Example:

I want to test an instruction for database issues.
Show me database-related investigations.

Save the session_uuid from the results.

Draft the instruction content:

Example - API Latency Investigation:

For API latency or timeout incidents:
1. Check database query performance in slow query logs
2. Review connection pool metrics (active, idle, waiting)
3. Examine API endpoint traces in Datadog APM
4. Check for downstream service latency
5. Verify cache hit rates in Redis
6. Look for recent deployments or configuration changes
7. Analyze request rate patterns and traffic spikes
8. Provide specific optimization recommendations with commands

Validate this RCA instruction:
"For API latency or timeout incidents:
1. Check database query performance in slow query logs
2. Review connection pool metrics (active, idle, waiting)
3. Examine API endpoint traces in Datadog APM
4. Check for downstream service latency
5. Verify cache hit rates in Redis
6. Look for recent deployments or configuration changes
7. Analyze request rate patterns and traffic spikes
8. Provide specific optimization recommendations with commands"

Uses neubird_validate_instruction:

✓ Instruction validated successfully
Generated name: "API Latency Investigation Methodology"
Type: RCA
Refined content: [AI-improved version]
The instruction is ready to test.

Apply this instruction to session abc-123-def-456

Uses neubird_apply_session_instruction:

✓ Instruction applied to test session
The instruction has been added as a session-specific
override. It will ONLY affect this session when rerun.
Next step: Rerun the investigation to see the impact.

Rerun session abc-123-def-456 with the new instruction

Uses neubird_rerun_session:

🔍 Rerunning investigation...
⏳ Applying new instruction... (5s)
⏳ Re-analyzing data... (30s)
⏳ Generating new RCA... (15s)
✓ Investigation complete! (50s total)

Show me the RCA for session abc-123-def-456

Uses neubird_get_rca:

Root Cause Analysis (UPDATED)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
[NEW] Database Query Analysis:
Identified slow query in checkout endpoint:
Query: SELECT * FROM orders WHERE user_id = ?
Execution time: 850ms (vs 5ms baseline)
[NEW] Missing Index Identified:
CREATE INDEX CONCURRENTLY idx_orders_user_id
ON orders(user_id);
Estimated improvement: 850ms → 5ms (99.4% faster)
[NEW] Connection Pool Analysis:
- Current utilization: 95/100 connections
- Peak during incident: 100/100 (exhausted)
- Recommendation: Increase pool size to 150
[NEW] Cache Performance:
- Cache hit rate: 45% (baseline: 85%)
- Cache invalidation spike detected at incident start
- Related to recent deployment at 2:15pm
[NEW] Deployment Correlation:
Deployment of checkout-service v2.3.1 at 2:15pm
introduced N+1 query pattern in order lookup.
The new RCA includes specific actionable insights
that were missing in the original investigation!
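The N+1 pattern called out in the deployment correlation can be illustrated with a small, self-contained example. This uses an in-memory SQLite database and a hypothetical `orders` schema purely for demonstration; the incident's actual service code is not shown in this document.

```python
import sqlite3

# In-memory demo of the N+1 query pattern (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, user_id INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10), (2, 10), (3, 20)])

# Anti-pattern: one query per user -> N separate round trips.
def fetch_n_plus_one(user_ids):
    rows = []
    for uid in user_ids:
        rows += conn.execute(
            "SELECT id FROM orders WHERE user_id = ?", (uid,)
        ).fetchall()
    return rows

# Fix: a single query over all ids; pairs well with an index on user_id.
def fetch_batched(user_ids):
    marks = ",".join("?" * len(user_ids))
    return conn.execute(
        f"SELECT id FROM orders WHERE user_id IN ({marks})", user_ids
    ).fetchall()
```

Both functions return the same rows, but the batched version issues one query regardless of how many users are involved, which is what an index like `idx_orders_user_id` accelerates.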

Compare with original RCA:

| Aspect | Original RCA | New RCA |
| --- | --- | --- |
| Root cause identified | ✓ Generic | ✓ Specific |
| Specific queries shown | ✗ No | ✓ Yes |
| Index suggestion | ✗ Generic | ✓ Specific SQL |
| Performance estimate | ✗ No | ✓ Yes |
| Connection pool analysis | ✗ No | ✓ Yes |
| Cache analysis | ✗ No | ✓ Yes |
| Deployment correlation | ✗ No | ✓ Yes |

Verdict: New RCA is significantly better!

This instruction improved the RCA. Add it to my Production project.

Uses neubird_create_project_instruction:

✓ Created RCA instruction for Production project
Name: "API Latency Investigation Methodology"
Status: Active
All future API latency investigations will use this
instruction to provide detailed performance analysis.

✅ Do:

  • Be specific and actionable
  • Reference actual tools and systems you use
  • Include commands or queries when relevant
  • Focus on outcomes (what to find, not how)
  • Use clear, numbered steps for RCA instructions
  • Include context about your environment
  • Test before deploying

❌ Don’t:

  • Write vague instructions (“check performance”)
  • Reference tools you don’t have (“check EXPLAIN output” if not logged)
  • Create instructions for edge cases
  • Make instructions too long (keep under 300 words)
  • Add instructions without testing
  • Duplicate information across instructions

Test multiple scenarios:

1. Test on the incident type you're targeting
2. Test on a related but different incident type
3. Test on an unrelated incident type

This ensures your instruction helps the right cases and doesn’t hurt others.

Iterate based on results:

  • First version too broad? Make it more specific
  • Missing important checks? Add more steps
  • Too prescriptive? Make it more flexible
  • Not improving results? Refine or discard

Review regularly:

Show me all active instructions for Production project

Disable underperforming instructions:

Disable the instruction "Database Performance Checks"

Update instructions as your system evolves:

  • New services added? Update SYSTEM instructions
  • New monitoring tools? Update RCA instructions
  • Alert patterns changed? Update FILTER instructions

RCA instruction:
For incidents affecting the payment-service:
1. Check Stripe API response times and error rates
2. Review payment transaction logs for failures
3. Examine database connection pool for payment DB
4. Check for PCI compliance logging issues
5. Verify webhook delivery status
6. Look for rate limiting from Stripe
SYSTEM instruction:
Scheduled maintenance:
- Database backups: Daily 2-3am EST
- Deployment windows: Tuesday/Thursday 10am EST
- Cache warmup: After each deployment (5-10 min)
- Traffic patterns: Peak 2-6pm EST weekdays
Consider timing when analyzing incident causes.
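Timing checks like this are also easy to automate on your side. A minimal sketch, assuming the nightly backup window from the instruction above and timestamps already converted to EST:

```python
from datetime import datetime, time

# Assumed window taken from the SYSTEM instruction above (times in EST).
BACKUP_WINDOW = (time(2, 0), time(3, 0))

def during_backup_window(ts: datetime) -> bool:
    """True if ts (already in EST) falls inside the nightly backup window."""
    start, end = BACKUP_WINDOW
    return start <= ts.time() < end
```

An incident at 2:30am EST would flag as overlapping the backup window, suggesting backup load as a candidate cause; one at 10am would not.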
GROUPING instruction:
Group incidents when they occur within 15 minutes and:
- Multiple services report connection timeouts
- Database or Redis metrics show anomalies
- Gateway/load balancer shows elevated errors
- Upstream service degradation detected
Likely cascading failure from shared dependency.
FILTER instruction:
Do NOT investigate:
- Alerts from dev/staging environments
- Load test results from automated testing
- Synthetic monitoring checks (exclude alert tag: synthetic)
- Auto-scaling events (normal operational behavior)
- Alerts during maintenance windows (2-3am EST)

Problem: The instruction doesn't improve investigations.

Possible causes:

  1. Instruction too vague
  2. Data sources don’t support the checks
  3. Instruction targets wrong incident type
  4. System context missing

Solution: Refine and test again with more specific guidance.

Problem: Investigations got worse after adding the instruction.

Possible causes:

  1. Too prescriptive (limiting NeuBird’s analysis)
  2. Incorrect assumptions about infrastructure
  3. Conflicts with other instructions

Solution: Disable, refine, or remove the instruction.

Problem: The instruction doesn't seem to be applied.

Check:

  1. Instruction is active (not disabled)
  2. Instruction type matches use case
  3. No conflicting instructions
  4. Project has the instruction

Show me instruction details for [instruction-uuid]
Next steps:

  • Running Investigations — apply your instructions to real incidents
  • Managing Connections — connect data sources for investigations
  • Advanced Workflows — power user techniques