
Prompt Engineering

Learn how to write effective prompts for manual investigations that produce high-quality root cause analyses.

The quality of your investigation prompt directly affects:

  • Investigation accuracy - Better prompts lead to more precise root cause identification
  • Time to results - Specific prompts help NeuBird focus on relevant data sources
  • Actionability - Clear context produces more targeted corrective actions
  • RCA completeness - Detailed prompts result in thorough analysis

Every great investigation prompt includes these four elements:

1. Service or component name

Why it matters: Tells NeuBird exactly which components to investigate.

✅ Good:

Investigate high latency in payment-service API

❌ Bad:

Investigate the API

Examples:

  • user-authentication-service
  • postgres-primary database
  • production-web-server-01
  • payment-processor Lambda function
  • api-gateway in us-east-1

2. Time frame

Why it matters: Narrows down logs, metrics, and events to the relevant period.

✅ Good:

Between 2pm-3pm EST on January 15, 2025

❌ Bad:

Recently
Earlier today

Examples:

  • January 15, 2025 at 2:30pm UTC
  • Between 8am-9am PST yesterday
  • During the last deployment (3:45pm EST)
  • Starting around 10:00am UTC today
  • Last 30 minutes

Tips:

  • Always include timezone (UTC, EST, PST)
  • Use specific dates and times
  • Provide time ranges when the exact moment is unclear
  • Reference deployment times or other known events
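If you are working from an alert or log timestamp, it can help to generate the time window programmatically so the timezone is never dropped. Here is a minimal Python sketch (the window values are illustrative):

from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

# Example: an alert fired at 2:00pm US Eastern; build an unambiguous one-hour window.
start = datetime(2025, 1, 15, 14, 0, tzinfo=ZoneInfo("America/New_York"))
end = start + timedelta(hours=1)

window = (f"Between {start.strftime('%I:%M%p')}-{end.strftime('%I:%M%p')} "
          f"{start.strftime('%Z')} on {start.strftime('%B %d, %Y')}")
print(window)  # Between 02:00PM-03:00PM EST on January 15, 2025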

3. Symptoms

Why it matters: Describes what’s wrong so NeuBird knows what to look for.

✅ Good:

Users reported checkout taking 30+ seconds instead of the usual 2 seconds

❌ Bad:

Something is slow

Examples:

  • Response times increased from 200ms to 5 seconds
  • Memory usage climbing from 500MB to 2GB
  • Connection pool exhausted - max of 100 connections reached
  • Multiple services returned 503 errors
  • CPU spiked to 95% on all nodes
  • Disk I/O wait time exceeded 80%

Tips:

  • Include specific metrics and thresholds
  • Compare against normal baselines
  • Describe user-facing impact when relevant

4. Context

Why it matters: Provides additional clues that help focus the investigation.

✅ Good:

This occurred immediately after deploying version 2.1.5

❌ Bad:

We deployed something

Examples:

  • Started after database migration to Postgres 14
  • Coincided with Black Friday traffic spike
  • Happened during scheduled backup window
  • No recent deployments or configuration changes
  • Affects only users in EU region
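Once you have all four elements, you can write the prompt by hand or assemble it with a small helper. The sketch below is hypothetical (the function and field names are not part of NeuBird); it simply refuses to build a prompt when an element is missing. The worked examples that follow show the same structure written out manually.

# Hypothetical helper (not a NeuBird API): combine the four essential
# elements into one prompt and fail loudly if any element is missing.
def build_prompt(service: str, time_frame: str, symptoms: str, context: str) -> str:
    elements = {"service name": service, "time frame": time_frame,
                "symptoms": symptoms, "context": context}
    missing = [name for name, value in elements.items() if not value.strip()]
    if missing:
        raise ValueError(f"Prompt is missing: {', '.join(missing)}")
    return f"Investigate {symptoms} in {service}. {time_frame}. {context}."

print(build_prompt(
    service="user-api pods in the production namespace",
    time_frame="Started around 8am UTC today",
    symptoms="memory usage climbing from 500MB to 2GB over 3 hours",
    context="No recent deployments; running Kubernetes 1.28",
))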

Example 1: Memory leak in Kubernetes pods

Investigate memory leak in user-api pods in production namespace.
Started around 8am UTC today. Memory usage climbing from 500MB to 2GB over 3 hours.
No recent deployments. Using Kubernetes 1.28.

Why this works:

  • ✅ Service name: user-api pods in production namespace
  • ✅ Time frame: 8am UTC today, over 3 hours
  • ✅ Symptoms: Memory climbing from 500MB to 2GB
  • ✅ Context: No recent deployments, Kubernetes 1.28

Example 2: Database connection pool exhaustion

Analyze database connection pool exhaustion in postgres-primary.
Between 1pm-2pm EST yesterday. Connection count hit max of 100, causing timeouts.
Started during afternoon traffic peak. Connection timeout errors in application logs.

Why this works:

  • ✅ Service name: postgres-primary
  • ✅ Time frame: 1pm-2pm EST yesterday
  • ✅ Symptoms: Connection pool maxed at 100, timeouts
  • ✅ Context: Afternoon traffic peak, timeout errors in logs

Example 3: Cascading failures during a deployment

Check for cascading failures in microservices during deployment.
January 15, 2025 at 3:15pm UTC. Multiple services returned 503 errors.
Deployment of api-gateway v2.3.0 triggered the issue. Auth and payment services affected.

Why this works:

  • ✅ Service name: microservices (specifically auth, payment, api-gateway)
  • ✅ Time frame: January 15, 2025 at 3:15pm UTC
  • ✅ Symptoms: 503 errors across multiple services
  • ✅ Context: api-gateway v2.3.0 deployment triggered it

Example 4: API latency spike

Investigate API latency spike in checkout-service.
January 20, 2025 between 2pm-3pm EST. Response time increased from 200ms to 8 seconds.
Users reported slow checkout during promotion campaign. Database queries appear normal.

Why this works:

  • ✅ Service name: checkout-service
  • ✅ Time frame: January 20, 2025, 2pm-3pm EST
  • ✅ Symptoms: 200ms to 8 seconds response time
  • ✅ Context: Promotion campaign, DB queries normal

Here are more before-and-after rewrites showing how to sharpen vague prompts:

❌ Bad:

Something is wrong with the payment service

✅ Good:

Investigate payment-api service returning 500 errors.
Started at 2:30pm UTC today. Error rate jumped from 0.1% to 15%.
Coincided with deployment of v3.2.1.

❌ Bad:

Check why the database is slow

✅ Good:

Investigate slow database queries in postgres-primary.
Between 9am-10am PST on January 18, 2025.
Query response time increased from 50ms to 3 seconds.

❌ Bad:

The API is having problems

✅ Good:

Investigate API errors in user-service.
January 16, 2025 at 4:45pm EST.
Error rate at 23% (up from baseline of 0.5%). Returning 504 Gateway Timeout.

❌ Bad:

Check the server issue

✅ Good:

Investigate high CPU usage on prod-web-server-03.
Started at 11:30am UTC today. CPU usage at 92% (normal is 20-30%).
Running nginx 1.24 and serving API traffic.

When investigating issues affecting multiple services:

Investigate cascading failures across microservices architecture.
January 22, 2025 between 5pm-6pm UTC.
Symptoms:
- api-gateway: 503 errors (50% of requests)
- auth-service: increased latency (500ms to 3s)
- user-db: connection pool at 95% capacity
- payment-service: timeouts communicating with Stripe API
Context: Started during evening traffic peak. No deployments in last 24 hours.
Auth-service appears to be the bottleneck.

When investigating past incidents:

Analyze last week's outage for a post-mortem.
January 15, 2025, starting at 2:45pm EST.
What happened: Complete service outage affecting all customers.
Services impacted: All production services in us-east-1.
Resolution: Service restored after reverting deployment.
Please provide detailed timeline, root cause, and prevention recommendations.

When investigating potential issues before they become incidents:

Investigate gradual memory increase trend in worker-service pods.
Over the last 7 days. Memory usage growing from 1GB to 2.5GB steadily.
Not yet causing issues but approaching pod memory limit of 3GB.
Want to understand if this is a leak or expected growth from increased load.

If your prompt is too vague, NeuBird will ask clarifying questions:

Your vague prompt:

Investigate the payment issue

NeuBird’s response:

I need more details to create an effective investigation:
1. Which payment service? (payment-api, payment-processor, payment-gateway, etc.)
2. When did this occur? (specific date/time or time range)
3. What symptoms did you observe?
- Error messages?
- Slow response times?
- Failed transactions?
4. Any additional context?
- Recent deployments?
- Traffic changes?
- Related system changes?
Please provide these details for a more accurate investigation.

Use these fill-in-the-blank templates as starting points:

Performance issue template:

Investigate [performance issue type] in [service-name].
[Specific date and time with timezone].
[Metric] increased/decreased from [baseline] to [current value].
[Any relevant context about deployments, traffic, or changes].

Error spike template:

Investigate error spike in [service-name].
[Date and time range with timezone].
Error rate at [X]% (normal baseline is [Y]%).
[Error type/code]. [User impact description].
[Context about what changed or was happening].

Resource exhaustion template:

Investigate [resource type] exhaustion in [service/component].
[Date and time with timezone].
[Resource metric] reached [X] (limit is [Y], normal is [Z]).
[Symptoms observed]. [Related services or impacts].
[Context about load, deployments, or configuration].
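If you generate prompts from alert payloads or runbooks, the same templates can be filled programmatically. A minimal sketch using Python string formatting (the field names mirror the error spike template above and the values are illustrative):

# Fill the error spike template from structured fields.
ERROR_SPIKE_TEMPLATE = (
    "Investigate error spike in {service}.\n"
    "{time_range}.\n"
    "Error rate at {current_rate}% (normal baseline is {baseline_rate}%).\n"
    "{error_type}. {user_impact}.\n"
    "{context}."
)

prompt = ERROR_SPIKE_TEMPLATE.format(
    service="payment-api",
    time_range="January 16, 2025 between 4:30pm-5:00pm UTC",
    current_rate=15,
    baseline_rate=0.1,
    error_type="HTTP 500 errors from the checkout endpoint",
    user_impact="Roughly 1 in 7 checkout attempts failing",
    context="Coincided with deployment of v3.2.1",
)
print(prompt)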

Before creating an investigation, mentally check:

  • Can someone unfamiliar with the issue understand what happened?
  • Is the time frame specific enough to narrow down logs?
  • Are service names exact (as they appear in your systems)?
  • Do metrics include actual numbers, not just “high” or “slow”?
  • Is there enough context to guide the investigation?

Signs of a good prompt:

  • 2-4 sentences with specific details
  • Includes all four essential elements
  • Uses exact service names from your infrastructure
  • Provides quantitative metrics
  • Gives enough context without being overly long

Signs of a poor prompt:

  • Single vague sentence
  • Uses relative terms (“recently”, “the service”, “something”)
  • No specific times or metrics
  • Missing context about what’s abnormal
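Some of these checks can be automated before you submit a prompt. The sketch below is a rough, illustrative linter, not an official validator; the vague-term list and rules are examples you can adapt:

import re

# Terms that usually signal a prompt is too vague to investigate efficiently.
VAGUE_TERMS = ("recently", "earlier today", "something", "the service", "slow")

def review_prompt(prompt: str) -> list[str]:
    """Return warnings for common signs of a weak investigation prompt."""
    warnings = []
    lowered = prompt.lower()
    if not re.search(r"\b(utc|est|pst|gmt)\b", lowered):
        warnings.append("No timezone found - include UTC/EST/PST with times.")
    if not re.search(r"\d", prompt):
        warnings.append("No numbers found - add metrics, thresholds, or times.")
    for term in VAGUE_TERMS:
        if term in lowered:
            warnings.append(f"Vague wording: '{term}' - replace with specifics.")
    return warnings

for warning in review_prompt("Something went wrong with the service recently"):
    print(warning)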

Best practices:

  1. Be specific - Use exact service names, not generic terms
  2. Include timestamps - Always specify timezone
  3. Quantify symptoms - Use actual metrics and thresholds
  4. Provide context - Mention deployments, traffic, or related changes
  5. Keep it focused - One issue per investigation
  6. Use your monitoring terms - Match names from your dashboards and alerts

Ready to start investigating? Check out the related guides in this documentation.


Pro tip: Save your best prompts as templates. When similar issues occur, you can quickly adapt proven prompts for faster, more consistent investigations.