Prompt Engineering for Investigations
Learn how to write effective prompts for manual investigations that produce high-quality root cause analyses.
Why Prompt Quality Matters
The quality of your investigation prompt directly affects:
- Investigation accuracy - Better prompts lead to more precise root cause identification
- Time to results - Specific prompts help NeuBird focus on relevant data sources
- Actionability - Clear context produces more targeted corrective actions
- RCA completeness - Detailed prompts result in thorough analysis
The Four Essential Elements
Every great investigation prompt includes these four elements:
1. Service/Resource Names
Why it matters: Tells NeuBird exactly which components to investigate.
✅ Good:
Investigate high latency in payment-service API
❌ Bad:
Investigate the API
Examples:
- user-authentication-service
- postgres-primary database
- production-web-server-01
- payment-processor Lambda function
- api-gateway in us-east-1
2. Time Frame
Why it matters: Narrows down logs, metrics, and events to the relevant period.
✅ Good:
Between 2pm-3pm EST on January 15, 2025
❌ Bad:
Recently
Earlier today
Examples:
- January 15, 2025 at 2:30pm UTC
- Between 8am-9am PST yesterday
- During the last deployment (3:45pm EST)
- Starting around 10:00am UTC today
- Last 30 minutes
Tips:
- Always include timezone (UTC, EST, PST); see the timestamp-formatting sketch after these tips
- Use specific dates and times
- Provide time ranges when the exact moment is unclear
- Reference deployment times or other known events
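If you script any of your incident tooling, you can generate unambiguous timestamps instead of typing them by hand. A minimal Python sketch; the output format is just a convention matching the examples above, not anything NeuBird requires:

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # standard library since Python 3.9

def prompt_timestamp(dt: datetime) -> str:
    """Render a timezone-aware datetime as e.g. 'January 15, 2025 at 2:30pm EST'."""
    hour = dt.hour % 12 or 12                 # 12-hour clock without zero-padding
    ampm = "am" if dt.hour < 12 else "pm"
    return f"{dt:%B} {dt.day}, {dt.year} at {hour}:{dt:%M}{ampm} {dt:%Z}"

incident_start = datetime(2025, 1, 15, 14, 30, tzinfo=ZoneInfo("America/New_York"))
print(prompt_timestamp(incident_start))       # January 15, 2025 at 2:30pm EST
```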
3. Symptoms
Why it matters: Describes what’s wrong so NeuBird knows what to look for.
✅ Good:
Users reported checkout taking 30+ seconds instead of the usual 2 seconds
❌ Bad:
Something is slow
Examples:
- Response times increased from 200ms to 5 seconds
- Memory usage climbing from 500MB to 2GB
- Connection pool exhausted - max of 100 connections reached
- Multiple services returned 503 errors
- CPU spiked to 95% on all nodes
- Disk I/O wait time exceeded 80%
Tips:
- Include specific metrics and thresholds
- Compare against normal baselines
- Describe user-facing impact when relevant
4. Context
Why it matters: Provides additional clues that help focus the investigation.
✅ Good:
This occurred immediately after deploying version 2.1.5
❌ Bad:
We deployed something
Examples:
- Started after database migration to Postgres 14
- Coincided with Black Friday traffic spike
- Happened during scheduled backup window
- No recent deployments or configuration changes
- Affects only users in EU region
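One way to make the four elements hard to forget is to encode them as required fields in whatever tooling you use to draft prompts. A hypothetical Python sketch; the `InvestigationPrompt` class is illustrative, not part of any NeuBird SDK:

```python
from dataclasses import dataclass, fields

@dataclass
class InvestigationPrompt:
    service: str    # 1. what to investigate, naming the exact service/resource
    timeframe: str  # 2. specific date/time with timezone
    symptoms: str   # 3. quantified symptoms, compared against a baseline
    context: str    # 4. deployments, traffic, or related changes

    def render(self) -> str:
        # Refuse to build a prompt with any essential element left blank.
        for f in fields(self):
            if not getattr(self, f.name).strip():
                raise ValueError(f"missing essential element: {f.name}")
        return (f"Investigate {self.service}. {self.timeframe}. "
                f"{self.symptoms}. {self.context}.")

print(InvestigationPrompt(
    service="high latency in payment-service API",
    timeframe="Between 2pm-3pm EST on January 15, 2025",
    symptoms="Checkout taking 30+ seconds instead of the usual 2 seconds",
    context="Started immediately after deploying version 2.1.5",
).render())
```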
Complete Prompt Examples
Example 1: Memory Leak
Investigate memory leak in user-api pods in production namespace.
Started around 8am UTC today. Memory usage climbing from 500MB to 2GB over 3 hours.
No recent deployments. Using Kubernetes 1.28.
Why this works:
- ✅ Service name: user-api pods in production namespace
- ✅ Time frame: 8am UTC today, over 3 hours
- ✅ Symptoms: Memory climbing from 500MB to 2GB
- ✅ Context: No recent deployments, Kubernetes 1.28
Example 2: Database Performance
Analyze database connection pool exhaustion in postgres-primary.
Between 1pm-2pm EST yesterday. Connection count hit max of 100, causing timeouts.
Started during afternoon traffic peak. Connection timeout errors in application logs.
Why this works:
- ✅ Service name: postgres-primary
- ✅ Time frame: 1pm-2pm EST yesterday
- ✅ Symptoms: Connection pool maxed at 100, timeouts
- ✅ Context: Afternoon traffic peak, timeout errors in logs
Example 3: Deployment Issue
Check for cascading failures in microservices during deployment.
January 15, 2025 at 3:15pm UTC. Multiple services returned 503 errors.
Deployment of api-gateway v2.3.0 triggered the issue. Auth and payment services affected.
Why this works:
- ✅ Service name: microservices (specifically auth, payment, api-gateway)
- ✅ Time frame: January 15, 2025 at 3:15pm UTC
- ✅ Symptoms: 503 errors across multiple services
- ✅ Context: api-gateway v2.3.0 deployment triggered it
Example 4: Latency Spike
Investigate API latency spike in checkout-service.
January 20, 2025 between 2pm-3pm EST. Response time increased from 200ms to 8 seconds.
Users reported slow checkout during promotion campaign. Database queries appear normal.
Why this works:
- ✅ Service name: checkout-service
- ✅ Time frame: January 20, 2025, 2pm-3pm EST
- ✅ Symptoms: 200ms to 8 seconds response time
- ✅ Context: Promotion campaign, DB queries normal
Common Mistakes and How to Fix Them
Mistake 1: Too Vague
❌ Bad:
Something is wrong with the payment service
✅ Good:
Investigate payment-api service returning 500 errors.
Started at 2:30pm UTC today. Error rate jumped from 0.1% to 15%.
Coincided with deployment of v3.2.1.
Mistake 2: Missing Time Frame
❌ Bad:
Check why the database is slow
✅ Good:
Investigate slow database queries in postgres-primary.
Between 9am-10am PST on January 18, 2025.
Query response time increased from 50ms to 3 seconds.
Mistake 3: No Specific Metrics
❌ Bad:
The API is having problems
✅ Good:
Investigate API errors in user-service.
January 16, 2025 at 4:45pm EST.
Error rate at 23% (up from baseline of 0.5%). Returning 504 Gateway Timeout.
Mistake 4: Unclear Component Names
❌ Bad:
Check the server issue
✅ Good:
Investigate high CPU usage on prod-web-server-03.
Started at 11:30am UTC today. CPU usage at 92% (normal is 20-30%).
Running nginx 1.24 and serving API traffic.
Advanced Prompt Techniques
For Complex Multi-Service Issues
When investigating issues affecting multiple services:
Investigate cascading failures across microservices architecture.
January 22, 2025 between 5pm-6pm UTC.
Symptoms:
- api-gateway: 503 errors (50% of requests)
- auth-service: increased latency (500ms to 3s)
- user-db: connection pool at 95% capacity
- payment-service: timeouts communicating with Stripe API
Context: Started during evening traffic peak. No deployments in last 24 hours.
Auth-service appears to be the bottleneck.
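If the per-service symptoms already live in a dict (say, collected from your alerting), formatting that block is mechanical. A small hypothetical Python sketch:

```python
def symptoms_block(observations: dict[str, str]) -> str:
    """Format a service -> symptom mapping as the Symptoms section of a prompt."""
    lines = [f"- {service}: {symptom}" for service, symptom in observations.items()]
    return "Symptoms:\n" + "\n".join(lines)

print(symptoms_block({
    "api-gateway": "503 errors (50% of requests)",
    "auth-service": "increased latency (500ms to 3s)",
    "user-db": "connection pool at 95% capacity",
    "payment-service": "timeouts communicating with Stripe API",
}))
```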
For Historical Analysis
When investigating past incidents:
Analyze last week's outage for a post-mortem.
January 15, 2025 from 2:45pm EST.
What happened: Complete service outage affecting all customers.
Services impacted: All production services in us-east-1.
Resolution: Service restored after reverting deployment.
Please provide a detailed timeline, root cause, and prevention recommendations.
For Proactive Investigations
When investigating potential issues before they become incidents:
Investigate gradual memory increase trend in worker-service pods.
Over the last 7 days. Memory usage growing from 1GB to 2.5GB steadily.
Not yet causing issues but approaching pod memory limit of 3GB.
Want to understand if this is a leak or expected growth from increased load.
Agent Behavior with Poor Prompts
If your prompt is too vague, Claude will ask clarifying questions:
Your vague prompt:
Investigate the payment issue
Claude’s response:
I need more details to create an effective investigation:
1. Which payment service? (payment-api, payment-processor, payment-gateway, etc.)
2. When did this occur? (specific date/time or time range)
3. What symptoms did you observe?
   - Error messages?
   - Slow response times?
   - Failed transactions?
4. Any additional context?
   - Recent deployments?
   - Traffic changes?
   - Related system changes?
Please provide these details for a more accurate investigation.
Prompt Templates
Template: Service Performance Issue
Investigate [performance issue type] in [service-name].
[Specific date and time with timezone].
[Metric] increased/decreased from [baseline] to [current value].
[Any relevant context about deployments, traffic, or changes].
Template: Error Rate Spike
Investigate error spike in [service-name].
[Date and time range with timezone].
Error rate at [X]% (normal baseline is [Y]%).
[Error type/code]. [User impact description].
[Context about what changed or was happening].
Template: Resource Exhaustion
Investigate [resource type] exhaustion in [service/component].
[Date and time with timezone].
[Resource metric] reached [X] (limit is [Y], normal is [Z]).
[Symptoms observed]. [Related services or impacts].
[Context about load, deployments, or configuration].
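The bracketed placeholders map naturally onto `str.format` fields, so a template can live in code and be filled in per incident. A minimal Python sketch for the first template; the field names are illustrative, and "increased/decreased" is simplified to "increased":

```python
# Hypothetical template string mirroring the "Service Performance Issue" template above.
PERFORMANCE_TEMPLATE = (
    "Investigate {issue_type} in {service_name}. "
    "{timestamp}. "
    "{metric} increased from {baseline} to {current}. "
    "{context}."
)

print(PERFORMANCE_TEMPLATE.format(
    issue_type="API latency spike",
    service_name="checkout-service",
    timestamp="January 20, 2025 between 2pm-3pm EST",
    metric="Response time",
    baseline="200ms",
    current="8 seconds",
    context="Users reported slow checkout during a promotion campaign",
))
```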
Testing Your Prompts
Before creating an investigation, mentally check (a rough automated version of this checklist follows the list):
- Can someone unfamiliar with the issue understand what happened?
- Is the time frame specific enough to narrow down logs?
- Are service names exact (as they appear in your systems)?
- Do metrics include actual numbers, not just “high” or “slow”?
- Is there enough context to guide the investigation?
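A heuristic Python sketch of these checks. The patterns below are assumptions about common phrasing, not rules NeuBird enforces, and they complement rather than replace the mental checklist:

```python
import re

def prompt_warnings(prompt: str) -> list[str]:
    """Flag likely gaps in an investigation prompt (heuristic, not authoritative)."""
    warnings = []
    if not re.search(r"\b(UTC|EST|PST|GMT)\b", prompt):
        warnings.append("no timezone found - add UTC/EST/PST etc.")
    if not re.search(r"\d", prompt):
        warnings.append("no numbers - add metrics, times, or thresholds")
    if re.search(r"\b(recently|earlier today|something)\b", prompt, re.IGNORECASE):
        warnings.append("vague relative terms detected")
    if len(prompt.split()) < 15:
        warnings.append("very short - likely missing context")
    return warnings

print(prompt_warnings("Investigate the payment issue"))
# ['no timezone found - add UTC/EST/PST etc.',
#  'no numbers - add metrics, times, or thresholds',
#  'very short - likely missing context']
```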
Quality Indicators
Signs of a good prompt:
- 2-4 sentences with specific details
- Includes all four essential elements
- Uses exact service names from your infrastructure
- Provides quantitative metrics
- Gives enough context without being overly long
Signs of a poor prompt:
- Single vague sentence
- Uses relative terms (“recently”, “the service”, “something”)
- No specific times or metrics
- Missing context about what’s abnormal
Best Practices Summary
- Be specific - Use exact service names, not generic terms
- Include timestamps - Always specify timezone
- Quantify symptoms - Use actual metrics and thresholds
- Provide context - Mention deployments, traffic, or related changes
- Keep it focused - One issue per investigation
- Use your monitoring terms - Match names from your dashboards and alerts
Next Steps
Ready to start investigating? Check out these guides:
- Running Investigations - Complete investigation workflows
- Manual Investigations - Deep dive into manual investigations
- Using Instructions - Customize investigation behavior
Pro tip: Save your best prompts as templates. When similar issues occur, you can quickly adapt proven prompts for faster, more consistent investigations.