
Prompt Engineering

Learn how to write effective prompts for manual investigations that produce high-quality root cause analyses.

The quality of your investigation prompt directly affects:

  • Investigation accuracy - Better prompts lead to more precise root cause identification
  • Time to results - Specific prompts help NeuBird focus on relevant data sources
  • Actionability - Clear context produces more targeted corrective actions
  • RCA completeness - Detailed prompts result in thorough analysis

Every great investigation prompt includes these four elements:

1. Service or component name

Why it matters: Tells NeuBird exactly which components to investigate.

✅ Good:

Investigate high latency in payment-service API

❌ Bad:

Investigate the API

Examples:

  • user-authentication-service
  • postgres-primary database
  • production-web-server-01
  • payment-processor Lambda function
  • api-gateway in us-east-1

2. Time frame

Why it matters: Narrows down logs, metrics, and events to the relevant period.

✅ Good:

Between 2pm-3pm EST on January 15, 2025

❌ Bad:

Recently
Earlier today

Examples:

  • January 15, 2025 at 2:30pm UTC
  • Between 8am-9am PST yesterday
  • During the last deployment (3:45pm EST)
  • Starting around 10:00am UTC today
  • Last 30 minutes

Tips:

  • Always include timezone (UTC, EST, PST)
  • Use specific dates and times
  • Provide time ranges when the exact moment is unclear
  • Reference deployment times or other known events
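If you are working from an alert or log timestamp, it can help to generate the time window programmatically so the timezone is never dropped. Here is a minimal Python sketch (the window values are illustrative):

from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

# Example: an alert fired at 2:00pm US Eastern; build an unambiguous one-hour window.
start = datetime(2025, 1, 15, 14, 0, tzinfo=ZoneInfo("America/New_York"))
end = start + timedelta(hours=1)

window = (f"Between {start.strftime('%I:%M%p')}-{end.strftime('%I:%M%p')} "
          f"{start.strftime('%Z')} on {start.strftime('%B %d, %Y')}")
print(window)  # Between 02:00PM-03:00PM EST on January 15, 2025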

3. Symptoms

Why it matters: Describes what’s wrong so NeuBird knows what to look for.

✅ Good:

Users reported checkout taking 30+ seconds instead of the usual 2 seconds

❌ Bad:

Something is slow

Examples:

  • Response times increased from 200ms to 5 seconds
  • Memory usage climbing from 500MB to 2GB
  • Connection pool exhausted - max of 100 connections reached
  • Multiple services returned 503 errors
  • CPU spiked to 95% on all nodes
  • Disk I/O wait time exceeded 80%

Tips:

  • Include specific metrics and thresholds
  • Compare against normal baselines
  • Describe user-facing impact when relevant

4. Context

Why it matters: Provides additional clues that help focus the investigation.

✅ Good:

This occurred immediately after deploying version 2.1.5

❌ Bad:

We deployed something

Examples:

  • Started after database migration to Postgres 14
  • Coincided with Black Friday traffic spike
  • Happened during scheduled backup window
  • No recent deployments or configuration changes
  • Affects only users in EU region
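Once you have all four elements, you can write the prompt by hand or assemble it with a small helper. The sketch below is hypothetical (the function and field names are not part of NeuBird); it simply refuses to build a prompt when an element is missing. The worked examples that follow show the same structure written out manually.

# Hypothetical helper (not a NeuBird API): combine the four essential
# elements into one prompt and fail loudly if any element is missing.
def build_prompt(service: str, time_frame: str, symptoms: str, context: str) -> str:
    elements = {"service name": service, "time frame": time_frame,
                "symptoms": symptoms, "context": context}
    missing = [name for name, value in elements.items() if not value.strip()]
    if missing:
        raise ValueError(f"Prompt is missing: {', '.join(missing)}")
    return f"Investigate {symptoms} in {service}. {time_frame}. {context}."

print(build_prompt(
    service="user-api pods in the production namespace",
    time_frame="Started around 8am UTC today",
    symptoms="memory usage climbing from 500MB to 2GB over 3 hours",
    context="No recent deployments; running Kubernetes 1.28",
))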

Example 1: Memory leak in Kubernetes pods

Investigate memory leak in user-api pods in production namespace.
Started around 8am UTC today. Memory usage climbing from 500MB to 2GB over 3 hours.
No recent deployments. Using Kubernetes 1.28.

Why this works:

  • ✅ Service name: user-api pods in production namespace
  • ✅ Time frame: 8am UTC today, over 3 hours
  • ✅ Symptoms: Memory climbing from 500MB to 2GB
  • ✅ Context: No recent deployments, Kubernetes 1.28

Example 2: Database connection pool exhaustion

Analyze database connection pool exhaustion in postgres-primary.
Between 1pm-2pm EST yesterday. Connection count hit max of 100, causing timeouts.
Started during afternoon traffic peak. Connection timeout errors in application logs.

Why this works:

  • ✅ Service name: postgres-primary
  • ✅ Time frame: 1pm-2pm EST yesterday
  • ✅ Symptoms: Connection pool maxed at 100, timeouts
  • ✅ Context: Afternoon traffic peak, timeout errors in logs

Example 3: Cascading failures during a deployment

Check for cascading failures in microservices during deployment.
January 15, 2025 at 3:15pm UTC. Multiple services returned 503 errors.
Deployment of api-gateway v2.3.0 triggered the issue. Auth and payment services affected.

Why this works:

  • ✅ Service name: microservices (specifically auth, payment, api-gateway)
  • ✅ Time frame: January 15, 2025 at 3:15pm UTC
  • ✅ Symptoms: 503 errors across multiple services
  • ✅ Context: api-gateway v2.3.0 deployment triggered it

Example 4: API latency spike

Investigate API latency spike in checkout-service.
January 20, 2025 between 2pm-3pm EST. Response time increased from 200ms to 8 seconds.
Users reported slow checkout during promotion campaign. Database queries appear normal.

Why this works:

  • ✅ Service name: checkout-service
  • ✅ Time frame: January 20, 2025, 2pm-3pm EST
  • ✅ Symptoms: 200ms to 8 seconds response time
  • ✅ Context: Promotion campaign, DB queries normal

Here are more before-and-after rewrites showing how to sharpen vague prompts:

❌ Bad:

Something is wrong with the payment service

✅ Good:

Investigate payment-api service returning 500 errors.
Started at 2:30pm UTC today. Error rate jumped from 0.1% to 15%.
Coincided with deployment of v3.2.1.

❌ Bad:

Check why the database is slow

✅ Good:

Investigate slow database queries in postgres-primary.
Between 9am-10am PST on January 18, 2025.
Query response time increased from 50ms to 3 seconds.

❌ Bad:

The API is having problems

✅ Good:

Investigate API errors in user-service.
January 16, 2025 at 4:45pm EST.
Error rate at 23% (up from baseline of 0.5%). Returning 504 Gateway Timeout.

❌ Bad:

Check the server issue

✅ Good:

Investigate high CPU usage on prod-web-server-03.
Started at 11:30am UTC today. CPU usage at 92% (normal is 20-30%).
Running nginx 1.24 and serving API traffic.

When investigating issues affecting multiple services:

Investigate cascading failures across microservices architecture.
January 22, 2025 between 5pm-6pm UTC.
Symptoms:
- api-gateway: 503 errors (50% of requests)
- auth-service: increased latency (500ms to 3s)
- user-db: connection pool at 95% capacity
- payment-service: timeouts communicating with Stripe API
Context: Started during evening traffic peak. No deployments in last 24 hours.
Auth-service appears to be the bottleneck.

When investigating past incidents:

Analyze last week's outage for a post-mortem.
January 15, 2025, starting at 2:45pm EST.
What happened: Complete service outage affecting all customers.
Services impacted: All production services in us-east-1.
Resolution: Service restored after reverting deployment.
Please provide detailed timeline, root cause, and prevention recommendations.

When investigating potential issues before they become incidents:

Investigate gradual memory increase trend in worker-service pods.
Over the last 7 days. Memory usage growing from 1GB to 2.5GB steadily.
Not yet causing issues but approaching pod memory limit of 3GB.
Want to understand if this is a leak or expected growth from increased load.

If your prompt is too vague, NeuBird will ask clarifying questions:

Your vague prompt:

Investigate the payment issue

NeuBird’s response:

I need more details to create an effective investigation:
1. Which payment service? (payment-api, payment-processor, payment-gateway, etc.)
2. When did this occur? (specific date/time or time range)
3. What symptoms did you observe?
- Error messages?
- Slow response times?
- Failed transactions?
4. Any additional context?
- Recent deployments?
- Traffic changes?
- Related system changes?
Please provide these details for a more accurate investigation.

Use these fill-in-the-blank templates as starting points:

Performance issue template:

Investigate [performance issue type] in [service-name].
[Specific date and time with timezone].
[Metric] increased/decreased from [baseline] to [current value].
[Any relevant context about deployments, traffic, or changes].

Error spike template:

Investigate error spike in [service-name].
[Date and time range with timezone].
Error rate at [X]% (normal baseline is [Y]%).
[Error type/code]. [User impact description].
[Context about what changed or was happening].

Resource exhaustion template:

Investigate [resource type] exhaustion in [service/component].
[Date and time with timezone].
[Resource metric] reached [X] (limit is [Y], normal is [Z]).
[Symptoms observed]. [Related services or impacts].
[Context about load, deployments, or configuration].
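If you generate prompts from alert payloads or runbooks, the same templates can be filled programmatically. A minimal sketch using Python string formatting (the field names mirror the error spike template above and the values are illustrative):

# Fill the error spike template from structured fields.
ERROR_SPIKE_TEMPLATE = (
    "Investigate error spike in {service}.\n"
    "{time_range}.\n"
    "Error rate at {current_rate}% (normal baseline is {baseline_rate}%).\n"
    "{error_type}. {user_impact}.\n"
    "{context}."
)

prompt = ERROR_SPIKE_TEMPLATE.format(
    service="payment-api",
    time_range="January 16, 2025 between 4:30pm-5:00pm UTC",
    current_rate=15,
    baseline_rate=0.1,
    error_type="HTTP 500 errors from the checkout endpoint",
    user_impact="Roughly 1 in 7 checkout attempts failing",
    context="Coincided with deployment of v3.2.1",
)
print(prompt)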

Before creating an investigation, mentally check:

  • Can someone unfamiliar with the issue understand what happened?
  • Is the time frame specific enough to narrow down logs?
  • Are service names exact (as they appear in your systems)?
  • Do metrics include actual numbers, not just “high” or “slow”?
  • Is there enough context to guide the investigation?

Signs of a good prompt:

  • 2-4 sentences with specific details
  • Includes all four essential elements
  • Uses exact service names from your infrastructure
  • Provides quantitative metrics
  • Gives enough context without being overly long

Signs of a poor prompt:

  • Single vague sentence
  • Uses relative terms (“recently”, “the service”, “something”)
  • No specific times or metrics
  • Missing context about what’s abnormal
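Some of these checks can be automated before you submit a prompt. The sketch below is a rough, illustrative linter, not an official validator; the vague-term list and rules are examples you can adapt:

import re

# Terms that usually signal a prompt is too vague to investigate efficiently.
VAGUE_TERMS = ("recently", "earlier today", "something", "the service", "slow")

def review_prompt(prompt: str) -> list[str]:
    """Return warnings for common signs of a weak investigation prompt."""
    warnings = []
    lowered = prompt.lower()
    if not re.search(r"\b(utc|est|pst|gmt)\b", lowered):
        warnings.append("No timezone found - include UTC/EST/PST with times.")
    if not re.search(r"\d", prompt):
        warnings.append("No numbers found - add metrics, thresholds, or times.")
    for term in VAGUE_TERMS:
        if term in lowered:
            warnings.append(f"Vague wording: '{term}' - replace with specifics.")
    return warnings

for warning in review_prompt("Something went wrong with the service recently"):
    print(warning)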

Best practices:

  1. Be specific - Use exact service names, not generic terms
  2. Include timestamps - Always specify timezone
  3. Quantify symptoms - Use actual metrics and thresholds
  4. Provide context - Mention deployments, traffic, or related changes
  5. Keep it focused - One issue per investigation
  6. Use your monitoring terms - Match names from your dashboards and alerts

Ready to start investigating? Check out the related guides in this documentation.


Pro tip: Save your best prompts as templates. When similar issues occur, you can quickly adapt proven prompts for faster, more consistent investigations.