Behavior-First Debugging: Metrics and Logs Before Code

You've been staring at the stack trace for twenty minutes. The error message is there—clear as day—but it's not telling you why it's happening. You're three levels deep into function calls, your hypothesis is shakier than it was five minutes ago, and the real problem could be anywhere: the code, the infrastructure, the data, or some combination nobody's thought of yet.
Sound familiar?
Here's the thing about debugging: we're trained to think code-first. Something breaks, we look at the code. It's intuitive, it's what we've been taught, and most of the time it feels productive. But here's what actually happens: you spend 80% of your time narrowing the scope and 20% fixing it. The narrowing part? That's where behavior-first debugging wins.
At -iNet, we've learned that the fastest path from "something's broken" to "here's the fix" isn't through the codebase. It's through the observable behavior—the metrics, the logs, the user experience. You gather data first, form a hypothesis, then validate it against the code. It sounds like an extra step. It's actually the opposite.
Let's walk through how to do this with Claude Code, the CLI tool that lets you automate and structure your debugging workflow.
Table of Contents
- The Problem with Code-First Debugging
- Step 1: Establish the Observable Behavior
- Step 2: Build a Timeline from Logs
- Step 3: Correlate Logs with Metrics
- Step 4: Form a Data-Driven Hypothesis
- Step 5: Validate the Hypothesis Against Code
- The Full Workflow in Claude Code
- Why This Matters: Real-World Impact
- Tools and Commands in Claude Code
- Common Pitfalls and How to Avoid Them
- Building Behavior-First Into Your Workflow
- Advanced: Multi-Service Debugging with Behavior-First
- Metrics You Actually Need
- Logs: Structured Is Non-Negotiable
- Advanced Multi-Layer Debugging
- Correlating Across Infrastructure Layers
- Dependency Chain Tracing
- Synthetic Transactions for Edge Cases
- When Behavior-First Hits Its Limit
- Scaling the Approach
- The Human Element
- Conclusion: Data First, Code Second
The Problem with Code-First Debugging
Before we talk about the solution, let's be explicit about why jumping straight to code hurts. When something fails in production, you have multiple possible failure points:
- The service itself might be crashing, timing out, or returning errors
- The dependencies (databases, APIs, caches) might be slow, unreachable, or returning unexpected data
- The data flowing through the system might be malformed, missing, or corrupted
- The infrastructure might be hitting resource limits, network issues, or scaling problems
- The code, finally, might actually have a bug
If you start with code, you're assuming you already know which layer failed. Most of the time, you don't. You're guessing. And guessing wrong costs you time.
Here's what happens instead: you examine a perfectly fine function for forty minutes, build a mental model of its behavior, maybe even spot a theoretical issue that's actually irrelevant. Meanwhile, the real problem was that your database queries started timing out because a table hit 10 million rows and nobody had pagination.
Behavior-first debugging flips the script. You start by answering: What actually happened when the system broke? That answer comes from metrics and logs, not from reading code.
Step 1: Establish the Observable Behavior
The first thing you need is clarity on what "broken" actually means. This seems obvious, but it's where most investigations derail. You need specifics:
- When did it start? What's the exact timestamp?
- Who noticed? Automated monitoring? A specific user? Support tickets?
- What's the symptom? Slow response? Error responses? Partial failures?
- What's the scope? All users? One region? One feature?
This is where your metrics dashboard becomes your north star. If you don't have one, this is the moment to recognize you need one. We're not talking about complex stuff—just basic time-series data:
- Request latency (p50, p95, p99)
- Error rates (5xx, 4xx, specific error codes)
- Throughput (requests per second)
- Resource utilization (CPU, memory, disk, connections)
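If you've never computed a percentile by hand, it helps to see how little data it takes to move one. Here's an illustrative sketch (made-up latency samples, not a real metrics query) that sorts raw samples and picks the 50th and 95th positions:

```shell
#!/bin/sh
# Illustrative only: compute p50/p95 from ten made-up latency samples (ms).
# In practice these come from your metrics backend; this just shows why one
# slow request out of ten is enough to drag p95 from ~200ms to 800ms.
samples="180 210 195 220 205 190 800 215 200 185"

# Unquoted on purpose so each sample becomes its own line
printf '%s\n' $samples | sort -n | awk '
  { v[NR] = $1 }
  END {
    p50 = v[int(NR * 0.50 + 0.5)]
    p95 = v[int(NR * 0.95 + 0.5)]
    printf "p50=%sms p95=%sms\n", p50, p95
  }'
# prints: p50=200ms p95=800ms
```

Note how p50 stays flat while p95 jumps. That's exactly why you track both: the median hides the tail, and the tail is where incidents live.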
Let's say you've got a monitoring tool (Prometheus, DataDog, New Relic—doesn't matter). You open the dashboard and you see this pattern:
13:45:00 - Everything normal, ~200ms p95 latency, 0.02% error rate
13:46:00 - Latency spikes to 800ms p95
13:47:00 - Latency hits 2.5 seconds, error rate climbs to 5%
13:48:00 - Partial recovery, latency back to 1.2 seconds
13:52:00 - Full recovery
That timeline is gold. It's not the code at that point—it's a pattern. Something happened between 13:45 and 13:46 that caused a shift in behavior. What happened?
With Claude Code, you can structure this investigation:
#!/bin/bash
# behavior-first-debugging-setup.sh
# Define our incident window
INCIDENT_START="2024-03-16T13:45:00Z"
INCIDENT_END="2024-03-16T13:52:00Z"
P95_BASELINE=200 # milliseconds
P95_SPIKE=800 # milliseconds
ERROR_BASELINE=0.02 # percent
ERROR_SPIKE=5 # percent
# Compute the spike multipliers up front so the summary below can reference them
MULTIPLIER=$(( P95_SPIKE / P95_BASELINE ))
ERROR_MULTIPLIER=$(awk -v s=$ERROR_SPIKE -v b=$ERROR_BASELINE 'BEGIN { printf "%.0f", s / b }')
# Query your metrics system for the incident window
# (This example uses a generic curl pattern; adapt to your metrics API)
echo "=== BEHAVIORAL SNAPSHOT ==="
echo "Incident window: $INCIDENT_START to $INCIDENT_END"
echo "Latency baseline: ${P95_BASELINE}ms → Spike: ${P95_SPIKE}ms (${MULTIPLIER}x increase)"
echo "Error rate baseline: ${ERROR_BASELINE}% → Spike: ${ERROR_SPIKE}% (${ERROR_MULTIPLIER}x increase)"
echo ""
echo "Pattern: Gradual degradation starting at 13:46, recovery after 13:52"
echo "Hypothesis: Resource constraint or dependency slowdown"You haven't looked at any code yet. You have a timeline, a pattern, and a starting hypothesis. That's the foundation.
Step 2: Build a Timeline from Logs
With your metrics pattern established, you now filter logs to create a detailed timeline. This is crucial—logs are verbose, chaotic, and full of noise. You're not looking for all logs; you're looking for logs that fall within your incident window and correlate with the behavior change.
Here's the key insight: logs should confirm or refute your hypothesis, not generate it.
With Claude Code, you can automate log filtering across multiple sources:
#!/bin/bash
# log-filtering-incident.sh
INCIDENT_START_UNIX=1710599100 # 2024-03-16T13:45:00Z in Unix time
INCIDENT_END_UNIX=1710599520 # 2024-03-16T13:52:00Z in Unix time
# Filter application logs for errors/warnings in the incident window
echo "=== APPLICATION LOGS (Errors & Warnings) ==="
grep -E "ERROR|WARN" /var/log/app.log | \
awk -v start=$INCIDENT_START_UNIX -v end=$INCIDENT_END_UNIX \
'$2 >= start && $2 <= end' | \
head -50
# Filter database logs for slow queries (>1 second) in the incident window
echo ""
echo "=== DATABASE LOGS (Slow Queries >1s) ==="
grep "duration:" /var/log/database.log | \
awk -v start=$INCIDENT_START_UNIX -v end=$INCIDENT_END_UNIX \
'$2 >= start && $2 <= end && $5 > 1000' | \
sort -k5 -rn | \
head -20
# Filter system logs for resource exhaustion signals
echo ""
echo "=== SYSTEM LOGS (Resource Events) ==="
grep -E "OOM|disk full|connection refused|timeout" /var/log/syslog | \
awk -v start=$INCIDENT_START_UNIX -v end=$INCIDENT_END_UNIX \
'$2 >= start && $2 <= end'
# Filter deployment/config logs to see if anything changed
echo ""
echo "=== DEPLOYMENT/CONFIG CHANGES ==="
grep -E "deploy|config|restart|upgrade" /var/log/deployment.log | \
awk -v start=$INCIDENT_START_UNIX -v end=$INCIDENT_END_UNIX \
'$2 >= start && $2 <= end'

Notice the structure: we're filtering down, not reading everything. We're looking for signals that correlate with the behavior change. If your metrics showed latency spiking at 13:46, you're looking for logs between 13:45:30 and 13:46:30 that explain why.
A concrete example: what if the logs show this?
13:45:45 [WARN] Database pool at 95% capacity (95/100 connections)
13:46:02 [ERROR] Failed to acquire connection: pool exhausted, timeout after 30s
13:46:03 [ERROR] Failed to acquire connection: pool exhausted, timeout after 30s
13:46:04 [ERROR] Failed to acquire connection: pool exhausted, timeout after 30s
... (repeats every second for 6 minutes)
13:52:15 [INFO] Database connection released, pool at 40% capacity
13:52:16 [INFO] Database pool recovered, normal operations resumed
Boom. That's your answer. It's not in the code; it's in the infrastructure. A query somewhere started holding connections too long, and the pool couldn't handle it. Now you've got a hypothesis grounded in real data: a specific query is not releasing database connections in time.
That's worth infinitely more than "the code must be wrong somewhere."
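One way to turn "the logs show this" into a defensible claim is to count how often each error signature repeats. A minimal sketch, using a hypothetical log sample; point the same pipeline at your real files:

```shell
#!/bin/sh
# Illustrative: collapse error lines into signatures and count repeats.
# The sample log is invented; swap /tmp/incident.log for your real file.
cat << 'EOF' > /tmp/incident.log
13:46:02 [ERROR] Failed to acquire connection: pool exhausted, timeout after 30s
13:46:03 [ERROR] Failed to acquire connection: pool exhausted, timeout after 30s
13:46:04 [ERROR] Failed to acquire connection: pool exhausted, timeout after 30s
13:46:10 [ERROR] User 4521 uploaded malformed avatar
EOF

# Strip the timestamp (field 1) so identical errors collapse, then count
cut -d' ' -f2- /tmp/incident.log | sort | uniq -c | sort -rn
```

The repeated "pool exhausted" signature rises to the top; the one-off avatar error sinks. A signal that repeats consistently during the window is a pattern; a signal that appears once is probably noise.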
Step 3: Correlate Logs with Metrics
Now you've got a timeline. The next step is to confirm that your log patterns match your metric patterns. They should. If they don't, your investigation was wrong somewhere.
Here's a validation pattern:
#!/bin/bash
# correlate-logs-metrics.sh
# Extract error count from logs in 1-minute intervals
echo "=== ERROR TIMELINE FROM LOGS ==="
for minute in 13:{45..52}; do
log_errors=$(grep "ERROR" /var/log/app.log | \
grep "$minute:" | wc -l)
echo "$minute:00 - $log_errors errors"
done
# Compare to metric data (from your dashboard export or API)
echo ""
echo "=== ERROR RATE TIMELINE FROM METRICS ==="
# This would typically come from querying your metrics API
# For demonstration, assume you've exported this data:
cat << 'EOF'
13:45:00 - 0.02% error rate (~1 error per 5000 requests)
13:46:00 - 1.2% error rate (~60 errors per 5000 requests)
13:47:00 - 5.0% error rate (~250 errors per 5000 requests)
13:48:00 - 2.1% error rate (~105 errors per 5000 requests)
13:52:00 - 0.05% error rate (~2 errors per 5000 requests)
EOF
echo ""
echo "=== VALIDATION ==="
echo "Do the error logs increase at the same time as the metrics?"
echo "Do the error logs decrease at the same time as the metrics?"
echo "If YES to both: high confidence in our hypothesis"
echo "If NO to either: need to dig deeper"This step is about building confidence. If your logs tell a different story than your metrics, you've found something important: maybe your logging is missing data, or maybe there's a lag in how metrics are collected. Either way, it's worth investigating.
Step 4: Form a Data-Driven Hypothesis
By now, you should be able to write down a concrete hypothesis about what happened. Not a guess—a hypothesis grounded in metrics and logs:
Hypothesis: Between 13:45 and 13:46, a specific query (SELECT * FROM orders WHERE created_at > NOW() - INTERVAL 1 DAY) started executing on every request instead of being cached. This query is slow (2-3 seconds) because the orders table is unindexed on the created_at column. Each slow query holds a database connection for 3 seconds. The connection pool (100 connections) was exhausted within one minute, causing all subsequent requests to fail to acquire a connection within the timeout window (30 seconds), resulting in cascading timeouts and errors across the system. The issue resolved at 13:52 when an upstream change (likely a deploy rollback or config change) stopped executing the problematic query.
Notice what's in this hypothesis:
- The specific behavior: A query executing on every request
- Why it happened: Something changed between 13:45 and 13:46
- The failure mechanism: Slow query → connection exhaustion → cascading failures
- The recovery: A change at 13:52
And notice what's NOT in this hypothesis:
- "There's a bug in the code" (too vague)
- "The code is slow" (not specific enough)
- Assumptions about which function is broken (we don't know yet)
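Writing the hypothesis down is easier when you have a fixed shape to fill in. Here's one possible template, written as a heredoc so it can live in your incident tooling. The field names are our own suggestion, not a standard:

```shell
#!/bin/sh
# A suggested shape for a written-down hypothesis (field names are a
# convention we find useful, not a standard). The "Falsified if" line is
# what keeps the validation step honest.
cat << 'EOF' > /tmp/hypothesis.md
## Incident hypothesis
- Window:       2024-03-16T13:45:00Z to 13:52:00Z
- Behavior:     p95 200ms -> 2.5s, error rate 0.02% -> 5%
- Mechanism:    slow query -> pool exhaustion -> cascading timeouts
- Evidence:     "pool exhausted" errors repeat every second for 6 minutes
- Recovery:     pool back to 40% at 13:52:15, suggesting an upstream change
- Falsified if: pool utilization stayed below 80% during the window
EOF

cat /tmp/hypothesis.md
```

An empty field means the hypothesis isn't ready yet, and that's useful information too: it tells you which data to go gather next.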
Step 5: Validate the Hypothesis Against Code
Now—finally—you look at the code. But you're not looking at everything. You're looking for one specific thing: is there a place where a query like SELECT * FROM orders WHERE created_at > NOW() - INTERVAL 1 DAY is executed?
#!/bin/bash
# hypothesis-validation-code-search.sh
echo "=== SEARCHING FOR THE PROBLEMATIC QUERY ==="
grep -r "SELECT.*FROM orders" src/ | grep -i "created_at"
echo ""
echo "=== SEARCHING FOR UNCACHED QUERIES ==="
grep -r "db.query\|Database.execute" src/ | \
grep -v "cache\|memoize\|@Cacheable"
echo ""
echo "=== SEARCHING FOR RECENT CHANGES (Last 24 hours) ==="
git log --since="24 hours ago" --oneline -- src/
echo ""
echo "=== CHECKING FOR DEPLOY/CONFIG CHANGES AT 13:45 ==="
git log --before="2024-03-16T13:47:00Z" --after="2024-03-16T13:44:00Z"You're searching for specific evidence, not just reading code. This is much faster. If you find the query, you've confirmed your hypothesis. If you don't find it, you've learned something important: maybe the problematic behavior is happening at a different layer (middleware, proxy, cache, etc.).
If you do find the query, the next step is lightweight: was it added or modified in the last deployment? What changed?
#!/bin/bash
# show-recent-changes.sh
# Find the file containing the problematic query
QUERY_FILE=$(grep -r "SELECT.*FROM orders" src/ | grep "created_at" | cut -d: -f1 | head -1)
# Show the recent changes to that file
echo "=== RECENT CHANGES TO $QUERY_FILE ==="
git log -p --since="7 days ago" -- "$QUERY_FILE" | head -100
# Show the current state
echo ""
echo "=== CURRENT STATE OF THE PROBLEMATIC SECTION ==="
grep -B5 -A5 "SELECT.*FROM orders.*created_at" "$QUERY_FILE"Now you've connected the dots: the metrics showed a problem, the logs confirmed the pattern, the code validated the mechanism. You know exactly what broke and why.
The Full Workflow in Claude Code
In practice, you'd structure all of this as a reusable Claude Code workflow. Here's what that might look like:
#!/bin/bash
# full-behavior-first-debug.sh
set -e
# Phase 1: Establish the incident
echo "Phase 1: Gathering incident details..."
INCIDENT_START=$1
INCIDENT_END=$2
SERVICE=$3
if [ -z "$INCIDENT_START" ] || [ -z "$INCIDENT_END" ] || [ -z "$SERVICE" ]; then
echo "Usage: $0 <start-time> <end-time> <service-name>"
echo "Example: $0 '2024-03-16T13:45:00Z' '2024-03-16T13:52:00Z' 'orders-service'"
exit 1
fi
# Phase 2: Query metrics
echo ""
echo "Phase 2: Querying metrics dashboard..."
echo "Looking for anomalies in latency, error rate, and throughput..."
# (Query your metrics API here)
# Phase 3: Filter and correlate logs
echo ""
echo "Phase 3: Filtering logs for incident window..."
echo "Correlating with metric anomalies..."
# (Filter logs here)
# Phase 4: Build hypothesis
echo ""
echo "Phase 4: Forming hypothesis..."
echo "Based on metrics and logs, the most likely cause is:"
# (Summarize findings)
# Phase 5: Validate against code
echo ""
echo "Phase 5: Validating hypothesis against code..."
# (Search for specific patterns in code)
echo ""
echo "Investigation complete. Hypothesis: [SUMMARY]"The beauty of this approach is that it's repeatable. Every incident follows the same path: metrics → logs → hypothesis → code. You're building muscle memory around a systematic approach, not flailing around hoping to find the problem.
Why This Matters: Real-World Impact
Let me give you a concrete example from a system we worked with. A production API was occasionally returning 500 errors on a specific endpoint. The code looked fine—the logic was straightforward, no obvious bugs. The team spent a week adding logging and trying different fixes in code. Nothing helped.
When we switched to behavior-first debugging, here's what we found in thirty minutes:
- Metrics showed: 500 errors happened in bursts, always around the top of the hour
- Logs showed: Every 500 error was preceded by "connection lost to Redis"
- Code search revealed: A cache warming job was scheduled to run at :00 on every hour, and it was synchronous—blocking all requests until it completed
- Validation: The Redis instance sometimes took longer than the request timeout to respond, causing the whole endpoint to fail
The fix had nothing to do with the endpoint code. It was moving the cache warming to an async background job and adding a timeout. We would have never found it by reading the endpoint code because the problem wasn't there.
This is the hidden power of behavior-first debugging: it systematically eliminates layers of investigation. You don't waste time on the code until you've proven the problem is actually in the code.
Tools and Commands in Claude Code
When using Claude Code for behavior-first debugging, keep these commands in mind:
- Metrics queries: Use CLI tools to fetch from your metrics API (Prometheus, DataDog, CloudWatch)
- Log aggregation: Tools like jq, awk, grep to parse and filter structured logs
- Timeline visualization: Generate simple ASCII timelines to correlate events
- Git history: git log to map code changes to incident timeframe
- System inspection: journalctl, syslog, kernel logs for infrastructure-level events
- Request tracing: curl, tcpdump, distributed tracing tools for following requests through layers
The principle is: use tools to gather data, use data to form hypotheses, use code to validate hypotheses.
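As a concrete instance of that principle, here's what the gather-data step looks like against JSON logs with jq. The field names (ts, level, msg) and the sample lines are assumptions; match them to your own schema:

```shell
#!/bin/sh
# Illustrative: filter JSON logs to ERRORs inside the incident window.
# Field names (ts, level, msg) are assumptions about your log schema.
cat << 'EOF' > /tmp/app.jsonl
{"ts":1710599090,"level":"INFO","msg":"request ok"}
{"ts":1710599170,"level":"ERROR","msg":"pool exhausted"}
{"ts":1710599171,"level":"ERROR","msg":"pool exhausted"}
{"ts":1710599600,"level":"ERROR","msg":"after the window"}
EOF

# 1710599100..1710599520 is the 13:45-13:52 incident window in Unix time
jq -c 'select(.level == "ERROR" and .ts >= 1710599100 and .ts <= 1710599520)' \
  /tmp/app.jsonl
```

Two lines come back: the errors inside the window, with the pre-incident INFO and the post-recovery error filtered out. No grep-and-squint required.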
Common Pitfalls and How to Avoid Them
Pitfall 1: Assuming "weird log entry" means "bug in code"
Logs are full of noise. A single WARN or ERROR doesn't necessarily indicate the root cause. You need patterns—multiple correlated signals. If there's one "connection timeout" message in a sea of successful logs, it's probably not the root cause. Look for signals that repeat consistently during the incident window.
Pitfall 2: Trusting logs from a downstream system too much
If service A calls service B, and service B logs "received invalid request from A," that's information about B's perception, not proof that A is sending something wrong. Always correlate with the other side's logs. Service A might have received a timeout or an error from B that prevented the request from being completed. One-sided log evidence is incomplete.
Pitfall 3: Missing the lag in metrics collection
Most metrics systems have 5-30 second latency. If your logs show an event at 13:46:00, the metrics spike might not show up until 13:46:15. Account for this lag when building timelines. Some systems have even worse latency: aggregated metrics can lag by minutes on some platforms.
Pitfall 4: Jumping to code before exhausting logs and metrics
This is the hardest habit to break. It feels productive to read code. It feels like you're making progress. But if you haven't narrowed the scope down to a specific component, you're probably wasting time. Stay disciplined. Commit to 15-20 minutes of metrics and log analysis before opening any code files.
Pitfall 5: Not documenting your hypothesis before looking at code
Write it down. "Here's what I think happened and here's why." This locks you into being scientific about validation instead of just confirming your assumption. It also gives you something to reference when debugging gets confusing—you can reread your original hypothesis and check if you're still on track.
Building Behavior-First Into Your Workflow
To make this systematic, establish a protocol:
- Incident detected: Pull the metrics, establish the timeline (5 minutes max)
- Hypothesis formation: Filter logs, correlate with metrics, write down your hypothesis (10 minutes max)
- Scope narrowing: Identify which component/layer is implicated (5 minutes max)
- Code validation: Search for specific patterns, validate the hypothesis (10 minutes max)
- Root cause confirmation: One last check—is there anything in logs/metrics that contradicts this? (5 minutes max)
Total time to root cause: under 45 minutes for most incidents. Compare that to the alternative of randomly reading code and hoping. Teams that follow this protocol report finding root causes 5-10x faster than their previous approach.
Advanced: Multi-Service Debugging with Behavior-First
Real systems are rarely single-service. When you've got microservices, distributed traces, and multiple layers of infrastructure, behavior-first debugging becomes even more critical. Here's why: in a distributed system, the "broken" service is often not the one where the problem originated.
Let's say you've got this architecture:
User Request → API Gateway → Order Service → Payment Service → External Processor
↓ ↓
└──────→ Redis Cache ←───────┘
(shared)
A request is slow. You could investigate the Order Service, the Payment Service, the External Processor, Redis, or the API Gateway. Which one is the real culprit?
With behavior-first debugging, you don't guess. You trace the behavior backwards:
- API Gateway metrics: Show increased latency across all services
- Per-service breakdown: Order Service p95 is normal (200ms), but requests are waiting 3+ seconds total
- Log correlation: Logs show requests are queuing at Redis, waiting for locks
- Redis inspection: Shows one key is locked by a slow write operation
- Code search: Find where that write operation happens
Now you know: the Payment Service is holding a Redis lock too long. The problem isn't the Order Service; it's a dependency chain issue. You wouldn't have found this by reading the Order Service code because the problem isn't there.
Claude Code is perfect for this kind of distributed tracing. You can script:
#!/bin/bash
# distributed-behavior-debug.sh
# Query metrics from each service
echo "=== SERVICE LATENCY BREAKDOWN ==="
services=("api-gateway" "order-service" "payment-service" "redis")
for service in "${services[@]}"; do
p95=$(curl -s "http://metrics-api/query?service=$service&metric=latency_p95" | jq .value)
echo "$service: ${p95}ms"
done
# Find which service's logs show the most errors in the incident window
echo ""
echo "=== ERROR CONCENTRATION BY SERVICE ==="
for service in "${services[@]}"; do
error_count=$(grep "ERROR" /var/log/$service.log | \
awk -v start=$INCIDENT_START_UNIX -v end=$INCIDENT_END_UNIX \
'$2 >= start && $2 <= end' | wc -l)
echo "$service: $error_count errors"
done
# Examine distributed traces for the slow request
echo ""
echo "=== DISTRIBUTED TRACE ANALYSIS ==="
# Query your trace backend (Jaeger, Zipkin, etc.)
curl -s "http://traces-api/trace?request_id=xyz" | jq '.spans[] | {service, duration_ms}'This gives you a cross-service view without reading any code. You see the pattern first, then drill in. Each service's metrics tell you where time is being spent. Each service's logs tell you what went wrong. Distributed traces show you the exact call sequence.
Metrics You Actually Need
Not all metrics are equal. For behavior-first debugging, focus on these key ones:
Request-Level Metrics (the top of your funnel):
- Latency percentiles (p50, p95, p99, p99.9) — tells you how users experience your system
- Error rate (by status code) — identifies failure rates and categories
- Throughput (requests per second) — detects saturation
- Request volume — useful for detecting rapid changes in load
Resource Metrics (the foundation):
- CPU utilization — when it hits 100%, everything slows down
- Memory utilization — out-of-memory scenarios cause crashes
- Disk I/O (read/write operations and latency) — disk saturation kills performance
- Network throughput and packet loss — infrastructure problems manifest here
Dependency Metrics (where things get interesting):
- Database query latency (p50, p95) — slow queries cascade upstream
- Connection pool utilization — exhaustion causes failures
- Cache hit rate — misses force expensive operations
- Queue length (if you use message queues) — growing queues indicate processing is slower than ingestion
- External API latency and error rates — upstream problems affect your system
Application-Specific Metrics (the secret sauce):
- Queue size (if processing jobs) — buildup indicates bottlenecks
- Batch processing time — identifies slow batch jobs
- Cache eviction rate — indicates cache pressure
- Lock contention (if you track this) — shows resource contention
- Feature flag evaluation latency — slow feature flags affect all requests
The mistake most teams make is collecting tons of metrics but not the useful ones. You don't need to know CPU to the decimal point; you need to know when CPU hits 90% and things start failing. You don't need per-second database metrics; you need to know when slow queries suddenly increase.
Start with the top funnel metrics: latency, error rate, throughput. Everything else should ladder down from those. Once you have those three working reliably, add dependency metrics. Once you have those, add application-specific metrics. Build incrementally; don't collect everything at once.
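To make "start with the top funnel" concrete: all three numbers can be derived from a plain access log in one awk pass. The two-column format (status, latency in ms) is hypothetical; adapt the field positions to your log layout:

```shell
#!/bin/sh
# Illustrative: derive throughput, average latency, and error rate from an
# access log. The two-column format (status latency_ms) is hypothetical.
cat << 'EOF' > /tmp/access.log
200 180
200 210
500 2400
200 195
500 2600
200 205
EOF

awk '
  { total++; sum += $2; if ($1 >= 500) errors++ }
  END {
    printf "requests=%d avg_latency=%.0fms error_rate=%.1f%%\n",
           total, sum / total, 100 * errors / total
  }' /tmp/access.log
# prints: requests=6 avg_latency=965ms error_rate=33.3%
```

Two slow 500s out of six requests are enough to wreck the average, which is the whole argument for starting here before anything fancier.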
Logs: Structured Is Non-Negotiable
This is where many debugging sessions go sideways: unstructured logs. If your logs are free-form text, you're going to waste half your time parsing them.
With Claude Code, you can validate your logging structure:
#!/bin/bash
# validate-log-structure.sh
# Check: Are logs JSON or structured text?
# Check: Are logs JSON or structured text?
head -20 /var/log/app.log | grep -qE '^\{' && echo "✓ JSON logs detected" || echo "✗ Unstructured logs"
# Check: Do logs have timestamps?
head -20 /var/log/app.log | grep -qE '[0-9]{4}-[0-9]{2}-[0-9]{2}' && echo "✓ ISO 8601 timestamps" || echo "⚠ Check timestamp format"
# Check: Do logs have request IDs for tracing?
grep -qE 'request_id|trace_id|correlation_id' /var/log/app.log && echo "✓ Request IDs present" || echo "✗ No request IDs"
# Check: Do logs have severity levels?
grep -qE '\[ERROR\]|\[WARN\]|\[INFO\]' /var/log/app.log && echo "✓ Severity levels present" || echo "⚠ Missing severity"
# Check: Do logs include context (service, version, environment)?
grep -qE 'service|version|environment' /var/log/app.log && echo "✓ Context fields present" || echo "⚠ Missing context"

If your logs aren't structured, make that your first investment. Unstructured logs are a debugging tax: you pay it every single time something breaks. Structured logs (JSON preferred) let you query, filter, and correlate programmatically. They make behavior-first debugging possible.
Advanced Multi-Layer Debugging
Correlating Across Infrastructure Layers
Real systems have multiple layers: load balancers, application servers, databases, caches, message queues. When something fails, the problem might manifest in one layer but originate in another.
Here's a strategy for systematic cross-layer investigation:
#!/bin/bash
# multi-layer-correlation.sh
# Define the incident window
INCIDENT_START="2024-03-16T13:45:00Z"
INCIDENT_END="2024-03-16T13:52:00Z"
echo "=== LAYER 1: Load Balancer / Ingress ==="
echo "Checking request distribution and errors..."
kubectl logs deployment/nginx-ingress --since=1h | grep "2024-03-16T13:4[567]" | \
awk '{print $1, $9}' | sort | uniq -c | sort -rn
echo ""
echo "=== LAYER 2: Application Server ==="
echo "Checking application-level errors..."
kubectl logs deployment/api-service --since=1h | grep -E "ERROR|exception" | \
grep "2024-03-16T13:4[567]" | head -10
echo ""
echo "=== LAYER 3: Database ==="
echo "Checking slow queries and connection issues..."
kubectl logs deployment/postgres --since=1h | grep -E "duration|connection" | \
grep "2024-03-16T13:4[567]" | tail -20
echo ""
echo "=== LAYER 4: Cache / Session Store ==="
echo "Checking Redis / memcached operations..."
kubectl logs deployment/redis-instance --since=1h | grep -E "SLOWLOG|eviction" | \
grep "2024-03-16T13:4[567]"
echo ""
echo "=== LAYER 5: Message Queue ==="
echo "Checking queue depth and processing lag..."
kubectl exec deployment/queue-consumer -- queue-stats --since=1h | \
grep "2024-03-16T13:4[567]"
echo ""
echo "=== LAYER 6: Dependencies / External Services ==="
echo "Checking outbound request failures..."
curl -s http://datadog-api/query?service=external-api \
"avg:trace.http.request.duration{service:external} over incident_window" | jq .The key insight: each layer produces different telemetry. Load balancers see HTTP status codes. Databases see query times. Caches see hit rates. By collecting data from all layers, you build a complete picture.
Dependency Chain Tracing
For distributed systems, following a single request through multiple services is invaluable. Modern platforms support distributed tracing (Jaeger, Zipkin). Here's how to use it effectively:
#!/bin/bash
# trace-request-through-system.sh
# Find a request that failed during the incident
REQUEST_ID=$(curl -s "http://logs/query?time=13:46:00&status=500" | jq -r '.results[0].trace_id')
echo "Tracing request $REQUEST_ID through the system..."
# Get the complete trace
curl -s "http://jaeger-api/traces/$REQUEST_ID" | jq '.data.traces[0].spans[] | {
service: .process.serviceName,
operation: .operationName,
duration_ms: .duration / 1000,
status: .tags[] | select(.key == "http.status_code") | .value
}'

This shows you the exact path a request took through your system, how long it spent in each service, and where it failed. If a request spent 30 seconds in Service B when it should take 100ms, you've found the bottleneck.
Synthetic Transactions for Edge Cases
When real traffic doesn't reproduce a bug, synthetic transactions help. These are scripted requests designed to trigger specific code paths:
#!/bin/bash
# synthetic-transaction-test.sh
# Test 1: Authentication edge case
echo "Test 1: Slow authentication..."
time curl -X POST http://localhost:3000/auth/login \
-H "Content-Type: application/json" \
-d '{"username":"testuser","password":"longpassword1234567890"}'
# Test 2: Large payload handling
echo "Test 2: Large payload..."
python3 -c "import json; print(json.dumps({'data': 'x' * 1000000}))" | \
curl -X POST http://localhost:3000/api/process \
-H "Content-Type: application/json" \
-d @-
# Test 3: Concurrent requests
echo "Test 3: Concurrent load..."
for i in {1..100}; do
curl -s http://localhost:3000/api/status &
done
wait
# Test 4: Network partition simulation
echo "Test 4: Degraded dependencies..."
# Use iptables or tc (traffic control) to simulate latency
tc qdisc add dev eth0 root netem delay 5000ms
curl -s http://localhost:3000/api/external-dependency
tc qdisc del dev eth0 rootThese synthetic tests let you reproduce issues consistently, rather than waiting for real traffic to trigger them.
When Behavior-First Hits Its Limit
Behavior-first debugging is powerful, but it's not magic. There are edge cases where it struggles:
Intermittent bugs: If something fails once every 10,000 times, you might not see it in metrics. The p99 latency might be normal, the error rate might be imperceptible. In this case, you might need to dive into code earlier, but you should still look for patterns in logs first. Maybe you can correlate the few failures you do see with specific conditions.
Silent failures: If your system silently corrupts data without erroring, you won't see it in metrics. This is where data validation checks and integration tests matter more than metrics. You need to detect silent failures through explicit validation, not through monitoring.
Cascading failures: Sometimes the root cause happens at 14:00 but the symptoms don't appear until 14:47. Behavior-first debugging still helps, but you need to look further back in time. Extend your incident window backward and look for events that could have triggered the cascade.
Resource contention: If two services are competing for the same resource (CPU, disk, bandwidth), metrics might show both services are fine individually. You need to correlate metrics from different layers. Look at resource saturation at the host level, not just application level.
For these edge cases, behavior-first is still your starting point. You're just looking at logs and metrics from a wider time window, or correlating across more layers.
Scaling the Approach
What does behavior-first debugging look like at scale? With hundreds of services and thousands of metrics, you can't manually check everything.
The solution is anomaly detection. Instead of waiting for an alert, tools can automatically detect when metrics deviate from their baseline:
#!/bin/bash
# anomaly-detection-baseline.sh
# Establish a baseline over the last 30 days and compare the last hour against it
BASELINE_WINDOW="30d"
CURRENT_WINDOW="1h"

# For each key metric, compute the baseline (e.g., average over 30 days)
for metric in latency_p95 error_rate throughput; do
  baseline=$(curl -s "http://metrics-api/query?metric=$metric&range=$BASELINE_WINDOW" | jq '.avg')
  current=$(curl -s "http://metrics-api/query?metric=$metric&range=$CURRENT_WINDOW" | jq '.avg')

  # Percentage deviation from baseline; strip the sign so drops count too
  deviation=$(echo "scale=2; ($current - $baseline) / $baseline * 100" | bc)
  deviation=${deviation#-}

  # Alert if the deviation exceeds 20% in either direction
  if (( $(echo "$deviation > 20" | bc -l) )); then
    echo "ANOMALY: $metric deviated $deviation% from baseline"
  fi
done

With this, your monitoring system can automatically surface potential incidents, and you can run behavior-first debugging on them. No more waiting for a customer to complain or spotting problems during a routine dashboard review.
The Human Element
Here's something that doesn't get discussed enough: behavior-first debugging is as much about how you think as it is about tools.
It requires discipline. When you're paged for a production incident, your instinct is to do something—anything. Jumping straight to code feels actionable. Sitting with logs and metrics for five minutes feels like wasting time.
But those five minutes of careful analysis save you thirty minutes of wrong fixes.
At -iNet, we've trained teams on this mindset shift by making it a ritual:
- Declare the incident: "Something is wrong. Here's what we observe."
- Resist the urge: Don't jump to code yet.
- Gather data: Metrics and logs first.
- Write the hypothesis: Explicitly. Make it falsifiable.
- Validate: Only then look at code.
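One way teams make steps 4 and 5 concrete is a scaffold that must be filled in before anyone opens an editor. This template is a sketch, not an -iNet artifact; the file name and section headings are invented:

```shell
# Hypothetical scaffold for the ritual: forces the hypothesis to be written
# down, and to be falsifiable, before anyone touches the code.
cat > incident-hypothesis.md <<'EOF'
# Incident: <one-line summary of observed behavior>
## Observed (metrics/logs, with timestamps)
-
## Hypothesis (falsifiable: what evidence would disprove it?)
-
## Validation plan (which code/config to check, in what order)
-
EOF

grep -c '^##' incident-hypothesis.md   # three sections to fill in
```

The "what would disprove it?" prompt is the important part: a hypothesis you can't falsify is a hunch, not a hypothesis.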
After a few incidents, teams stop jumping to code. They've seen how much faster the data-first approach is. The mental shift is real: you stop thinking of debugging as "find the bug in the code" and start thinking of it as "find what changed to cause this behavior."
Conclusion: Data First, Code Second
The shift from code-first to behavior-first debugging is subtle in theory but profound in practice. You're not abandoning code analysis; you're sequencing it correctly. Metrics and logs narrow the search space. Your hypothesis points you to specific code. You validate the hypothesis against the code.
This approach works because it respects the reality of systems: most failures are not in the code logic—they're in the interaction between code, infrastructure, data, and timing. Those interactions only show up in metrics and logs.
The next time something breaks in production, take a breath. Open your metrics dashboard. Build a timeline. Read the relevant logs. Write down your hypothesis. Then look at the code.
You'll find the problem faster, fix it more confidently, and build a stronger understanding of how your system actually behaves.
That's the power of behavior-first debugging. And with Claude Code automating the data gathering and correlation, it's faster than ever to implement.
The Cultural Shift: Moving Your Team to Behavior-First
Adopting behavior-first debugging as a team practice requires more than just knowing the technique. It requires shifting how your organization thinks about incidents. This is a cultural change, and culture changes slowly.
The biggest resistance you'll face comes from the "jumping to code" instinct. Developers are trained to solve problems by reading and modifying code. It feels productive. It feels like progress. Sitting with logs and metrics for fifteen minutes feels passive, like stalling.
The antidote is evidence. After your team successfully debugs one or two incidents using behavior-first, the evidence becomes irrefutable. They'll say things like: "We would have spent three hours poking around the codebase. This metrics-first approach took forty minutes." Once people see that evidence, they get it.
Another barrier: tooling. If your team doesn't have good observability tooling—if logs are unstructured and there's no metrics dashboard—behavior-first feels impossible. Invest in this first. Make sure you have:
- A metrics system (Prometheus, DataDog, New Relic, or cloud-native equivalent) that covers your key services
- Structured logging (JSON format, with request IDs for tracing)
- A log aggregation system (ELK stack, Datadog, CloudWatch, Splunk)
- Access for all developers to these systems
- Documentation on how to query them
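To show what the "structured, with request IDs" item buys you (the field names `ts`, `request_id`, `level`, and `msg` here are assumptions, not a required schema): tracing one failing request becomes a one-line query instead of regex archaeology.

```shell
# Hypothetical JSON-lines service log with a request ID on every entry.
cat > service.log <<'EOF'
{"ts":"14:02:11","request_id":"req-81f3","level":"error","msg":"upstream timeout"}
{"ts":"14:02:11","request_id":"req-81f4","level":"info","msg":"ok"}
{"ts":"14:02:12","request_id":"req-81f3","level":"error","msg":"retry exhausted"}
EOF

# Pull every log line for one failing request, across the whole file:
jq -r 'select(.request_id == "req-81f3") | "\(.ts) \(.level) \(.msg)"' service.log
```

With unstructured logs, the same trace requires hand-written patterns per log format; with structured logs, it's the same `select` for every service.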
Without these tools, you can't do behavior-first debugging effectively. But once they're in place, the behavior-first approach becomes almost inevitable. The data is just too useful to ignore.
Finally, there's the training piece. Your team needs to know what metrics and logs to look for, and they need to practice the workflow. Run a "debug simulation" during a team meeting. Create a fake incident (spike the error rate for a minute, trigger some specific error messages) and have your team use behavior-first to track it down. This builds confidence and surfaces questions in a safe environment.
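A safe way to stage such a drill, sketched here without touching any real service: generate a practice log with an injected error spike and let the team find the pattern. The file name and error rate are arbitrary choices for the exercise.

```shell
# Sketch of a "fake incident" for a debug drill: a minute of synthetic
# traffic with an injected error spike, written to a practice log.
: > drill.log
for i in $(seq 1 50); do
  if [ $((i % 5)) -eq 0 ]; then
    echo "{\"level\":\"error\",\"msg\":\"injected failure $i\"}" >> drill.log
  else
    echo "{\"level\":\"info\",\"msg\":\"request $i ok\"}" >> drill.log
  fi
done

# The team's task: spot the 20% error rate and work out what the failures share.
errors=$(grep -c '"level":"error"' drill.log)
echo "error rate: $errors/50"
```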
Once your culture embraces behavior-first debugging, you'll notice a subtle but important shift: people become more thoughtful about observability as they build features. They ask questions like "What metrics should I emit to make this visible?" and "What would debugging this look like?" This mindset shift—treating observability as a first-class concern during development—is maybe the biggest win of all. It creates a virtuous cycle: better observability leads to faster debugging, which leads to more investment in observability, which makes debugging even faster.
Real-World Metrics That Matter
The metrics we discussed earlier are necessary but not sufficient. Different types of systems need different metrics. Here are domain-specific metrics that often reveal root causes:
For API Services: Beyond latency and error rate, track endpoint-specific latency (some endpoints are always slow, and that's expected), request size (sometimes payload bloat causes slowness), and response time percentiles by status code (200s might be fast while 500s are slow because of error-handling overhead).
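As a toy illustration of percentiles split by status code (the log format "status latency_ms" and the sample values are invented), even a rough median per group exposes the "200s fast, 500s slow" pattern:

```shell
# Hypothetical access log: "status latency_ms", one request per line.
cat > access.log <<'EOF'
200 12
200 15
500 940
200 11
500 1020
EOF

# Rough median per status code: sort each group's latencies, take the middle value.
for code in 200 500; do
  mid=$(awk -v c="$code" '$1 == c {print $2}' access.log | sort -n \
        | awk '{a[NR] = $1} END {print a[int((NR + 1) / 2)]}')
  echo "status $code median latency: ${mid}ms"
done
```

A blended latency metric over this log would average the two populations together and hide the fact that only the error path is slow.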
For Databases: Query count, slow query frequency (not just individual slow queries), connection wait time, lock contention, and transaction duration percentiles. A database that's thrashing on locks looks different from one with bad query plans.
For Async Workers: Queue depth, processing latency percentiles, dead-letter queue size, and batch processing time if applicable. A queue that's growing means workers are slower than ingestion. A growing dead-letter queue means errors are accumulating.
For Frontend Applications: Time to interactive, paint timing, JavaScript execution time, network request waterfalls. These metrics tell you if the problem is rendering, computation, or network. Tracking Core Web Vitals has become essential.
For Scheduled Jobs/Cron Tasks: Execution time percentiles, failure rate, and last successful run timestamp. A job that ran fine yesterday but failed today tells you something changed in the environment, not the code.
The key is to measure the things that matter for your specific system. Generic metrics are useful, but domain-specific metrics are where behavior-first debugging becomes truly powerful. They let you see patterns that wouldn't show up in generic monitoring.
-iNet