From Prompts to Pipelines: Automating Tasks with Claude Code Agents

What if you could hand off complex, multi-step workflows to AI and have them reliably execute, make decisions, and adapt, all without micromanaging every single step? That's what agents do. And if you're tired of writing the same prompts over and over, orchestrating agents is your next level up.
Most people think of Claude as a text-in, text-out tool. Ask a question, get an answer. Done. But agents flip the script. They reason through problems, take actions, observe the results, and iterate. They spawn specialized subagents to handle specific tasks. They coordinate with external tools via MCP. They can chain together entire workflows that would take you hours to manually execute.
In this article, we're diving into the deep end. We're building production-grade multi-agent systems using Claude Code, and we're doing it right, with isolation, error handling, quality gates, and the kind of reliability that actually holds up in the real world.
Table of Contents
- The ReAct Pattern: How Agents Actually Think
- Built-In Agents: Your Foundation
- Creating Custom Subagents: The Real Power
- Context Isolation: Why It Matters
- Orchestrating Multi-Agent Workflows
- MCP Integration: Connecting to the Outside World
- Hooks: Building Quality Gates Into Your Pipeline
- Building Reliable, Repeatable Pipelines
- Best Practices for Production Agents
- Putting It All Together: A Real Pipeline
- Summary: The Agent Revolution
The ReAct Pattern: How Agents Actually Think
Before we jump into code, you need to understand the mental model. Agents don't work like traditional programs where you give instructions and get output. They work using the ReAct pattern: Reason, Act, Observe, Repeat.
Here's the loop:
- Reason: The agent reads the task, analyzes what needs to happen, and decides what action to take.
- Act: The agent takes an action (calling a tool, running code, making a decision, delegating to another agent).
- Observe: The agent reads the result of its action and evaluates whether it worked.
- Repeat: Based on what it observed, it decides the next step. Maybe it needs to take another action, try a different approach, or declare the task complete.
This cycle continues until the agent either completes the task or hits a failure state. The magic is that the agent can adapt in real-time. If something doesn't work, it notices and adjusts.
Think of it like a developer working on a bug. Read the error, make a change, run tests, check the results, decide next steps. Agents do exactly that, except they're doing it with information, code, external tools, and other agents.
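That loop is easy to sketch in a few lines of Python. Everything here is a toy: `react_loop`, the planner, and the tools are illustrative stand-ins, not a real Claude Code API.

```python
def react_loop(task, tools, plan, max_steps=10):
    """Reason -> Act -> Observe until `plan` declares the task done."""
    observations = []
    for _ in range(max_steps):
        action, arg = plan(task, observations)      # Reason
        if action == "done":
            return arg                               # task complete
        result = tools[action](arg)                  # Act
        observations.append((action, arg, result))   # Observe
    raise RuntimeError("Agent hit the step limit without finishing")

# A toy "agent" that looks a number up, then doubles it
tools = {
    "lookup": lambda key: {"answer": 21}[key],
    "double": lambda x: x * 2,
}

def plan(task, observations):
    if not observations:
        return ("lookup", "answer")             # first, fetch the value
    if len(observations) == 1:
        return ("double", observations[0][2])   # then act on what we saw
    return ("done", observations[1][2])         # finally, report the result

print(react_loop("double the answer", tools, plan))  # → 42
```

The point isn't the toy tools; it's the shape: each iteration re-reads the observation history before choosing the next action, which is exactly what lets a real agent adapt when something fails.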
Built-In Agents: Your Foundation
Claude Code comes with three heavyweight agents out of the box:
Explore Agent: This one knows your codebase. Feed it a question like "find all authentication handlers" or "where do we validate user input?" and it systematically searches files, builds a mental model, and reports back. It's like having a senior developer who can instantly grok any unfamiliar repo.
Plan Agent: Give it a feature request or architecture question, and it breaks it down into executable steps. It analyzes dependencies, identifies risks, and produces a plan you can actually follow. It's less "here's what you should do" and more "here's how to do it properly."
General-Purpose Agent: The Swiss Army knife. It reasons about tasks, decides whether to read code, run commands, check databases, or call other agents. It's your orchestrator for tasks that don't fit a specific mold.
You access these through the Task tool. Instead of doing everything yourself, you delegate. The agent operates with isolated context, meaning it doesn't bloat your main conversation. You ask it something, it goes off and figures it out, and you get a summary back. Clean. Efficient.
```
Task: Explore the authentication module
Agent: Explore
Isolation: Yes (uses own context window)
Result: Returns findings without token overhead
```
Creating Custom Subagents: The Real Power
Here's where it gets interesting. These built-in agents are great, but they can't know about your specific domain, your architecture decisions, or the gotchas unique to your system.
That's why you create custom subagents. These are specialized agents you define in .claude/agents/ that understand your specific problem space.
Let's say you're building a data validation system. Instead of asking a general-purpose agent "validate this data," you create a ValidationAgent that understands your exact validation rules, your error taxonomy, and your domain constraints.
Here's what that looks like:
```markdown
# .claude/agents/validation-agent.md
---
name: validation-agent
type: specialized
purpose: Validate data against domain-specific rules
---

## Role
You are a data validation specialist for our platform. You understand:
- Our schema definitions in /schemas/
- Our business rules in /rules/
- Our error taxonomy in /docs/errors.md
- Previous validation decisions (for consistency)

## Capabilities
- Parse and validate JSON/YAML data
- Check business rule violations
- Return structured validation results
- Flag ambiguous cases for human review

## Constraints
- Never auto-fix data; only validate
- Return detailed error messages with line numbers
- Flag severity: ERROR, WARNING, INFO
```

When you spawn this agent, it has explicit context about validation. It's not guessing. It's operating from a documented set of rules.
You spawn it with the Task tool:
```
Task: Validate this payload against our schema
Agent: validation-agent
Payload: [JSON data]
```
The agent returns structured results. No ambiguity. No hallucination about rules it doesn't know.
Context Isolation: Why It Matters
Here's a gotcha nobody talks about: if you have your orchestrator agent do everything itself, its context window gets absolutely packed. It's reading files, considering options, making decisions, all in its own head. Token usage explodes. Response time tanks. You're paying for a bloated conversation.
Context isolation fixes this. When you spawn a subagent, it gets its own isolated context window. It operates independently, then reports back. The orchestrator stays lean.
Think of it like this: the orchestrator is the manager. It doesn't need to know how to do every single task. It needs to know which specialist to call, what to ask them, and how to use the result. The specialists each have their domain knowledge in their own context.
```
Orchestrator Context: ~500 tokens (task definitions, decisions)
Subagent A Context:   1000 tokens (domain-specific knowledge)
Subagent B Context:    800 tokens (different domain)
Subagent C Context:    900 tokens (third domain)

Total tokens: 3200
If orchestrator did everything: 8000+ tokens
Savings: 60%+
```
This matters in production. Faster responses. Lower cost. Better quality (because agents aren't context-starved and making dumb mistakes).
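Those savings are simple arithmetic, and worth checking against the budget above:

```python
# Token budgets from the breakdown above
orchestrator = 500
subagents = [1000, 800, 900]

isolated = orchestrator + sum(subagents)
monolithic = 8000                # everything crammed into one context
savings = 1 - isolated / monolithic

print(isolated)           # → 3200
print(f"{savings:.0%}")   # → 60%
```

The exact numbers will vary by workload; the pattern (orchestrator small, specialists each carrying their own knowledge) is what produces the gap.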
Orchestrating Multi-Agent Workflows
Now we're getting to the fun part. You have multiple specialized agents. How do you coordinate them?
Let's build a real example: Code Review Pipeline. When you submit code, it needs to go through multiple checks:
- Correctness (does it compile, does it work?)
- Security (no vulns, no secrets exposed?)
- Performance (no bottlenecks, efficient algorithm?)
- Style (consistent with house standards?)
- Documentation (is it understandable?)
Instead of one agent trying to do all five, you create five specialized agents and orchestrate them.
```mermaid
flowchart TD
    R["Router Agent<br/>Reads submission, decides workflow"]
    C["Correctness Agent<br/>Runs tests, checks syntax"]
    S["Security Agent<br/>Scans for vulns, checks secrets"]
    P["Performance Agent<br/>Profiles, analyzes complexity"]
    St["Style Agent<br/>Lints, checks conventions"]
    D["Documentation Agent<br/>Reviews clarity, examples"]
    Agg["Aggregator Agent<br/>Combines results, prioritizes issues"]
    Rep["Report Agent<br/>Formats findings for human review"]
    R --> C & S & P & St & D
    C & S & P & St & D --> Agg --> Rep
    style R fill:#3b82f6,stroke:#0f172a,stroke-width:2px,color:#0f172a
    style C fill:#22c55e,stroke:#0f172a,stroke-width:2px,color:#0f172a
    style S fill:#ef4444,stroke:#0f172a,stroke-width:2px,color:#0f172a
    style P fill:#f59e0b,stroke:#0f172a,stroke-width:2px,color:#0f172a
    style St fill:#8b5cf6,stroke:#0f172a,stroke-width:2px,color:#0f172a
    style D fill:#06b6d4,stroke:#0f172a,stroke-width:2px,color:#0f172a
    style Agg fill:#ec4899,stroke:#0f172a,stroke-width:2px,color:#0f172a
    style Rep fill:#3b82f6,stroke:#0f172a,stroke-width:2px,color:#0f172a
```

This isn't sequential. It's parallel. While the Security Agent is running its scans, the Performance Agent is profiling. By the time they all finish, you have complete analysis in a fraction of the time.
Here's how you implement this:
```python
# Pseudocode for orchestrator
async def code_review_pipeline(submission):
    # Step 1: Route
    router = await spawn_agent("router-agent", submission)
    config = await router.analyze()

    # Step 2: Parallel execution
    tasks = [
        spawn_agent("correctness-agent", submission, config),
        spawn_agent("security-agent", submission, config),
        spawn_agent("performance-agent", submission, config),
        spawn_agent("style-agent", submission, config),
        spawn_agent("documentation-agent", submission, config),
    ]
    results = await asyncio.gather(*tasks)

    # Step 3: Aggregate
    aggregator = await spawn_agent("aggregator-agent", results)
    prioritized = await aggregator.rank_issues()

    # Step 4: Report
    reporter = await spawn_agent("report-agent", prioritized)
    report = await reporter.format()
    return report
```

Each agent runs independently with its own context. The orchestrator coordinates. The result is comprehensive analysis in seconds.
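The fan-out/fan-in shape is plain `asyncio` under the hood, and you can exercise it with a stub in place of the (hypothetical) `spawn_agent` call:

```python
import asyncio

# Stub standing in for spawning a real subagent
async def spawn_agent(name, submission):
    await asyncio.sleep(0.1)  # pretend each "agent" takes ~100 ms
    return {"agent": name, "status": "passed"}

async def review(submission):
    agents = ["correctness", "security", "performance", "style", "docs"]
    # Fan out: all five run concurrently, so wall time is ~one agent's time
    results = await asyncio.gather(
        *(spawn_agent(a, submission) for a in agents)
    )
    # Fan in: collapse into a single summary for the aggregator
    return {r["agent"]: r["status"] for r in results}

print(asyncio.run(review("def add(a, b): return a + b")))
```

Five sequential 100 ms agents would take half a second; `gather` finishes in roughly 100 ms total, which is the whole argument for parallel orchestration.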
MCP Integration: Connecting to the Outside World
Here's the thing: agents can reason about your code and files, but they can't access external data on their own. That's where MCP (Model Context Protocol) comes in.
MCP is essentially a standard way to connect Claude to external tools, APIs, and data sources. Think of it as a bridge. Your agent says "I need to check the database" and MCP makes that request, gets the data, and feeds it back to the agent.
Common MCP integrations:
- GitHub MCP: Agents can query repos, read issues, check PRs
- Database MCP: Agents can run read-only queries, check schemas
- Slack MCP: Agents can read messages, understand context from conversations
- Weather API MCP: Agents can get real-time data
- Custom MCPs: You build them for your own systems
Let's say your validation agent needs to check if a user ID exists in your database before proceeding. Without MCP, it can't. With MCP:
```
Agent: "Is user_id=42 valid?"
        ↓
MCP Handler: Executes query against database
        ↓
Result: "Yes, user exists and status=active"
        ↓
Agent: "Great, I can proceed with validation"
```
The agent doesn't care about connection strings or SQL syntax. It just asks, MCP handles it.
Setting up MCP looks like this:
```yaml
# .claude/mcp-servers.yaml
servers:
  database:
    type: postgresql
    host: localhost
    port: 5432
    database: production
    read_only: true
  github:
    type: github
    token: ${GITHUB_TOKEN}
    org: my-company
```

Then in your agent, you reference MCP:
```markdown
## MCP Resources
- database: Read-only queries to check data integrity
- github: Query issues and PRs for context

When validating data, use MCP:
1. Check if referenced IDs exist (database)
2. Look for related GitHub issues (github)
3. Return validation result with context
```
The agent automatically gets access. No manual API calls. No parsing responses. Just clean integration.
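The bridge idea is worth making concrete. This is not the real MCP wire protocol, just a toy dispatcher showing the division of labor: the agent names a tool and arguments, the handler owns the connection details, and a fake in-memory "database" stands in for PostgreSQL.

```python
# Fake database standing in for the real read-only connection
USERS = {42: {"status": "active"}, 7: {"status": "banned"}}

def handle_tool_call(tool, args):
    """Translate an agent's tool request into a read-only lookup."""
    if tool == "database.get_user":
        user = USERS.get(args["user_id"])
        if user is None:
            return {"exists": False}
        return {"exists": True, "status": user["status"]}
    raise ValueError(f"Unknown tool: {tool}")

# The "agent" never sees connection strings or SQL; it just calls the tool
reply = handle_tool_call("database.get_user", {"user_id": 42})
print(reply)  # → {'exists': True, 'status': 'active'}
```

A real MCP server does the same job over a standardized protocol, which is what lets any MCP-aware agent use any MCP server without custom glue.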
Hooks: Building Quality Gates Into Your Pipeline
Remember those quality gates from earlier? "Never deploy code that fails tests." "Always document breaking changes." "Flag security issues."
You don't want to manually check these every time. You want them automated. That's what hooks are for.
Hooks are events that fire at specific points in your workflow. Pre-task, post-task, before deployment, after testing, whatever matters to you.
```yaml
# .claude/hooks/pre-task.yaml
---
name: validate-input
event: pre-task
triggers:
  - task_type: code-review
  - task_type: deployment
actions:
  - spawn: security-scanner
    check: secrets-exposed
    block_on_failure: true
  - spawn: schema-validator
    check: input-matches-schema
    block_on_failure: true
  - log: hooks-executed
    destination: .claude/loop/evidence_log.md
```

When a task starts, these hooks fire automatically. If the security scanner finds exposed secrets, the task stops. The agent doesn't proceed. A human has to review first.
This is how you build reliability. Not by hoping things are okay, but by making the pipeline enforce correctness automatically.
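The `block_on_failure` semantics are simple to pin down in code. This sketch is illustrative (the hook names and checks are made up), but it shows the contract: checks run in order, and the first blocking failure stops the task before it starts.

```python
def run_hooks(hooks, payload):
    """Run pre-task checks; stop at the first blocking failure."""
    for hook in hooks:
        passed = hook["check"](payload)
        if not passed and hook.get("block_on_failure"):
            return {"blocked_by": hook["name"]}  # task never starts
    return {"blocked_by": None}

hooks = [
    {"name": "secrets-exposed",
     "check": lambda p: "AKIA" not in p,   # crude AWS-key scan, for illustration
     "block_on_failure": True},
    {"name": "schema-valid",
     "check": lambda p: p.startswith("{"),
     "block_on_failure": True},
]

print(run_hooks(hooks, '{"key": "AKIA123..."}'))  # → {'blocked_by': 'secrets-exposed'}
print(run_hooks(hooks, '{"key": "safe"}'))        # → {'blocked_by': None}
```

The real hook runner lives inside Claude Code; the value of writing it out is seeing that a quality gate is just an ordered list of checks with a hard stop.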
Post-task hooks are equally powerful:
```yaml
# .claude/hooks/post-task.yaml
---
name: validate-output
event: post-task
actions:
  - spawn: quality-checker
    validate: output-quality
    minimum_score: 3.0
  - spawn: continuity-checker
    validate: consistency-with-codebase
  - spawn: test-runner
    validate: all-tests-pass
    block_on_failure: true
  - commit-and-notify:
      if: all-validations-pass
      action: commit
      message: "Auto-commit via post-task hook"
      notify: slack
      channel: "#releases"
```

The task finishes. Post-task hooks automatically validate the output. If everything passes, it commits. If something fails, it notifies the team and waits for human review.
You've just built a self-healing, self-checking pipeline.
Building Reliable, Repeatable Pipelines
Theory is nice. Here's how you make this work in practice.
**First: Explicit prompts**. Your agents need crystal-clear instructions. Not "validate this data" but "validate this data according to the rules in /schemas/rules.yaml, and return results in JSON format with the fields: status, errors[], warnings[]".
````markdown
# .claude/agents/data-validator.md

## Task
Validate input data against defined schemas

## Input Format
```json
{
  "data": {...},
  "schema_name": "string",
  "strict_mode": boolean
}
```

## Output Format
```json
{
  "status": "valid" | "invalid" | "warning",
  "errors": [{"field": "...", "message": "...", "code": "..."}],
  "warnings": [{"field": "...", "message": "..."}],
  "metadata": {"validated_at": "iso8601", "schema_version": "..."}
}
```

## Constraints
- Return valid JSON (parse it before returning)
- Never return null fields (use empty arrays)
- Always include metadata
- Flag severity: ERROR, WARNING, INFO
````
Explicit. Unambiguous. No room for interpretation.
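A contract this explicit is also mechanically checkable. A minimal sketch, assuming the output format above (`check_contract` and its rules are illustrative, not part of Claude Code):

```python
import json

def check_contract(raw):
    """Verify an agent reply against the output contract above."""
    result = json.loads(raw)  # must be valid JSON, or this raises
    errors = []
    if result.get("status") not in ("valid", "invalid", "warning"):
        errors.append("bad status")
    for field in ("errors", "warnings"):
        if result.get(field) is None:  # null fields are forbidden
            errors.append(f"{field} must be an array, not null")
    if "metadata" not in result:
        errors.append("metadata is required")
    return errors

good = '{"status": "valid", "errors": [], "warnings": [], "metadata": {}}'
bad = '{"status": "valid", "errors": null, "warnings": []}'
print(check_contract(good))  # → []
print(check_contract(bad))   # → ['errors must be an array, not null', 'metadata is required']
```

Pair this with a post-task hook and malformed agent output never reaches the next stage.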
**Second: Checkpoint state**. Don't assume state persists. Explicitly pass state between steps.
```python
# Right way: pass state forward explicitly
step1_result = await spawn_agent("step1", task)
step2_result = await spawn_agent(
    "step2",
    task=task,
    previous_results=step1_result,  # Explicit state pass
    checkpoint_id=checkpoint_id,
)

# Wrong way: step2 never sees step1's output
step1_result = await spawn_agent("step1", task)
step2_result = await spawn_agent("step2", task)  # Lost step1 context
```
**Third: Error handling**. Agents fail. Data gets corrupted. APIs time out. You need explicit failure modes.
```python
try:
    result = await spawn_agent("risky-operation", task)
except AgentFailureError as e:
    if e.is_recoverable:
        # Retry with backoff
        await retry_with_backoff(spawn_agent, ("risky-operation", task))
    else:
        # Escalate to human
        await notify_human(f"Unrecoverable failure: {e.message}")
        raise
```

**Fourth: Cost monitoring**. Multi-agent workflows can get expensive fast. Track token usage.
```yaml
# .claude/monitoring/cost-tracking.yaml
track_agents:
  - validation-agent
  - security-agent
  - performance-agent
alerts:
  - if: monthly_cost > $500
    action: notify
    severity: warning
  - if: single_task_cost > $5
    action: notify
    severity: warning
  - if: agent_avg_cost > expected_cost * 1.5
    action: investigate
    severity: alert
```

You monitor costs like you'd monitor performance metrics. Because they're both critical in production.
Best Practices for Production Agents
Let me give you the patterns that actually work when you're running agents in production:
1. Version your agents. As you refine agents, their behavior changes. Track versions.
```markdown
# .claude/agents/validation-agent.md
---
version: 2.1.0
previous_versions:
  - 2.0.0 (added business_rule_5)
  - 1.5.0 (fixed edge case in email validation)
changelog: |
  v2.1.0: Improved error messages, added context hints
  v2.0.0: Extended validation rules for new schema
---
```

When you deploy, you know exactly which agent version is running.
2. Log everything. Not just errors. Decision points, transitions, timing.
```python
logger.debug(f"Task {task_id}: Spawning {agent_name} v{version}")
logger.debug(f"  Context size: {context_tokens} tokens")
logger.debug(f"  Timeout: {timeout}s")

result = await spawn_agent(agent_name, task)

logger.info(f"Task {task_id}: {agent_name} completed")
logger.info(f"  Result: {result.status}")
logger.info(f"  Duration: {result.duration}ms")
logger.info(f"  Tokens used: {result.tokens}")
```

Three months later, when something goes wrong, you have a complete trace.
3. Test your agents. Write test cases for each agent with known inputs and expected outputs.
```python
# tests/test_validation_agent.py
test_cases = [
    {
        "name": "Valid user data",
        "input": {"email": "user@example.com", "age": 25},
        "expected": {"status": "valid", "errors": []},
    },
    {
        "name": "Invalid email",
        "input": {"email": "not-an-email", "age": 25},
        "expected": {
            "status": "invalid",
            "errors": [{"field": "email", "code": "INVALID_FORMAT"}],
        },
    },
    {
        "name": "Age out of range",
        "input": {"email": "user@example.com", "age": 150},
        "expected": {
            "status": "invalid",
            "errors": [{"field": "age", "code": "OUT_OF_RANGE"}],
        },
    },
]

# await needs an async context; run via pytest-asyncio or asyncio.run()
async def test_validation_agent():
    for test in test_cases:
        result = await spawn_agent("validation-agent", test["input"])
        assert result == test["expected"], f"Failed: {test['name']}"
```

Run these before any production deployment. Catch regressions immediately.
4. Implement timeouts. Agents can hang. Always set timeouts.
```python
result = await asyncio.wait_for(
    spawn_agent("analysis-agent", task),
    timeout=30.0,  # 30-second max
)
```

If the agent doesn't complete in time, you fail gracefully instead of waiting forever.
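"Gracefully" means catching `asyncio.TimeoutError` rather than letting it crash the pipeline. A runnable sketch, with a deliberately slow stub in place of a real agent:

```python
import asyncio

async def slow_agent(task):
    await asyncio.sleep(10)  # stands in for an agent that hangs
    return "done"

async def run_with_timeout(task, timeout=0.1):
    try:
        return await asyncio.wait_for(slow_agent(task), timeout=timeout)
    except asyncio.TimeoutError:
        # Fail gracefully: log it, mark the task, move on
        return "timed out: escalating to human review"

print(asyncio.run(run_with_timeout("analyze repo")))
```

`wait_for` cancels the underlying coroutine on timeout, so the hung agent doesn't keep burning tokens in the background.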
5. Monitor quality metrics. Not just success/failure. Quality of results.
```yaml
# Quality metrics
quality_checks:
  - agent: validation-agent
    metric: accuracy
    calculation: correctly_flagged_errors / total_errors
    target: 0.99
  - agent: security-agent
    metric: false_positive_rate
    calculation: incorrect_alerts / total_alerts
    target: 0.05
  - agent: code-review-agent
    metric: helpfulness
    calculation: developer_found_feedback_useful / total_reviews
    target: 0.85
```

When an agent's quality dips, you investigate. Maybe the ruleset changed. Maybe the model needs different prompting. Either way, you catch it.
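Evaluating those targets is a one-liner per metric; the only subtlety is that accuracy should be high while false-positive rate should be low. A sketch with made-up counts (all names and numbers here are illustrative):

```python
def ratio(numerator, denominator):
    return numerator / denominator if denominator else 1.0

metrics = {
    # accuracy: higher is better
    "validation-agent": {"score": ratio(991, 1000), "target": 0.99, "higher_is_better": True},
    # false_positive_rate: lower is better
    "security-agent": {"score": ratio(12, 300), "target": 0.05, "higher_is_better": False},
}

failing = [
    name for name, m in metrics.items()
    if (m["score"] < m["target"]) == m["higher_is_better"]
]
print(failing)  # → []
```

Feed real counts into a nightly job, alert on a non-empty `failing` list, and quality regressions surface before users notice them.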
Putting It All Together: A Real Pipeline
Let's build one complete, production-ready pipeline. Let's say you're processing user signups. Each signup needs to:
- Validate schema (schema validation agent)
- Check against security rules (security agent)
- Verify email deliverability (email validation agent via MCP)
- Check for fraud signals (fraud detection agent)
- Create user record if all pass (database agent via MCP)
- Log results (audit agent)
Here's how it works:
```python
async def process_signup(signup_data):
    task_id = generate_id()
    checkpoint = {"task_id": task_id, "started_at": now()}

    # Stage 1: Validation
    try:
        schema_result = await spawn_agent(
            "schema-validator",
            input=signup_data,
            checkpoint=checkpoint,
        )
        checkpoint["schema_valid"] = schema_result.status == "valid"
        if not checkpoint["schema_valid"]:
            return error_response("Invalid schema", schema_result.errors)
    except AgentFailureError as e:
        logger.error(f"Schema validation failed: {e}")
        return error_response("Validation error", str(e))

    # Stage 2: Parallel security and email checks
    security_task = spawn_agent(
        "security-validator",
        input=signup_data,
        checkpoint=checkpoint,
    )
    email_task = spawn_agent(
        "email-validator",
        input=signup_data,
        checkpoint=checkpoint,
        use_mcp="email-verification",
    )
    security_result, email_result = await asyncio.gather(
        security_task,
        email_task,
        return_exceptions=True,
    )
    # gather(return_exceptions=True) returns exception objects instead of
    # raising, so check for them before touching result attributes
    for r in (security_result, email_result):
        if isinstance(r, Exception):
            logger.error(f"Stage 2 agent failed: {r}")
            return error_response("Check failed", str(r))

    checkpoint["security_passed"] = security_result.status == "passed"
    checkpoint["email_valid"] = email_result.is_deliverable
    if not checkpoint["security_passed"]:
        return error_response("Security check failed", security_result.issues)
    if not checkpoint["email_valid"]:
        return error_response("Invalid email", email_result.reason)

    # Stage 3: Fraud detection
    fraud_result = await spawn_agent(
        "fraud-detector",
        input=signup_data,
        checkpoint=checkpoint,
        use_mcp=["ip-reputation", "email-reputation"],
    )
    checkpoint["fraud_score"] = fraud_result.score
    if fraud_result.score > 0.7:
        logger.warning(f"Fraud alert: {task_id}, score={fraud_result.score}")
        checkpoint["needs_manual_review"] = True

    # Stage 4: Create user (if all checks pass)
    if checkpoint.get("needs_manual_review"):
        return pending_response("Review required", task_id)
    create_result = await spawn_agent(
        "user-creator",
        input=signup_data,
        checkpoint=checkpoint,
        use_mcp="database",
    )
    checkpoint["user_created"] = create_result.user_id
    checkpoint["completed_at"] = now()

    # Stage 5: Audit
    await spawn_agent(
        "audit-logger",
        checkpoint=checkpoint,
        use_mcp="audit-database",
    )

    logger.info(f"Signup processed: {task_id}, user={create_result.user_id}")
    return success_response(create_result.user_id)
```

This is a complete, production-grade pipeline:
- Explicit stages with error handling
- Parallel execution where possible (security + email)
- State checkpointing between stages
- MCP integration for external data
- Logging and audit trails
- Human escalation for edge cases
It's reliable because failures are explicit. It's efficient because stages run in parallel. It's maintainable because each agent has a single responsibility.
Summary: The Agent Revolution
You're not writing procedural code anymore. You're composing intelligent, specialized agents that reason through problems, integrate with external systems, and scale horizontally.
The ReAct pattern (Reason → Act → Observe → Repeat) is how agents think. Context isolation keeps your orchestrator lean. Custom agents bring domain expertise. MCP integration connects to the real world. Hooks enforce quality gates. Multi-agent orchestration handles complexity.
In production, you use explicit prompts, checkpoint state, handle errors, monitor costs, version agents, log decisions, test thoroughly, enforce timeouts, and track quality metrics.
Build it right, and you've got a system that's faster than manual work, more reliable than brittle scripts, and more adaptable than hard-coded logic.
That's the power of agents. Now go build something remarkable.