March 10, 2026
AI Claude Automation

Building Autonomous Agents with Claude Tools

Here's what nobody tells you about building AI agents: the agent logic is the easy part. The tool definitions are where you'll either ship something reliable or spend three weeks debugging hallucinated function calls at 2 AM.

I've built agents that orchestrate data pipelines, handle customer support escalations, and manage infrastructure deployments—all powered by Claude's tool use capabilities. The pattern that separates agents that work from agents that catastrophically fail in production comes down to one thing: how well you define the boundaries of what the agent can and cannot do.

We're going deep on autonomous agent architecture with Claude. Not the "hello world" chatbot stuff. Real decision loops, real error recovery, real guardrails. Let's get into it.

Table of Contents
  1. What Makes an Agent an Agent
  2. Tool Use Fundamentals: The Foundation of Everything
  3. The Agent Decision Loop
  4. Error Recovery: When Things Go Sideways
  5. Real-World Agent: Data Pipeline Orchestrator
  6. Agent Guardrails: Preventing Catastrophe
  7. Confirmation Gates for Risky Actions
  8. Scope Limiting
  9. Token and Cost Budgets
  10. DevOps Agent: A Complete Example
  11. Common Pitfalls and How to Avoid Them
  12. Where This Is All Heading

What Makes an Agent an Agent

Before we write a line of code, let's get clear on terminology. A chatbot takes input and returns output. An agent takes input, reasons about what to do, takes action, observes the result, and decides what to do next. It loops. It adapts. It makes decisions without you holding its hand at every step.

The core architecture looks like this:

  1. Observe — receive input or read the result of a previous action
  2. Think — reason about what to do next given the current state
  3. Act — call a tool or produce output
  4. Observe — check the result of that action
  5. Repeat — continue until the task is complete or a termination condition is hit

That observe-think-act loop is the heartbeat of every agent. Claude is particularly well-suited for this because its tool use implementation is native to the model—it doesn't bolt on function calling as an afterthought. Claude reasons about when to use tools, which tools to use, and what parameters to pass as part of its core inference process.

The implication? Your tool definitions aren't just API specs. They're part of the prompt. They shape how Claude thinks about the problem. This matters more than you probably realize.

Tool Use Fundamentals: The Foundation of Everything

Tool use in Claude works through a schema-based system. You define tools with names, descriptions, and parameter schemas. Claude decides when to call them based on the conversation context and task requirements.

Here's the anatomy of a well-defined tool:

json
{
  "name": "query_database",
  "description": "Execute a read-only SQL query against the production analytics database. Returns up to 1000 rows. Use this when the user asks questions about metrics, user behavior, or historical data. Do NOT use this for mutations—use update_database for writes. Queries longer than 30 seconds will timeout.",
  "input_schema": {
    "type": "object",
    "properties": {
      "query": {
        "type": "string",
        "description": "A valid PostgreSQL SELECT query. Must include a LIMIT clause. No DDL or DML statements allowed."
      },
      "database": {
        "type": "string",
        "enum": ["analytics", "metrics", "user_events"],
        "description": "Which database to query. 'analytics' for aggregate data, 'metrics' for system performance, 'user_events' for raw event logs."
      },
      "timeout_seconds": {
        "type": "integer",
        "default": 10,
        "description": "Query timeout in seconds. Max 30. Set higher for complex aggregations."
      }
    },
    "required": ["query", "database"]
  }
}

Notice what's happening in that description field. It's not just "runs a SQL query." It tells Claude:

  • When to use the tool (questions about metrics, user behavior, historical data)
  • When NOT to use the tool (mutations go elsewhere)
  • Constraints (read-only, 1000 row limit, 30 second timeout)
  • Disambiguation (which database enum means what)

This is the hidden layer that most people miss entirely. Vague tool descriptions produce vague tool usage. "Queries a database" tells Claude nothing about when to pick this tool over another, what guardrails exist, or what to expect. You get hallucinated column names, missing LIMIT clauses, and queries against the wrong database.

Invest 80% of your time in tool design, 20% in agent logic. I'm not exaggerating. The tool definitions are doing the heavy lifting. They're the rails your agent runs on.
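Descriptions are advisory, though; the executor still has to enforce the constraints. A minimal server-side validator for the query_database tool above might look like this. This is a sketch: the function name and the exact rules are illustrative, not a prescribed API.

```python
import re

ALLOWED_DATABASES = {"analytics", "metrics", "user_events"}
FORBIDDEN = re.compile(
    r"\b(INSERT|UPDATE|DELETE|DROP|ALTER|CREATE|TRUNCATE|GRANT)\b", re.IGNORECASE
)

def validate_query_input(tool_input: dict) -> list:
    """Return a list of constraint violations; empty means safe to execute."""
    problems = []
    query = tool_input.get("query", "")
    if not re.match(r"\s*SELECT\b", query, re.IGNORECASE):
        problems.append("Only SELECT statements are allowed.")
    if FORBIDDEN.search(query):
        problems.append("Query contains a forbidden DDL/DML keyword.")
    if not re.search(r"\bLIMIT\s+\d+", query, re.IGNORECASE):
        problems.append("Query must include a LIMIT clause.")
    if tool_input.get("database") not in ALLOWED_DATABASES:
        problems.append("Unknown database: %r" % tool_input.get("database"))
    if tool_input.get("timeout_seconds", 10) > 30:
        problems.append("timeout_seconds may not exceed 30.")
    return problems
```

If the list is non-empty, return it to Claude as a tool result instead of executing—the model will usually correct the query on the next turn.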

The Agent Decision Loop

Now let's build the actual loop. Here's a production-grade agent pattern in Python using the Anthropic SDK:

python
import anthropic
import json
 
client = anthropic.Anthropic()
 
TOOLS = [
    {
        "name": "search_knowledge_base",
        "description": "Search internal documentation and knowledge base articles. Returns the top 5 most relevant results with snippets. Use this FIRST before attempting to answer questions about company policies, product features, or internal processes.",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "Natural language search query"
                },
                "category": {
                    "type": "string",
                    "enum": ["product", "billing", "technical", "policy", "all"],
                    "description": "Filter results by category. Use 'all' when unsure."
                }
            },
            "required": ["query"]
        }
    },
    {
        "name": "create_support_ticket",
        "description": "Create a new support ticket for issues that require human follow-up. Use this when: (1) the issue cannot be resolved through knowledge base answers, (2) the user explicitly requests human support, or (3) the issue involves account-level changes like refunds or plan modifications. Always confirm with the user before creating a ticket.",
        "input_schema": {
            "type": "object",
            "properties": {
                "subject": {"type": "string", "description": "Brief ticket subject line"},
                "description": {"type": "string", "description": "Detailed description of the issue"},
                "priority": {
                    "type": "string",
                    "enum": ["low", "medium", "high", "urgent"],
                    "description": "Ticket priority. Use 'urgent' only for service outages or security issues."
                },
                "customer_email": {"type": "string", "description": "Customer's email address"}
            },
            "required": ["subject", "description", "priority", "customer_email"]
        }
    },
    {
        "name": "lookup_account",
        "description": "Look up a customer account by email or account ID. Returns account status, plan type, and recent activity. Use this when you need to verify account details or check subscription status.",
        "input_schema": {
            "type": "object",
            "properties": {
                "identifier": {"type": "string", "description": "Email address or account ID"},
                "identifier_type": {
                    "type": "string",
                    "enum": ["email", "account_id"],
                    "description": "Whether the identifier is an email or account ID"
                }
            },
            "required": ["identifier", "identifier_type"]
        }
    }
]
 
def execute_tool(tool_name: str, tool_input: dict) -> str:
    """Route tool calls to actual implementations."""
    if tool_name == "search_knowledge_base":
        return search_kb(tool_input["query"], tool_input.get("category", "all"))
    elif tool_name == "create_support_ticket":
        return create_ticket(tool_input)
    elif tool_name == "lookup_account":
        return lookup_account(tool_input["identifier"], tool_input["identifier_type"])
    else:
        return json.dumps({"error": f"Unknown tool: {tool_name}"})
 
def run_agent(user_message: str, max_iterations: int = 10):
    """Main agent loop with bounded iteration."""
    messages = [{"role": "user", "content": user_message}]
 
    for iteration in range(max_iterations):
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=4096,
            system="You are a customer support agent. Be helpful, concise, and empathetic. Always search the knowledge base before answering product questions. Never guess at account details—look them up.",
            tools=TOOLS,
            messages=messages
        )
 
        # Check if Claude wants to use tools
        if response.stop_reason == "tool_use":
            # Process all tool calls in this response
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    print(f"[Agent] Calling tool: {block.name}")
                    result = execute_tool(block.name, block.input)
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": result
                    })
 
            # Add assistant response and tool results to conversation
            messages.append({"role": "assistant", "content": response.content})
            messages.append({"role": "user", "content": tool_results})
 
        elif response.stop_reason == "end_turn":
            # Agent is done—extract final text response
            final_text = "".join(
                block.text for block in response.content if hasattr(block, "text")
            )
            print(f"[Agent] Completed in {iteration + 1} iterations")
            return final_text
 
    return "I wasn't able to resolve this within the expected number of steps. Let me create a ticket for human follow-up."

Let's break down what's happening here. The run_agent function is the decision loop. Each iteration, Claude receives the current conversation state—including all previous tool calls and their results—and decides what to do next. It either calls another tool (stop_reason is "tool_use") or produces a final response (stop_reason is "end_turn").

The max_iterations parameter is critical. Without it, you've got a runaway loop that burns through your API budget and never terminates. I've seen agents get stuck in cycles—searching the knowledge base, not finding what they want, searching again with slightly different terms, forever. Cap it. Ten iterations is generous for most use cases. Five is often enough.
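A cheap companion to the iteration cap is cycle detection: track the exact tool calls you've already executed and intervene when the agent repeats itself. A sketch (the class name and repeat threshold are arbitrary choices, not part of the SDK):

```python
import json

def make_signature(tool_name: str, tool_input: dict) -> str:
    """Canonical signature for a tool call, stable under dict key ordering."""
    return tool_name + ":" + json.dumps(tool_input, sort_keys=True)

class CycleDetector:
    """Flags when the agent repeats the exact same tool call too many times."""

    def __init__(self, max_repeats: int = 2):
        self.max_repeats = max_repeats
        self.counts = {}

    def record(self, tool_name: str, tool_input: dict) -> bool:
        """Record a call; return True once it exceeds the repeat limit."""
        sig = make_signature(tool_name, tool_input)
        self.counts[sig] = self.counts.get(sig, 0) + 1
        return self.counts[sig] > self.max_repeats
```

In the loop, when record() returns True, append a tool result telling Claude the call is repeating and to change strategy or give up—that usually breaks the cycle faster than waiting for the iteration cap.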

Error Recovery: When Things Go Sideways

Production agents fail. APIs timeout. Databases go down. Tool calls return garbage. Your agent needs to handle all of this gracefully, not crash and leave the user staring at a 500 error.

Here's an error recovery pattern that actually works in production:

python
import time
import json
from typing import Optional
 
class ToolExecutionError(Exception):
    def __init__(self, tool_name: str, message: str, retryable: bool = False):
        self.tool_name = tool_name
        self.message = message
        self.retryable = retryable
        super().__init__(message)
 
def execute_tool_with_recovery(
    tool_name: str,
    tool_input: dict,
    max_retries: int = 3,
    base_delay: float = 1.0
) -> str:
    """Execute a tool call with retry logic and fallback behavior."""
 
    last_error: Optional[Exception] = None
 
    for attempt in range(max_retries):
        try:
            result = execute_tool(tool_name, tool_input)
 
            # Validate the result isn't empty or malformed. Plain-text
            # results are fine as-is; only JSON payloads get parsed.
            try:
                parsed = json.loads(result) if isinstance(result, str) else result
            except json.JSONDecodeError:
                parsed = result
            if parsed is None:
                raise ToolExecutionError(
                    tool_name, "Tool returned null result", retryable=True
                )
 
            return result
 
        except ToolExecutionError as e:
            last_error = e
            if not e.retryable or attempt == max_retries - 1:
                break
            delay = base_delay * (2 ** attempt)  # Exponential backoff
            print(f"[Recovery] {tool_name} failed (attempt {attempt + 1}), retrying in {delay}s")
            time.sleep(delay)
 
        except TimeoutError:
            last_error = ToolExecutionError(tool_name, "Tool call timed out", retryable=True)
            if attempt == max_retries - 1:
                break
            delay = base_delay * (2 ** attempt)
            time.sleep(delay)
 
        except Exception as e:
            last_error = e
            break  # Unknown errors are not retryable
 
    # All retries exhausted—return error context to Claude
    # so it can reason about what to do next
    error_response = {
        "error": True,
        "tool": tool_name,
        "message": str(last_error),
        "suggestion": get_fallback_suggestion(tool_name)
    }
    return json.dumps(error_response)
 
def get_fallback_suggestion(tool_name: str) -> str:
    """Provide Claude with fallback guidance when a tool fails."""
    fallbacks = {
        "search_knowledge_base": "The knowledge base is temporarily unavailable. Provide a general answer based on your training and let the user know you couldn't verify against current documentation.",
        "lookup_account": "Account lookup is failing. Ask the user to provide their account details manually, or offer to create a support ticket.",
        "create_support_ticket": "Ticket creation is failing. Provide the user with the support email address (support@example.com) as an alternative."
    }
    return fallbacks.get(tool_name, "This tool is currently unavailable. Consider alternative approaches or inform the user.")

The key insight here is that when a tool fails, you don't just return an error code. You return the error to Claude with a suggestion for what to do instead. Claude then reasons about the failure and adapts—maybe it tries a different approach, maybe it tells the user about the problem, maybe it falls back to a manual process.

This is where agents get genuinely interesting. They don't just follow a script. They handle the unexpected. The error response becomes part of the conversation context, and Claude's next decision is informed by what went wrong.

Real-World Agent: Data Pipeline Orchestrator

Let me show you what a more complex agent looks like. This one orchestrates a data pipeline—it checks data quality, runs transformations, validates outputs, and reports results:

python
PIPELINE_TOOLS = [
    {
        "name": "check_data_freshness",
        "description": "Check when source data tables were last updated. Returns the last update timestamp for each table. Use this BEFORE running any transformations to ensure you're working with recent data. If any table is stale (>24 hours), flag it and ask whether to proceed.",
        "input_schema": {
            "type": "object",
            "properties": {
                "tables": {
                    "type": "array",
                    "items": {"type": "string"},
                    "description": "List of source table names to check"
                }
            },
            "required": ["tables"]
        }
    },
    {
        "name": "run_dbt_model",
        "description": "Execute a specific dbt model transformation. Returns row counts, execution time, and any warnings. This modifies data in the warehouse—use check_data_freshness first. Models can take up to 5 minutes for large tables.",
        "input_schema": {
            "type": "object",
            "properties": {
                "model_name": {"type": "string", "description": "dbt model name (e.g., 'stg_orders', 'fct_revenue')"},
                "full_refresh": {"type": "boolean", "default": False, "description": "Whether to do a full refresh instead of incremental. Use sparingly—full refreshes on large tables are expensive."}
            },
            "required": ["model_name"]
        }
    },
    {
        "name": "run_data_quality_check",
        "description": "Run data quality assertions against a table. Returns pass/fail status for each assertion (not null, unique, referential integrity, range checks). Always run this AFTER a transformation completes.",
        "input_schema": {
            "type": "object",
            "properties": {
                "table_name": {"type": "string"},
                "checks": {
                    "type": "array",
                    "items": {"type": "string", "enum": ["not_null", "unique", "referential_integrity", "row_count", "value_range"]},
                    "description": "Which quality checks to run"
                }
            },
            "required": ["table_name", "checks"]
        }
    },
    {
        "name": "send_notification",
        "description": "Send a Slack notification about pipeline status. Use this to report completion, failures, or data quality issues. Include relevant metrics in the message.",
        "input_schema": {
            "type": "object",
            "properties": {
                "channel": {"type": "string", "enum": ["#data-alerts", "#data-team", "#engineering"]},
                "message": {"type": "string"},
                "severity": {"type": "string", "enum": ["info", "warning", "error"]}
            },
            "required": ["channel", "message", "severity"]
        }
    }
]

When you feed this agent a task like "run the daily revenue pipeline," it will autonomously: check data freshness on source tables, run the staging models first, validate their output, run the fact table models, validate again, and then send a Slack notification with the results. If data freshness checks fail, it flags the issue and asks whether to proceed. If a quality check fails after transformation, it reports the specific failures and doesn't continue downstream.

This is the power of well-defined tools. The agent's behavior emerges from the tool descriptions. The "use this BEFORE" and "always run this AFTER" directives in the descriptions create an implicit workflow without you coding an explicit state machine.
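If you want a hard backstop behind the description-driven workflow, you can also enforce the ordering in the executor. A sketch under assumed semantics—the precondition map and class are illustrative, not part of any framework:

```python
# Tools that must have completed successfully before a given tool may run.
PRECONDITIONS = {
    "run_dbt_model": {"check_data_freshness"},
    "run_data_quality_check": {"run_dbt_model"},
}

class WorkflowGuard:
    """Rejects tool calls whose declared prerequisites haven't completed."""

    def __init__(self):
        self.completed = set()

    def check(self, tool_name: str):
        """Return an error string if prerequisites are unmet, else None."""
        missing = PRECONDITIONS.get(tool_name, set()) - self.completed
        if missing:
            return "Cannot run %s yet: run %s first." % (
                tool_name, ", ".join(sorted(missing)))
        return None

    def mark_done(self, tool_name: str):
        self.completed.add(tool_name)
```

Return the check() message as a tool result when it fires; Claude will reorder its plan rather than fail.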

Agent Guardrails: Preventing Catastrophe

Autonomous agents need boundaries. Without them, you're giving a very enthusiastic intern the keys to your production infrastructure. Here are the guardrails that have saved me from actual disasters:

Confirmation Gates for Risky Actions

python
CONFIRMATION_REQUIRED = {
    "create_support_ticket": "low",    # Always confirm with user
    "run_dbt_model": "medium",         # Confirm if full_refresh=True
    "delete_records": "high",          # Always require explicit approval
    "deploy_service": "critical"       # Require approval + reason
}
 
def should_require_confirmation(tool_name: str, tool_input: dict) -> bool:
    """Determine if a tool call needs human confirmation."""
    risk_level = CONFIRMATION_REQUIRED.get(tool_name)
 
    if risk_level is None:
        return False  # Read-only tools don't need confirmation
 
    if risk_level == "critical":
        return True  # Always confirm
 
    if risk_level == "high":
        return True  # Always confirm
 
    if risk_level == "medium":
        # Conditional confirmation based on parameters
        if tool_name == "run_dbt_model" and tool_input.get("full_refresh"):
            return True
        return False
 
    if risk_level == "low":
        return True  # Confirm, though callers may treat low-risk confirmations as non-blocking
 
    return False
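Wiring this into the loop means intercepting tool calls before execution. A sketch with dependencies injected for clarity—the ask_human callback is a stand-in for whatever approval channel you actually use (a CLI prompt, a Slack button, an approvals queue):

```python
import json

def execute_with_confirmation(tool_name, tool_input, execute_fn,
                              requires_confirmation, ask_human):
    """Gate risky tool calls behind a human approval callback.

    execute_fn(name, input) runs the tool; requires_confirmation(name, input)
    is the risk policy (e.g. should_require_confirmation above); ask_human
    returns True if a human approves. A declined call is reported back to
    Claude as an error result rather than raised, so the agent can adapt.
    """
    if requires_confirmation(tool_name, tool_input):
        if not ask_human(tool_name, tool_input):
            return json.dumps({
                "error": True,
                "tool": tool_name,
                "message": ("A human declined to approve this action. "
                            "Explain the situation and suggest alternatives."),
            })
    return execute_fn(tool_name, tool_input)
```

The key design choice: a declined action flows back into the conversation as data, not as an exception, so the agent can explain itself to the user instead of dying.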

Scope Limiting

Don't give your agent tools it doesn't need. A customer support agent doesn't need a "delete_user_account" tool. A data pipeline agent doesn't need a "deploy_to_production" tool. Every tool you add is a capability the agent might use in unexpected ways.

Think about it this way: if an adversarial user crafted the perfect prompt injection to make your agent misbehave, what's the worst thing it could do with the tools available? If the answer makes you uncomfortable, remove those tools.
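One simple way to enforce scope is a single master registry plus an explicit per-agent allowlist, so adding a tool to an agent is a deliberate act. A sketch—the allowlist constants are hypothetical examples:

```python
def select_tools(all_tools, allowed):
    """Return only the tool definitions this agent is permitted to use.

    Raises on unknown names so a typo in the allowlist fails loudly at
    startup instead of silently shrinking the agent's capabilities.
    """
    by_name = {tool["name"]: tool for tool in all_tools}
    missing = set(allowed) - set(by_name)
    if missing:
        raise ValueError("Unknown tools in allowlist: %s" % sorted(missing))
    return [by_name[name] for name in sorted(allowed)]

# Hypothetical allowlists: the support agent never sees pipeline tools.
SUPPORT_AGENT_TOOLS = {"search_knowledge_base", "lookup_account",
                       "create_support_ticket"}
PIPELINE_AGENT_TOOLS = {"check_data_freshness", "run_dbt_model",
                        "run_data_quality_check", "send_notification"}
```

Pass the filtered list to the tools parameter of each agent's API calls; the model never learns that the other tools exist.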

Token and Cost Budgets

python
def run_agent_with_budget(
    user_message: str,
    max_iterations: int = 10,
    max_input_tokens: int = 100_000,
    max_output_tokens: int = 20_000
):
    """Agent loop with token budget enforcement."""
    total_input_tokens = 0
    total_output_tokens = 0
    messages = [{"role": "user", "content": user_message}]
 
    for iteration in range(max_iterations):
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=4096,
            tools=TOOLS,
            messages=messages
        )
 
        total_input_tokens += response.usage.input_tokens
        total_output_tokens += response.usage.output_tokens
 
        if total_input_tokens > max_input_tokens:
            return "Budget exceeded: too many input tokens consumed. Stopping agent."
        if total_output_tokens > max_output_tokens:
            return "Budget exceeded: too many output tokens consumed. Stopping agent."
 
        # ... rest of agent loop

This prevents runaway agents from burning through your API credits. Set reasonable budgets based on your use case. A support agent answering a single question probably shouldn't need more than 50K input tokens. If it does, something has gone wrong.

DevOps Agent: A Complete Example

Let me sketch out a DevOps agent that handles incident response. This is the kind of agent that actually justifies the "autonomous" label—it needs to investigate, diagnose, and take corrective action without a human babysitting every step:

The tool set includes: check_service_health (ping endpoints, check status pages), query_logs (search application and infrastructure logs), get_metrics (pull from Prometheus/Datadog), scale_service (adjust replica counts—requires confirmation), rollback_deployment (revert to previous version—requires confirmation), and page_oncall (escalate to a human).

The agent's system prompt establishes the decision framework: "You are an incident response agent. When alerted to an issue, investigate systematically: check service health, query recent logs for errors, examine metrics for anomalies. Diagnose before acting. For corrective actions like scaling or rollbacks, explain your reasoning and get confirmation. If you can't diagnose the issue within 5 investigation steps, escalate to the on-call engineer."

Notice the layered guardrails. The agent can investigate freely—querying logs and metrics is read-only and safe. But corrective actions require confirmation. And there's an escape hatch: if investigation stalls, the agent escalates rather than thrashing.
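To make that concrete, here's what two of those tool definitions might look like, following the same description conventions as the earlier examples. These are illustrative sketches—the field names, limits, and channel references are assumptions, not a real incident-tooling API:

```python
# Illustrative definitions for two of the incident-response tools described above.
INCIDENT_TOOLS = [
    {
        "name": "query_logs",
        "description": (
            "Search application and infrastructure logs. Read-only and safe to "
            "call freely during investigation. Returns up to 200 matching lines. "
            "Use this after check_service_health to find the errors behind an alert."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "service": {"type": "string",
                            "description": "Service name to search logs for"},
                "query": {"type": "string",
                          "description": "Search expression, e.g. 'status:500'"},
                "since_minutes": {"type": "integer", "default": 30,
                                  "description": "How far back to search. Max 1440."},
            },
            "required": ["service", "query"],
        },
    },
    {
        "name": "rollback_deployment",
        "description": (
            "Revert a service to its previous deployed version. DESTRUCTIVE: this "
            "changes production. Only use after diagnosing that the current deploy "
            "caused the incident, and always get human confirmation first. "
            "If unsure, use page_oncall instead."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "service": {"type": "string"},
                "reason": {"type": "string",
                           "description": "One-sentence justification, logged for audit"},
            },
            "required": ["service", "reason"],
        },
    },
]
```

Note how the descriptions encode the investigation-before-action policy: the safe tool advertises that it's safe, and the destructive one names its own escape hatch.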

Common Pitfalls and How to Avoid Them

After building a dozen production agents, here are the failure modes I see repeatedly:

Vague tool descriptions. I said it at the top and I'll say it again. "Searches the database" is useless. "Executes a read-only PostgreSQL SELECT query against the analytics database, returning up to 1000 rows, with a maximum timeout of 30 seconds" is useful. Be exhaustively specific about what the tool does, when to use it, and what constraints exist.

Missing stop conditions. Agents without termination criteria run forever. Always cap iterations. Always set token budgets. Always have a fallback response for when the agent can't complete its task.

Too many tools. I've seen people give agents 30+ tools. Claude handles it better than most models, but decision quality degrades as the tool count climbs. Keep it under 15 tools for any single agent. If you need more, break it into specialized sub-agents.

No error context in tool responses. When a tool fails, don't just return {"error": true}. Return what went wrong, why it might have happened, and what the agent should try instead. Claude can reason about failures—but only if you give it something to reason about.

Skipping the system prompt. The system prompt is where you establish the agent's decision-making framework, its priorities, and its boundaries. "You are a helpful assistant" is not a system prompt. "You are a data pipeline agent. Always check data freshness before running transformations. Never run full refreshes without confirmation. If quality checks fail, stop the pipeline and report" is a system prompt.

Where This Is All Heading

Agent architectures are evolving fast. The pattern I've shown here—single agent, flat tool set, linear decision loop—is the foundation. But production systems are already moving toward multi-agent architectures where specialized agents coordinate, hierarchical tool access where sub-agents have narrower tool sets than their parent, and persistent agent memory where context carries across sessions.

The fundamentals don't change though. Well-defined tools. Bounded decision loops. Graceful error recovery. Appropriate guardrails. Nail these four things and everything else is an optimization.

The teams I see shipping the most reliable agents aren't the ones writing the cleverest prompts. They're the ones writing the best tool definitions. They're treating tool schemas like API contracts—versioned, documented, tested. They're investing in the boring infrastructure that makes autonomous behavior safe and predictable.

That's the real hidden layer here. Everyone wants to talk about agent reasoning and decision-making and emergent behavior. Nobody wants to talk about tool description copy-editing and error response schemas and token budget accounting. But that's where the actual reliability lives. The agent is only as good as its tools.

Build your tools like you're writing documentation for a very capable but very literal new hire. Be precise. Be explicit. Cover the edge cases. And always, always cap the iteration count.
