Multi-Agent Systems: Coordination Patterns for LLM Agents
You've probably wondered how teams of AI agents can actually work together without tripping over each other. The honest answer? It's harder than it sounds. But we've cracked patterns that make it work, and they're surprisingly elegant once you understand the underlying principles.
Multi-agent systems are becoming essential infrastructure as LLM applications grow more sophisticated. A single agent has real limitations - it can hallucinate, miss nuances, and bottleneck on complex reasoning. But coordinate multiple specialized agents? That's where real power emerges. Let's explore the coordination patterns that separate production-grade multi-agent systems from the toy versions.
Consider the simplest case: you ask an agent a question. The agent reasons through it, hallucinates a plausible-sounding but incorrect fact, and returns an answer. You have no way to know it's wrong without external verification. Now imagine instead that you ask a specialist agent, who returns an answer. Then you ask a critic agent who specifically looks for flaws in that answer. Then you ask a synthesizer agent who weighs both perspectives and returns a final answer. If disagreement is flagged as a concern, you route the question to a human for final review.
This is multi-agent coordination in practice. Not because three agents are always better than one. But because they specialize in different things. One agent is good at quick reasoning. One agent is good at critical thinking. One agent is good at synthesis. Combine them, and you get something better than any individual agent.
The catch is coordination overhead. Three agents need to communicate, pass context, and handle failures. If agent two crashes, what do agents one and three do? If agent one takes 10 seconds while agents two and three take 1 second, do we wait for agent one or proceed with partial information? These questions define the architecture, and getting them wrong is how most multi-agent systems fail.
Table of Contents
- Why Multi-Agent Systems Matter
- The Performance Gains Are Real
- The Coordination Landscape: From Loose to Tight
- Loose Coupling: Independent Agents
- Tight Coupling: Shared State
- Hierarchical Coordination: Manager Pattern
- Mesh Coordination: Full Communication
- Production-Grade Coordination: The Hybrid Approach
- Practical Coordination Patterns
- Pattern 1: The Orchestrator
- Pattern 2: Agent Router with Direct Communication
- Pattern 3: Message Bus with Event Coordination
- Pattern 4: Consensus-Based Decisions
- Common Coordination Challenges and Solutions
- Challenge 1: Circular Dependencies
- Challenge 2: Cascading Failures
- Challenge 3: State Consistency Across Agents
- Production Considerations
- Monitoring Multi-Agent Systems
- Scaling Multi-Agent Systems
- Real-World Case Study: Multi-Agent Code Review System
- Key Takeaways
Why Multi-Agent Systems Matter
Before diving into patterns, we need to understand the why. Single-agent architectures hit a wall as task complexity grows. When you ask one agent to be a researcher, analyst, coder, and validator simultaneously, quality degrades. Specialization works.
Think about it: Would you hire one person to be your entire engineering team, or would you hire specialists who collaborate? The same logic applies to AI. Routing customer queries to a domain expert agent beats routing everything to a generalist. Having a validator agent catch mistakes beats hoping the first agent gets it right.
The challenge is coordination. How do agents discover each other? How do they communicate? What happens when one fails? How do they agree on an answer? These questions define the architecture.
In production systems, multi-agent coordination becomes the critical path. A single slow agent blocks everything. A failure in one agent can cascade. Lack of context sharing means duplicate work. These aren't theoretical concerns - they're operational realities that matter every day.
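To make the failure point concrete, here's a minimal sketch (hypothetical agents, using asyncio) of containing one agent's crash so its siblings still finish instead of cascading:

```python
import asyncio

# Hypothetical agents: one of the three crashes mid-task
async def researcher(q: str) -> str:
    return f"research on: {q}"

async def critic(q: str) -> str:
    raise RuntimeError("critic crashed")

async def synthesizer(q: str) -> str:
    return f"synthesis of: {q}"

async def ask_all(question: str) -> list:
    # return_exceptions=True converts a crash into a returned value instead
    # of cancelling the sibling tasks, so the other agents still complete
    results = await asyncio.gather(
        researcher(question), critic(question), synthesizer(question),
        return_exceptions=True,
    )
    return [r if not isinstance(r, Exception) else "FAILED" for r in results]

answers = asyncio.run(ask_all("market trends"))
# answers: ['research on: market trends', 'FAILED', 'synthesis of: market trends']
```

The crash becomes a visible value the orchestrator can route around, rather than an exception that takes the whole workflow down.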
The Performance Gains Are Real
Let's ground this with a concrete example. You're building a code review system using LLMs. You could use one large model to review everything - style, security, performance, tests. That model would need to be highly capable and would take 30-60 seconds to review a typical PR. Or you could use four specialized agents: one for style (5s), one for security (10s), one for performance (8s), one for tests (6s). Run them in parallel and you have results in 10s, not 30s. Plus, each agent is specialized and can use a smaller, cheaper model. Total cost drops by 60% while latency drops by 70%.
This scales. If you need 100 code reviews per day, one omniscient agent requires 100 * 45s = 75 minutes of sequential inference per day. Four agents in parallel finish in 100 * 10s = roughly 17 minutes of wall-clock time - nearly 58 minutes of review latency saved every day. Total compute across the four specialists is higher per review, but they run on smaller, cheaper models, so the cost savings compound meaningfully over a year.
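The arithmetic is worth sanity-checking, with the numbers taken directly from the example above:

```python
# One generalist at ~45s per review vs four specialists run in parallel,
# where parallel wall-clock latency equals the slowest specialist
reviews_per_day = 100
generalist_seconds = 45
specialist_seconds = [5, 10, 8, 6]   # style, security, performance, tests

parallel_latency = max(specialist_seconds)                      # 10s per review
generalist_minutes = reviews_per_day * generalist_seconds / 60  # 75.0
parallel_minutes = reviews_per_day * parallel_latency / 60      # ~16.7
saved_minutes = generalist_minutes - parallel_minutes           # ~58.3
```

Note that the 10s figure is wall-clock latency; total model time per review is the sum of the four specialists (29s), just spread across cheaper models.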
The deeper win is accuracy. Each agent specializes. The security agent sees security-specific patterns. The performance agent sees performance anti-patterns. Neither agent gets distracted trying to be good at everything. Specialized models also converge faster during any additional fine-tuning you do. Your security agent, trained specifically on security issues, will outperform a generalist model on security even if the generalist is much larger.
Multi-agent systems introduce a layer of complexity that single-agent systems don't have. When you have multiple agents working together, you need mechanisms for them to communicate, coordinate, and resolve conflicts. An agent might need to ask another agent for information. One agent might be waiting for another to finish before it can proceed. Multiple agents might try to modify the same resource concurrently. Without proper coordination, you get chaos.
The coordination challenge becomes acute as you scale agents. With two agents, you might be able to manage with simple message passing. With ten agents, you need structured protocols. With a hundred agents, you need a sophisticated orchestration system. The architecture that worked for two agents doesn't work for ten, and neither works for a hundred. You need to think about scalability from the beginning.
The types of coordination problems are diverse. There's the sequencing problem: Agent A must finish before Agent B starts. There's the aggregation problem: Agent A and Agent B both work on a problem independently, then their results are combined. There's the conditional routing problem: based on Agent A's output, either Agent B or Agent C runs. There's the conflict resolution problem: when two agents propose different solutions, how do you decide which one to use? Each of these requires different coordination patterns.
The tradeoff between loose and tight coupling is central to the design. Loosely coupled agents are independent and resilient; if one fails, others continue. But loose coupling makes it hard to ensure correctness. If an agent fails silently and the system doesn't detect it, downstream agents might make decisions based on incorrect data. Tightly coupled agents ensure correctness and synchronization, but they're brittle; if one fails, everything stops. Production systems need to navigate this tradeoff carefully, often using hybrid approaches that are tight where it matters for correctness but loose where it helps with resilience.
The Coordination Landscape: From Loose to Tight
Coordination patterns exist on a spectrum. At one end, you have loose coupling: agents work independently with minimal communication. At the other, tight coupling: agents share state constantly and depend on each other's success. Production systems usually live in the middle, trading off flexibility for reliability.
Loose Coupling: Independent Agents
In loose coupling, agents operate independently. They receive a task, work on it, produce output. No real coordination beyond the initial request.
When it works: Tasks that decompose cleanly. Customer support routing: route to billing agent or technical agent based on query content. Each agent works independently with its own knowledge base and tools.
When it breaks: Tasks requiring real collaboration. Building a complex software architecture requires discussion, iteration, feedback. One agent designing the database and another agent designing the API without coordination creates incompatibilities. You catch them weeks later in integration testing.
Real-world example: A document classification system where different agents handle different document types. No coordination needed - each agent is specialized for its type. But if document understanding requires cross-type context (understanding legal implications of a technical specification), independent agents fail.
Tight Coupling: Shared State
Tight coupling means agents share mutable state. They update a shared database, memory structure, or message queue. Every decision gets written; every agent reads before acting.
When it works: Tasks where every agent needs global context. The best example: multi-turn negotiations. Agent A makes an offer, Agent B sees it in shared state, Agent C realizes the implications and escalates. All agents working from the same facts.
When it breaks: At scale, contention becomes a killer. If every agent needs to lock the shared database to act, you've just serialized your multi-agent system. The benefit of parallelism vanishes.
Real-world example: A customer service escalation system where multiple agents see the same ticket and need to coordinate. All good until peak load: 100 agents all trying to read/update the same shared ticket state. Your database becomes the bottleneck, not the agents.
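You can demonstrate that serialization in miniature. In this hypothetical sketch, 20 "agents" all update one shared ticket behind a single lock, and a meter records how many ever run concurrently:

```python
import threading
import time

ticket_lock = threading.Lock()          # the shared ticket's lock
ticket = {"updates": 0}
concurrency = {"active": 0, "peak": 0}
meter_lock = threading.Lock()           # protects the concurrency meter

def agent_update() -> None:
    with ticket_lock:                   # every agent serializes here
        with meter_lock:
            concurrency["active"] += 1
            concurrency["peak"] = max(concurrency["peak"], concurrency["active"])
        time.sleep(0.001)               # simulate the read-modify-write
        ticket["updates"] += 1
        with meter_lock:
            concurrency["active"] -= 1

threads = [threading.Thread(target=agent_update) for _ in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# peak concurrency is 1: the lock turned 20 parallel agents into a queue
```

All 20 updates land safely, but peak concurrency never exceeds 1 - exactly the serialization the paragraph above describes.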
Hierarchical Coordination: Manager Pattern
A manager agent orchestrates. Workers report to the manager. The manager assigns tasks, collects results, and synthesizes answers. Classic divide-and-conquer.
When it works: Well-structured problems with clear task decomposition. Research task: manager breaks it into "search academic papers", "summarize findings", "identify limitations". Each worker does their thing. Manager synthesizes.
When it breaks: When workers need to iterate or negotiate. Manager can become a bottleneck. If a worker finishes early and needs another worker's output, they can't communicate directly - they go back to the manager, who queues the work. Latency suffers.
Real-world example: A code review system. Manager (orchestrator) gets a PR. It asks Agent-Backend to review backend changes, Agent-Frontend to review frontend changes. Both work in parallel, report back. Manager synthesizes feedback and writes the summary. Clean and works great. But if Agent-Backend discovers a frontend issue during review, it has to report to the manager, who then re-routes to Agent-Frontend. Iteration is slow.
Mesh Coordination: Full Communication
Every agent can talk to every other agent. P2P network of agents. Maximum flexibility, maximum complexity.
When it works: Small teams with strong protocols. 3-5 agents, clear communication rules, message versioning. Each agent knows how to handle messages from every other agent.
When it breaks: At scale (10+ agents), you have N² potential communication paths. Every pair needs protocol agreement. If Agent-3 releases with a new message format, all agents it talks to need updates. Version drift creeps in. Debugging distributed failures becomes nightmare territory.
Real-world example: Multi-LLM debate systems. Agent-A proposes something. Agent-B critiques. Agent-C asks clarifying questions. Agent-A responds to all. This works for small debates (3-4 agents). With 10 agents and overlapping rounds, tracking who said what and enforcing consistency becomes painful.
Production-Grade Coordination: The Hybrid Approach
The sweet spot in production systems? Hybrid coordination:
- Manager layer for task decomposition (structured, clear)
- Direct agent communication for specific coordination needs (efficient)
- Shared context layer for facts everyone needs (consistency)
- Message versioning for evolution (safety)
This combines the structure of hierarchy with the efficiency of direct communication.
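Here's a minimal sketch of how those layers might fit together. The class names and wiring are illustrative, not a production design, and message versioning is omitted for brevity:

```python
from typing import Dict, List

class SharedContext:
    """Consistency layer: facts every agent can read."""
    def __init__(self):
        self.facts: Dict[str, str] = {}

class HybridAgent:
    def __init__(self, name: str, context: SharedContext):
        self.name = name
        self.context = context
        self.inbox: List[str] = []       # direct-communication channel
    def work(self, task: str) -> str:
        result = f"{self.name} did: {task}"
        self.context.facts[self.name] = result   # publish to shared layer
        return result
    def tell(self, peer: "HybridAgent", message: str) -> None:
        peer.inbox.append(f"{self.name}: {message}")

class Manager:
    """Hierarchy layer: decomposes the task and fans it out."""
    def __init__(self, context: SharedContext):
        self.context = context
        self.agents: List[HybridAgent] = []
    def run(self, task: str) -> List[str]:
        return [a.work(f"{task} ({a.name} slice)") for a in self.agents]

ctx = SharedContext()
manager = Manager(ctx)
backend = HybridAgent("backend", ctx)
frontend = HybridAgent("frontend", ctx)
manager.agents += [backend, frontend]

results = manager.run("build feature")        # manager decomposes
backend.tell(frontend, "schema is ready")     # direct coordination
```

The manager owns decomposition, the shared context holds facts everyone needs, and agents still have a direct channel for point-to-point coordination.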
Practical Coordination Patterns
Let's move from theory to code. Here are the patterns you'll actually implement.
Pattern 1: The Orchestrator
One central orchestrator manages the workflow. Workers are stateless - they receive work, produce output, send back. Think Kubernetes scheduler + containers.
# Orchestrator pattern
from typing import Any, Dict, List
from dataclasses import dataclass
from enum import Enum
class AgentType(Enum):
RESEARCHER = "researcher"
ANALYZER = "analyzer"
VALIDATOR = "validator"
@dataclass
class Task:
task_id: str
description: str
required_agents: List[AgentType]
status: str = "pending" # pending, assigned, in_progress, completed
results: Dict[str, Any] = None
class Orchestrator:
"""Manages task decomposition and agent coordination."""
def __init__(self):
self.agents = {} # AgentType -> Agent instance
self.tasks = {} # task_id -> Task
self.work_queue = []
def register_agent(self, agent_type: AgentType, agent):
"""Register an agent for a specific role."""
if agent_type not in self.agents:
self.agents[agent_type] = []
self.agents[agent_type].append(agent)
def decompose_task(self, task: Task) -> List[Task]:
"""Break task into subtasks for different agents."""
subtasks = []
if AgentType.RESEARCHER in task.required_agents:
subtasks.append(Task(
task_id=f"{task.task_id}:research",
description=f"Research aspect of: {task.description}",
required_agents=[AgentType.RESEARCHER]
))
if AgentType.ANALYZER in task.required_agents:
subtasks.append(Task(
task_id=f"{task.task_id}:analyze",
description=f"Analyze aspect of: {task.description}",
required_agents=[AgentType.ANALYZER]
))
return subtasks
def assign_work(self, task: Task):
"""Assign a task to available agents."""
agent_type = task.required_agents[0]
if agent_type in self.agents and self.agents[agent_type]:
agent = self.agents[agent_type][0] # Round-robin in production
task.status = "assigned"
return agent
return None
def execute_task(self, task: Task) -> Dict[str, Any]:
"""Execute a task through assigned agent."""
agent = self.assign_work(task)
if not agent:
raise ValueError(f"No agent available for {task.required_agents}")
# Agent executes work
result = agent.work(task.description)
task.status = "completed"
task.results = result
return result
def synthesize_results(self, parent_task: Task, subtask_results: List[Dict]) -> str:
"""Combine results from multiple agents."""
# In production, use a summarizer agent for this
combined = "\n".join([f"- {r.get('output', str(r))}" for r in subtask_results])
return f"Synthesis of findings:\n{combined}"
# Usage
orchestrator = Orchestrator()
# Register agents
class ResearchAgent:
def work(self, task: str) -> Dict[str, Any]:
return {"output": f"Researched: {task}"}
class AnalyzerAgent:
def work(self, task: str) -> Dict[str, Any]:
return {"output": f"Analyzed: {task}"}
orchestrator.register_agent(AgentType.RESEARCHER, ResearchAgent())
orchestrator.register_agent(AgentType.ANALYZER, AnalyzerAgent())
# Create and execute task
main_task = Task(
task_id="task_001",
description="Understand market trends in AI infrastructure",
required_agents=[AgentType.RESEARCHER, AgentType.ANALYZER]
)
subtasks = orchestrator.decompose_task(main_task)
results = [orchestrator.execute_task(st) for st in subtasks]
synthesis = orchestrator.synthesize_results(main_task, results)
print(synthesis)
# Output: Synthesis of findings:
# - Researched: Research aspect of: Understand market trends in AI infrastructure
# - Analyzed: Analyze aspect of: Understand market trends in AI infrastructure
The orchestrator pattern works beautifully when task decomposition is clear. But it breaks down when agents need to iterate - when research findings affect analysis questions.
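The `# Round-robin in production` comment in `assign_work` can be made concrete with `itertools.cycle`; the agent names here are placeholders:

```python
from itertools import cycle
from typing import List

class RoundRobinPool:
    """Rotate assignments across registered agents of one type."""
    def __init__(self, agents: List[str]):
        self._agents = cycle(agents)
    def next_agent(self) -> str:
        return next(self._agents)

pool = RoundRobinPool(["researcher-1", "researcher-2", "researcher-3"])
assignments = [pool.next_agent() for _ in range(5)]
# wraps around: researcher-1, researcher-2, researcher-3, researcher-1, researcher-2
```

Swapping this pool in for `self.agents[agent_type][0]` spreads load evenly instead of always hitting the first registered agent.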
Pattern 2: Agent Router with Direct Communication
Multiple agents register their capabilities. A router dispatches to the right agent. Agents can also communicate directly (within bounds).
from typing import Callable, Dict, List
from dataclasses import dataclass
@dataclass
class AgentCapability:
name: str
description: str
handler: Callable
class AgentRegistry:
"""Lightweight registry for agent discovery and direct communication."""
def __init__(self):
self.agents = {} # name -> agent_instance
self.capabilities = {} # capability_name -> [agents]
def register_agent(self, name: str, agent, capabilities: List[str]):
"""Register an agent with specific capabilities."""
self.agents[name] = agent
for cap in capabilities:
if cap not in self.capabilities:
self.capabilities[cap] = []
self.capabilities[cap].append(name)
def find_agent_for_capability(self, capability: str) -> str:
"""Find an agent that can handle a capability."""
agents = self.capabilities.get(capability, [])
if agents:
return agents[0] # Round-robin in production
return None
def route_work(self, capability: str, task: str) -> Dict:
"""Route work to appropriate agent."""
agent_name = self.find_agent_for_capability(capability)
if not agent_name:
return {"error": f"No agent for capability: {capability}"}
agent = self.agents[agent_name]
result = agent.handle(task)
result["handled_by"] = agent_name
return result
def send_message(self, from_agent: str, to_agent: str, message: str) -> Dict:
"""Allow direct agent-to-agent communication."""
if to_agent not in self.agents:
return {"error": f"Agent not found: {to_agent}"}
agent = self.agents[to_agent]
return agent.receive_message(from_agent, message)
# Example agent
class SpecialistAgent:
def __init__(self, name: str):
self.name = name
self.memory = []
def handle(self, task: str) -> Dict:
"""Process a work request."""
self.memory.append(task)
return {"output": f"{self.name} handled: {task}"}
def receive_message(self, from_agent: str, message: str) -> Dict:
"""Receive direct message from another agent."""
response = f"{self.name} responding to {from_agent}: {message}"
self.memory.append(response)
return {"response": response}
# Usage
registry = AgentRegistry()
# Register multiple agents
backend_agent = SpecialistAgent("backend_expert")
frontend_agent = SpecialistAgent("frontend_expert")
registry.register_agent("backend_expert", backend_agent, ["backend_design", "database_design"])
registry.register_agent("frontend_expert", frontend_agent, ["frontend_design", "ui_ux"])
# Route work
result = registry.route_work("backend_design", "Design database schema")
print(result)
# Output: {'output': 'backend_expert handled: Design database schema', 'handled_by': 'backend_expert'}
# Direct agent communication
msg_result = registry.send_message("backend_expert", "frontend_expert", "What UI components do you need?")
print(msg_result)
# Output: {'response': "frontend_expert responding to backend_expert: What UI components do you need?"}
This pattern gives you flexibility - agents can communicate directly when needed, but the registry enforces discoverability. It scales better than pure mesh because not every pair needs to know about every other pair.
Pattern 3: Message Bus with Event Coordination
Agents publish events and subscribe to topics they care about. Decoupling through pub/sub.
from typing import Callable, Dict, List, Any
from dataclasses import dataclass
from datetime import datetime
@dataclass
class Event:
event_type: str # "research_completed", "analysis_started", etc.
agent_name: str
timestamp: datetime
data: Dict[str, Any]
class EventBus:
"""Pub/Sub event bus for agent coordination."""
def __init__(self):
self.subscribers = {} # event_type -> [callbacks]
self.event_log = []
def subscribe(self, event_type: str, callback: Callable):
"""Subscribe to an event type."""
if event_type not in self.subscribers:
self.subscribers[event_type] = []
self.subscribers[event_type].append(callback)
def publish(self, event: Event):
"""Publish an event."""
self.event_log.append(event)
if event.event_type in self.subscribers:
for callback in self.subscribers[event.event_type]:
try:
callback(event)
except Exception as e:
print(f"Error in subscriber: {e}")
def get_events_for_agent(self, agent_name: str) -> List[Event]:
"""Get all events relevant to an agent."""
return [e for e in self.event_log if e.agent_name == agent_name]
class PubSubAgent:
"""Agent that uses event bus for coordination."""
def __init__(self, name: str, bus: EventBus):
self.name = name
self.bus = bus
self.state = {"completed_tasks": []}
def work(self, task: str):
"""Do work and publish completion event."""
result = f"{self.name} completed: {task}"
self.state["completed_tasks"].append(task)
# Publish completion event
self.bus.publish(Event(
event_type="task_completed",
agent_name=self.name,
timestamp=datetime.now(),
data={"task": task, "result": result}
))
return result
def on_peer_completed(self, event: Event):
"""React to peer agent completion."""
print(f"{self.name} detected {event.agent_name} completion: {event.data['task']}")
# Take action if needed
if "schema" in event.data["task"].lower():
self.work(f"Adjust my implementation for {event.agent_name}'s schema design")
# Usage
bus = EventBus()
# Create agents
backend = PubSubAgent("backend", bus)
frontend = PubSubAgent("frontend", bus)
# Subscribe frontend to backend completions
bus.subscribe("task_completed", frontend.on_peer_completed)
bus.subscribe("task_completed", backend.on_peer_completed)
# Work triggers events
backend.work("Design database schema")
frontend.work("Create API client")
print(f"\nEvent log ({len(bus.event_log)} events):")
for e in bus.event_log:
print(f" {e.event_type}: {e.agent_name}")
The event bus decouples agents completely. They don't know about each other - just the events they care about. This scales well but can create timing issues: what if an event arrives before an agent subscribes?
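One common fix for that race is replay-on-subscribe: since the bus already keeps an event log, a late subscriber can be handed the matching history when it registers. A minimal sketch of the idea:

```python
from typing import Callable, Dict, List

class ReplayingEventBus:
    """Pub/sub bus that replays past events to late subscribers."""
    def __init__(self):
        self.subscribers: Dict[str, List[Callable]] = {}
        self.event_log: List[dict] = []
    def publish(self, event_type: str, data: dict) -> None:
        event = {"type": event_type, "data": data}
        self.event_log.append(event)
        for callback in self.subscribers.get(event_type, []):
            callback(event)
    def subscribe(self, event_type: str, callback: Callable) -> None:
        self.subscribers.setdefault(event_type, []).append(callback)
        # replay history so the new subscriber doesn't miss earlier events
        for event in self.event_log:
            if event["type"] == event_type:
                callback(event)

bus = ReplayingEventBus()
bus.publish("task_completed", {"task": "schema design"})   # nobody listening yet
seen: List[str] = []
bus.subscribe("task_completed", lambda e: seen.append(e["data"]["task"]))
# the late subscriber still receives the earlier event
```

The tradeoff: replayed events may be stale, so subscribers should treat them as history, not fresh signals.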
Pattern 4: Consensus-Based Decisions
When multiple agents disagree, use voting or consensus protocols.
from typing import Dict, List
from enum import Enum
class VoteOption(Enum):
AGREE = "agree"
DISAGREE = "disagree"
ABSTAIN = "abstain"
class ConsensusAgent:
"""Agent that participates in consensus decisions."""
def __init__(self, name: str, expertise: str):
self.name = name
self.expertise = expertise
def evaluate(self, proposal: str) -> tuple:
"""Evaluate a proposal and vote."""
# Simulate domain-specific evaluation
if self.expertise == "backend" and "database" in proposal:
return (VoteOption.AGREE, f"{self.name}: This aligns with backend requirements")
elif self.expertise == "frontend" and "UI" in proposal:
return (VoteOption.AGREE, f"{self.name}: This makes sense for frontend")
else:
return (VoteOption.ABSTAIN, f"{self.name}: Out of my domain")
class ConsensusCoordinator:
"""Manage consensus voting between agents."""
def __init__(self):
self.agents = []
self.votes = {}
def add_agent(self, agent: ConsensusAgent):
self.agents.append(agent)
def get_consensus(self, proposal: str, required_agreement: float = 0.5) -> Dict:
"""Get consensus from agents on a proposal."""
votes = []
reasoning = []
for agent in self.agents:
vote, reason = agent.evaluate(proposal)
votes.append(vote)
reasoning.append(reason)
agrees = sum(1 for v in votes if v == VoteOption.AGREE)
disagrees = sum(1 for v in votes if v == VoteOption.DISAGREE)
total = len([v for v in votes if v != VoteOption.ABSTAIN])
agreement_ratio = agrees / total if total > 0 else 0
consensus = agreement_ratio >= required_agreement
return {
"proposal": proposal,
"consensus_reached": consensus,
"agreement_ratio": agreement_ratio,
"votes": {
"agree": agrees,
"disagree": disagrees,
"abstain": len(votes) - agrees - disagrees
},
"reasoning": reasoning,
"decision": "APPROVED" if consensus else "REJECTED"
}
# Usage
coordinator = ConsensusCoordinator()
backend_agent = ConsensusAgent("backend_expert", "backend")
frontend_agent = ConsensusAgent("frontend_expert", "frontend")
infra_agent = ConsensusAgent("infra_expert", "infrastructure")
coordinator.add_agent(backend_agent)
coordinator.add_agent(frontend_agent)
coordinator.add_agent(infra_agent)
# Test consensus on a proposal
proposal = "Use PostgreSQL for primary database"
result = coordinator.get_consensus(proposal, required_agreement=0.5)
print(f"Proposal: {result['proposal']}")
print(f"Decision: {result['decision']}")
print(f"Agreement: {result['agreement_ratio']:.1%}")
print(f"Votes: {result['votes']}")
print(f"\nReasoning:")
for reason in result['reasoning']:
print(f" {reason}")
Consensus patterns are powerful for critical decisions but slow for routine work. Use them sparingly, only when disagreement creates risk.
Implementing effective coordination requires understanding not just the patterns, but the failure modes that arise when coordination goes wrong. One common failure is deadlock. Agent A is waiting for Agent B to finish, while Agent B is waiting for Agent A. Both are stuck forever. Another failure is livelock. Agents are constantly communicating but never making progress. Or starvation, where one agent is starved of resources because other agents are prioritized. Or cascading failures, where one agent's failure triggers failures in dependent agents.
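Deadlock, at least, is detectable: record which agent is blocked on which, and look for a cycle in the resulting wait-for graph. A minimal depth-first-search sketch:

```python
from typing import Dict, List, Set

def has_deadlock(waits_for: Dict[str, List[str]]) -> bool:
    """Return True if the wait-for graph contains a cycle."""
    visiting: Set[str] = set()   # agents on the current DFS path
    cleared: Set[str] = set()    # agents proven cycle-free

    def dfs(agent: str) -> bool:
        if agent in visiting:
            return True          # we looped back to our own path: deadlock
        if agent in cleared:
            return False
        visiting.add(agent)
        for blocker in waits_for.get(agent, []):
            if dfs(blocker):
                return True
        visiting.discard(agent)
        cleared.add(agent)
        return False

    return any(dfs(agent) for agent in waits_for)

cyclic = has_deadlock({"A": ["B"], "B": ["A"]})    # A and B wait on each other
acyclic = has_deadlock({"A": ["B"], "B": ["C"]})   # a chain, no cycle
```

In a real system the `waits_for` map would be populated from blocked agent calls; running this check periodically turns a silent hang into an alert.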
The challenge is that these failures are often subtle and hard to debug. With deterministic systems, you can reproduce failures reliably and fix them. With non-deterministic agent systems, a failure might happen only under certain conditions. Maybe it only happens when one agent is particularly slow. Maybe it only happens under peak load. Testing becomes critical, but testing multi-agent systems is notoriously difficult. You need to test not just the happy path, but failure modes, race conditions, and edge cases.
Monitoring and observability become even more important with multiple agents. You need to understand not just what each agent is doing, but how they're coordinating. What messages are being passed between agents? What's the latency of each message? Are any agents blocked waiting for responses? Are any agents taking unexpectedly long? You need to track the full flow of work through the system of agents.
One insight that helps tremendously is to make coordination explicit rather than implicit. Instead of agents figuring out coordination on their own through message passing, have an explicit orchestrator that knows the coordination protocol and enforces it. This makes the system more predictable and easier to reason about. It makes failure modes more obvious and easier to handle. The downside is less flexibility, but in production systems, predictability and debuggability often matter more than flexibility.
Common Coordination Challenges and Solutions
Challenge 1: Circular Dependencies
Agent A needs output from Agent B, which needs output from Agent C, which needs output from Agent A. Classic deadlock.
Solution: Break cycles with timeouts and defaults. Agent C doesn't wait forever for Agent A - after a timeout, it uses a reasonable default and proceeds. Agent A eventually gets the real answer and can improve its work.
In code:
import asyncio
from typing import Optional
async def get_with_timeout(agent, task: str, timeout_seconds: float) -> Optional[str]:
"""Get result from agent with timeout fallback."""
try:
result = await asyncio.wait_for(
agent.work_async(task),
timeout=timeout_seconds
)
return result
except asyncio.TimeoutError:
return agent.get_default_result(task)
# Usage
async def workflow():
agent_a_result = await get_with_timeout(agent_a, "Task for A", timeout_seconds=5)
agent_b_result = await get_with_timeout(agent_b, f"Task for B with A's input: {agent_a_result}", timeout_seconds=5)
# Continues even if agents are slow
Challenge 2: Cascading Failures
One agent crashes. It was supposed to produce output that five other agents depend on. All five fail.
Solution: Implement circuit breakers and graceful degradation. If an agent fails, log it, use cached/default results, and alert. Don't let failures cascade.
from enum import Enum
from datetime import datetime, timedelta
class CircuitBreaker:
"""Prevent cascading failures from unreliable agents."""
class State(Enum):
CLOSED = "closed" # Normal operation
OPEN = "open" # Failing, reject calls
HALF_OPEN = "half_open" # Testing recovery
def __init__(self, failure_threshold: int = 5, timeout_seconds: int = 60):
self.state = self.State.CLOSED
self.failure_count = 0
self.last_failure = None
self.failure_threshold = failure_threshold
self.timeout_seconds = timeout_seconds
def call(self, agent_fn, *args, **kwargs):
"""Call agent function with circuit breaker protection."""
if self.state == self.State.OPEN:
if self._timeout_expired():
self.state = self.State.HALF_OPEN
else:
raise Exception("Circuit breaker OPEN for agent")
try:
result = agent_fn(*args, **kwargs)
self._on_success()
return result
except Exception as e:
self._on_failure()
raise e
def _on_success(self):
self.failure_count = 0
self.state = self.State.CLOSED
def _on_failure(self):
self.failure_count += 1
self.last_failure = datetime.now()
if self.failure_count >= self.failure_threshold:
self.state = self.State.OPEN
def _timeout_expired(self) -> bool:
if not self.last_failure:
return True
return datetime.now() - self.last_failure > timedelta(seconds=self.timeout_seconds)
# Usage
breaker = CircuitBreaker(failure_threshold=3)
def agent_work(task: str):
# Simulated agent that fails sometimes
import random
if random.random() < 0.7: # 70% failure rate
raise Exception("Agent failed")
return f"Completed: {task}"
try:
for i in range(10):
result = breaker.call(agent_work, f"task_{i}")
print(f"Success: {result}")
except Exception as e:
print(f"Circuit breaker: {e}")
# Use cached result instead
print("Using cached result for downstream agents")
Challenge 3: State Consistency Across Agents
Agents update shared state. Agent A reads some state, acts on it, writes back. Meanwhile Agent B read the same state, but A's changes invalidate B's assumptions.
Solution: Implement optimistic locking or event sourcing. Every state change is an immutable event. Agents replay events to get current state.
from dataclasses import dataclass
from datetime import datetime
from typing import List
@dataclass
class StateEvent:
event_id: int
agent_name: str
change: dict
timestamp: float
class EventSourcedState:
"""Maintain consistent state through immutable events."""
def __init__(self):
self.events: List[StateEvent] = []
self.event_id_counter = 0
def record_change(self, agent_name: str, change: dict) -> StateEvent:
"""Record a state change as an immutable event."""
event = StateEvent(
event_id=self.event_id_counter,
agent_name=agent_name,
change=change,
timestamp=datetime.now().timestamp()
)
self.events.append(event)
self.event_id_counter += 1
return event
def get_state(self, as_of_event_id: int = None) -> dict:
"""Reconstruct current state by replaying events."""
state = {}
events_to_apply = self.events
if as_of_event_id is not None:
events_to_apply = [e for e in self.events if e.event_id <= as_of_event_id]
for event in events_to_apply:
state.update(event.change)
return state
# Usage
state_log = EventSourcedState()
# Agent A makes a change
state_log.record_change("agent_a", {"database": "designed", "table_count": 5})
# Agent B makes a change
state_log.record_change("agent_b", {"api_endpoints": 12})
# Agent C can ask: what's the current state?
current_state = state_log.get_state()
print(f"Current state: {current_state}")
# Output: {'database': 'designed', 'table_count': 5, 'api_endpoints': 12}
# Or: what was the state at event 0?
old_state = state_log.get_state(as_of_event_id=0)
print(f"State at event 0: {old_state}")
# Output: {'database': 'designed', 'table_count': 5}
Event sourcing removes the consistency problem: state is never ambiguous - it's the product of a deterministic sequence of events.
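The other option mentioned above, optimistic locking, can be sketched with a version counter: a write only succeeds if the state hasn't changed since the writer read it. The class here is illustrative:

```python
from typing import Dict, Tuple

class VersionedState:
    """Optimistic locking: every write declares the version it was based on."""
    def __init__(self):
        self.data: Dict[str, str] = {}
        self.version = 0
    def read(self) -> Tuple[Dict[str, str], int]:
        return dict(self.data), self.version
    def write(self, change: Dict[str, str], based_on: int) -> bool:
        if based_on != self.version:
            return False         # someone wrote first; caller must re-read
        self.data.update(change)
        self.version += 1
        return True

state = VersionedState()
_, v_a = state.read()            # Agent A reads at version 0
_, v_b = state.read()            # Agent B reads at version 0
ok_a = state.write({"database": "postgres"}, v_a)   # succeeds, version -> 1
ok_b = state.write({"database": "mysql"}, v_b)      # stale read, rejected
```

Agent B's stale write is rejected rather than silently overwriting Agent A's change; B re-reads and retries with fresh state.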
Production Considerations
Monitoring Multi-Agent Systems
Multi-agent systems create observability challenges. You need to track:
- Agent health: Is each agent responding? How many requests per second can it handle?
- Coordination latency: How long does a full task decomposition → execution → synthesis take? Where's the bottleneck?
- Failure patterns: Which agents fail together? Which are reliability anchors?
- Communication volume: How many messages are agents exchanging? Are there unnecessary round trips?
Implement structured logging for all agent communications:
import json
from datetime import datetime
from typing import Dict, Any

class AgentAuditLog:
    """Log all agent activities for observability."""

    def __init__(self):
        self.logs = []

    def log_task(self, agent_name: str, task: str, status: str, metadata: Dict[str, Any] = None):
        """Log a task execution."""
        entry = {
            "timestamp": datetime.now().isoformat(),
            "agent": agent_name,
            "task": task,
            "status": status,
            "metadata": metadata or {}
        }
        self.logs.append(entry)
        # In production, send to logging service
        print(json.dumps(entry))

    def log_message(self, from_agent: str, to_agent: str, message_type: str):
        """Log agent-to-agent communication."""
        entry = {
            "timestamp": datetime.now().isoformat(),
            "from": from_agent,
            "to": to_agent,
            "message_type": message_type
        }
        self.logs.append(entry)
        print(json.dumps(entry))

audit_log = AgentAuditLog()

# Log activities
audit_log.log_task("research_agent", "Find AI trends", "started")
audit_log.log_task("research_agent", "Find AI trends", "completed", {"results": 42})
audit_log.log_message("research_agent", "analysis_agent", "share_findings")

Scaling Multi-Agent Systems
As you add more agents, coordination overhead grows. Key strategies:
- Partition agents by domain: Instead of one registry for all agents, have domain-specific registries. Agents within a domain talk; domains coordinate through bridges.
- Async everything: Don't wait synchronously for agent results. Publish work, agents work in parallel, publish results. Your orchestrator doesn't block.
- Timeouts and deadlines: Every task has a timeout. Every agent call has a deadline. When the deadline approaches, synthesize with what you have. Good enough beats late.
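The deadline strategy can be sketched with `asyncio`. This is a minimal illustration, not a production implementation: the agent coroutines are stand-ins with hypothetical latencies, and the deadline value is arbitrary.

```python
import asyncio

# Hypothetical agents with different latencies; in practice these would wrap LLM calls.
async def fast_agent():
    await asyncio.sleep(0.1)
    return "fast result"

async def slow_agent():
    await asyncio.sleep(10)  # will blow past the deadline below
    return "slow result"

async def run_with_deadline(deadline_s: float) -> dict:
    """Run agents in parallel; when the deadline hits, synthesize with what we have."""
    tasks = {
        "fast": asyncio.create_task(fast_agent()),
        "slow": asyncio.create_task(slow_agent()),
    }
    done, pending = await asyncio.wait(tasks.values(), timeout=deadline_s)
    for task in pending:
        task.cancel()  # good enough beats late
    return {name: t.result() for name, t in tasks.items() if t in done}

results = asyncio.run(run_with_deadline(deadline_s=0.5))
print(results)  # only the fast agent made the deadline: {'fast': 'fast result'}
```

The key design choice is cancelling the stragglers rather than awaiting them: the orchestrator returns a partial result on time instead of a complete result late.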
Testing multi-agent systems requires a different mental model than testing single-agent systems. You need to test not just the happy path through a single agent, but the orchestration between agents. You need deterministic test cases where agent outputs are fixed, so you can verify that coordination works correctly. You need to test failure scenarios: what happens when one agent fails? What happens when an agent times out? What happens when network communication is delayed?
One pattern that helps tremendously is the use of deterministic agent configurations for testing. Instead of using the actual LLM, substitute a mock agent that returns predetermined outputs. This lets you test the coordination logic independently of the LLM's behavior. You can test what happens when Agent A returns response X and Agent B receives it. You can test the full orchestration workflow without waiting for LLMs or worrying about non-deterministic behavior.
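A minimal sketch of this mocking pattern, assuming a simple two-agent pipeline where Agent A's output feeds Agent B. The class names `MockAgent` and `Pipeline` are illustrative, not a real framework.

```python
class MockAgent:
    """Returns canned outputs keyed by input, so tests are deterministic."""
    def __init__(self, name: str, canned: dict):
        self.name = name
        self.canned = canned
        self.calls = []  # record every input so tests can assert on coordination

    def run(self, prompt: str) -> str:
        self.calls.append(prompt)
        return self.canned.get(prompt, "unknown")

class Pipeline:
    """Agent A's output feeds Agent B -- the coordination logic under test."""
    def __init__(self, agent_a, agent_b):
        self.agent_a = agent_a
        self.agent_b = agent_b

    def run(self, task: str) -> str:
        return self.agent_b.run(self.agent_a.run(task))

# Verify B received exactly what A produced, with no LLM involved.
a = MockAgent("a", {"summarize": "draft summary"})
b = MockAgent("b", {"draft summary": "polished summary"})
result = Pipeline(a, b).run("summarize")
assert b.calls == ["draft summary"]
assert result == "polished summary"
```

Because the mocks record their inputs, the test can assert not just on the final answer but on what actually flowed between the agents.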
Another key insight is that multi-agent systems amplify the importance of clear communication protocols. When agents are coordinating, they need to understand each other. Ambiguous communication leads to misunderstandings, which lead to errors. Using structured message formats like JSON schemas helps. Agents can validate that incoming messages match the expected format before processing them. This catches coordination bugs early.
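As a sketch of that validation step: production systems might use JSON Schema or Pydantic, but a hand-rolled required-fields/type check is enough to show the idea. The field names here are illustrative.

```python
# Expected shape of an inter-agent message: field name -> required type.
MESSAGE_SCHEMA = {
    "sender": str,
    "recipient": str,
    "message_type": str,
    "payload": dict,
}

def validate_message(message: dict) -> list:
    """Return a list of validation errors; an empty list means well-formed."""
    errors = []
    for field, expected_type in MESSAGE_SCHEMA.items():
        if field not in message:
            errors.append(f"missing field: {field}")
        elif not isinstance(message[field], expected_type):
            errors.append(f"{field}: expected {expected_type.__name__}")
    return errors

good = {"sender": "agent_a", "recipient": "agent_b",
        "message_type": "share_findings", "payload": {"results": 42}}
bad = {"sender": "agent_a", "payload": "not a dict"}

print(validate_message(good))  # []
print(validate_message(bad))   # two missing fields plus a type mismatch
```

Rejecting `bad` at the boundary, with a specific error list, is what turns a silent coordination bug into an immediate, debuggable failure.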
Cost attribution becomes significantly more complex with multiple agents. A user request might trigger a cascade of agent interactions. Agent A calls Agent B, which calls Agent C, which calls Agent D. Each agent might make multiple LLM calls. The total cost is the sum of all these calls. You need to trace this entire orchestration back to the original user request so you can attribute cost accurately. This requires propagating a trace ID through all inter-agent communication.
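Trace-ID propagation can be sketched like this. The per-call costs are invented numbers for illustration; a real system would record token usage from the LLM API and ship the ledger to a metrics store.

```python
import uuid
from collections import defaultdict

cost_ledger = defaultdict(float)  # trace_id -> accumulated cost

def record_llm_call(trace_id: str, agent_name: str, cost_usd: float):
    """Every LLM call an agent makes is attributed to the originating request."""
    cost_ledger[trace_id] += cost_usd

def handle_user_request(question: str) -> str:
    trace_id = str(uuid.uuid4())  # minted once, passed to every downstream agent
    agent_a(trace_id)
    return trace_id

def agent_a(trace_id: str):
    record_llm_call(trace_id, "agent_a", 0.002)
    agent_b(trace_id)  # the trace id travels with the hand-off

def agent_b(trace_id: str):
    record_llm_call(trace_id, "agent_b", 0.005)

trace = handle_user_request("What are the AI trends?")
print(f"total cost for request: ${cost_ledger[trace]:.3f}")
```

The essential discipline is that no agent ever mints its own trace ID mid-cascade; each one forwards the ID it received, so every call rolls up to the original request.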
Scalability of coordination becomes a critical concern as you add more agents. If you have ten agents all trying to coordinate through a central orchestrator, that orchestrator becomes a bottleneck. You need to think about how coordination scales. One approach is hierarchical coordination: groups of agents coordinate locally, then group orchestrators coordinate at a higher level. Another approach is decentralized coordination where agents negotiate with each other directly. Each approach has tradeoffs in terms of scalability, consistency, and debuggability.
Real-World Case Study: Multi-Agent Code Review System
Imagine a code review system using multiple specialized agents:
- StyleAgent: Checks code style, naming conventions
- SecureAgent: Looks for security issues
- PerfAgent: Analyzes performance bottlenecks
- TestAgent: Checks test coverage and quality
Using an orchestrator pattern:
# Code review orchestrator
from typing import Dict

class CodeReviewOrchestrator:
    def __init__(self):
        # StyleAgent, SecurityAgent, etc. are the specialist agents described above
        self.agents = {
            "style": StyleAgent(),
            "security": SecurityAgent(),
            "performance": PerformanceAgent(),
            "testing": TestingAgent()
        }

    def review(self, code: str) -> Dict:
        """Orchestrate full code review."""
        # Sequential here for clarity; in production, run agents in parallel with asyncio
        results = {}
        for agent_name, agent in self.agents.items():
            try:
                results[agent_name] = agent.review(code)
            except Exception as e:
                results[agent_name] = {"error": str(e), "issues": []}

        # Synthesize findings
        all_issues = []
        for agent_name, result in results.items():
            all_issues.extend(result.get("issues", []))

        # Sort by severity
        all_issues.sort(key=lambda x: x.get("severity", 0), reverse=True)

        return {
            "total_issues": len(all_issues),
            "by_category": {name: len(result.get("issues", [])) for name, result in results.items()},
            "issues": all_issues,
            "recommendation": "APPROVE" if len(all_issues) == 0 else "REQUEST_CHANGES"
        }

The key insight: each agent works independently. No coordination overhead. Parallelism is implicit in the design.
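To make that implicit parallelism explicit, the review loop can be run concurrently with `asyncio.gather`. This is a sketch: the stub agent below stands in for the real reviewers, which would be async wrappers around LLM calls.

```python
import asyncio

async def stub_review(agent_name: str, code: str) -> dict:
    """Stand-in for a real reviewer agent; simulates a bit of LLM latency."""
    await asyncio.sleep(0.1)
    return {"agent": agent_name, "issues": []}

async def review_in_parallel(code: str, agent_names: list) -> dict:
    # return_exceptions=True keeps one crashed agent from sinking the whole review
    outcomes = await asyncio.gather(
        *(stub_review(name, code) for name in agent_names),
        return_exceptions=True,
    )
    return {
        name: (out if not isinstance(out, Exception) else {"error": str(out), "issues": []})
        for name, out in zip(agent_names, outcomes)
    }

results = asyncio.run(review_in_parallel(
    "def f(): pass",
    ["style", "security", "performance", "testing"],
))
print(sorted(results))  # ['performance', 'security', 'style', 'testing']
```

All four reviews overlap, so total latency is roughly that of the slowest agent rather than the sum of all four.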
Building observability into multi-agent systems from the start pays huge dividends. You need to understand what each agent is doing, what decisions it made, and why. Distributed tracing with trace IDs that flow through all agent interactions is invaluable. When something goes wrong across multiple agents, you can follow the trace ID and understand the full sequence of events.
The debugging experience is fundamentally different for multi-agent systems. With single-agent systems, you can replay execution and step through it. With multi-agent systems, you need to understand the interactions between agents. Did Agent A wait for Agent B? How long did the wait take? Did Agent B return an error that Agent A didn't handle? Answering these questions means reading multiple agents' logs side by side.
Load balancing across multiple agents becomes a consideration when you have redundancy. If you have multiple instances of the same agent running concurrently, you need to distribute work among them fairly. You might use round-robin, or you might use more sophisticated strategies based on current load. The choice affects both performance and cost.
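Both strategies fit in a few lines. The instance names and load numbers below are illustrative, assuming an in-flight-task count is available per instance.

```python
import itertools

instances = ["agent_1", "agent_2", "agent_3"]

# Round-robin: cycle through instances in order, ignoring load.
rr = itertools.cycle(instances)
rotation = [next(rr) for _ in range(4)]
print(rotation)  # ['agent_1', 'agent_2', 'agent_3', 'agent_1']

# Least-loaded: route to whichever instance has the fewest in-flight tasks.
in_flight = {"agent_1": 3, "agent_2": 0, "agent_3": 1}

def pick_least_loaded(loads: dict) -> str:
    return min(loads, key=loads.get)

print(pick_least_loaded(in_flight))  # agent_2
```

Round-robin is trivially cheap and fair over time; least-loaded reacts to slow instances but requires accurate, current load data, which is itself a coordination cost.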
Versioning of agents is another practical concern. When you update an agent's logic, do all instances immediately switch to the new version? Or do you roll out gradually? Do you run multiple versions concurrently during transition? Each choice has implications for consistency and testing. Some systems maintain backward compatibility in agent protocols so old and new versions can interoperate. Others implement explicit version negotiation where agents discover each other's versions and adjust their communication accordingly.
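Explicit version negotiation can be as simple as intersecting supported-version sets and taking the highest common one. This is a minimal sketch; real protocols would negotiate feature sets, not just numbers.

```python
def negotiate_version(a_versions: set, b_versions: set):
    """Settle on the highest protocol version both agents support, or None."""
    common = a_versions & b_versions
    return max(common) if common else None

agent_a_supports = {1, 2, 3}
agent_b_supports = {2, 3, 4}
print(negotiate_version(agent_a_supports, agent_b_supports))  # 3

# Incompatible agents get None -- the caller must fall back or refuse the pairing.
print(negotiate_version({1}, {2}))  # None
```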
The human element in agent systems is often underestimated. Humans need to understand what agents are doing and why. If agents are acting as black boxes, humans can't trust them. Comprehensive logging and visualization of agent behavior is essential for building confidence in agent systems. Some teams create dashboards that show what agents are thinking, what decisions they made, and what actions they took. This visibility enables human oversight and trust.
Key Takeaways
Multi-agent systems aren't just about throwing multiple LLMs at a problem. Real production systems require careful thought about:
- Coordination model: Orchestration for structured work, routing for flexible dispatch, pub/sub for loose coupling
- Failure handling: Circuit breakers, timeouts, graceful degradation
- State consistency: Event sourcing, immutable logs, or explicit locking
- Observability: Log everything, measure latencies, track patterns
- Scaling: Partition agents, go async, enforce deadlines
The teams shipping production multi-agent systems didn't start with perfect architecture. They started simple (single orchestrator), measured, hit problems, and evolved. Your job is to anticipate the problems and build with them in mind from day one.
Start with orchestration. Add direct communication when you need speed. Move to pub/sub when you need loose coupling. Measure constantly. That's how production systems get built.
The organizational structure of teams building multi-agent systems matters. If different teams own different agents, you need clear contracts between them. Agent A needs to know what messages Agent B expects and what it will return. This requires documentation and possibly versioning of agent interfaces. Some teams implement agent registries where agents discover each other and negotiate compatibility. This makes the system more flexible but more complex.
Resilience patterns for multi-agent systems go beyond simple retries. You need to think about fallback agents. If Agent A fails, can Agent B do the same work? You need to think about partial failures. If Agent A succeeds for some requests but fails for others, how do you route appropriately? You need to think about cascading failures. Does one agent's failure cause others to fail? These patterns are well-studied in distributed systems; many apply directly to agent systems.
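The fallback-agent idea can be sketched as a simple chain: try the primary, and on failure move to the next agent. The agent callables here are stubs; real ones would wrap LLM or tool calls.

```python
def flaky_primary(task: str) -> str:
    raise RuntimeError("primary unavailable")

def steady_fallback(task: str) -> str:
    return f"fallback handled: {task}"

def run_with_fallbacks(task: str, agents: list) -> str:
    """Try each agent in order; raise only if the whole chain fails."""
    last_error = None
    for agent in agents:
        try:
            return agent(task)
        except Exception as e:
            last_error = e  # remember why, then try the next agent
    raise RuntimeError(f"all agents failed: {last_error}")

result = run_with_fallbacks("summarize report", [flaky_primary, steady_fallback])
print(result)  # fallback handled: summarize report
```

Note the fallback need not be an identical agent: a cheaper model or a rule-based stub that returns a degraded-but-useful answer often beats an outright failure.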
The testing of multi-agent systems benefits from simulation frameworks. Instead of running real LLMs, run mock agents with deterministic behavior. This lets you test agent orchestration at scale, testing what happens when you have hundreds or thousands of agents. You can test failure scenarios: what if agent A always fails? What if agent B is very slow? What if network latency is high? Simulation provides a safe sandbox for exploring how the system behaves under various conditions.
Metrics for agent coordination go beyond individual agent metrics. You need to understand coordination efficiency. How much time do agents spend waiting for each other? How much time do they spend in actual work? High wait times indicate coordination bottlenecks. You need to understand message volume between agents. High message volume indicates tight coupling or inefficient communication patterns. These metrics guide optimization efforts.
The future of multi-agent systems will likely involve more autonomous coordination with less explicit orchestration. Agents might negotiate with each other to decide the best way to solve a problem. Agents might dynamically form teams to work on novel problems. Agents might learn from experience what coordination patterns work best. This requires agents to have more autonomy but more potential for chaos and mistakes. Finding the right balance between structure and autonomy is an ongoing challenge.
One pattern that's particularly valuable is the observer pattern applied to agents. Agent A doesn't directly ask Agent B what it should do. Instead, Agent A announces what it's doing, and Agent B listens and responds if needed. This loose coupling enables scaling: you can add more agents without changing the communication patterns. The tradeoff is that the system becomes less deterministic. You can't easily predict what will happen when a new agent is added.
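A minimal event-bus sketch of this observer pattern, assuming illustrative topic names and handlers:

```python
from collections import defaultdict

class EventBus:
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic: str, handler):
        self.subscribers[topic].append(handler)

    def announce(self, topic: str, payload: dict):
        # The announcer doesn't know who's listening -- that's the loose coupling.
        for handler in self.subscribers[topic]:
            handler(payload)

bus = EventBus()
received = []
bus.subscribe("code_committed", lambda p: received.append(("test_agent", p)))
bus.subscribe("code_committed", lambda p: received.append(("doc_agent", p)))

bus.announce("code_committed", {"repo": "payments", "commit": "abc123"})
print(len(received))  # 2 -- both listeners reacted, with no change to the announcer
```

Adding a third listener requires no change to the announcing agent, which is exactly the scaling property the pattern buys; the tradeoff, as noted, is that the set of reactions to any announcement is no longer obvious from the announcer's code.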
Resource allocation across agents becomes a coordination challenge at scale. When you have limited computational resources and many agents competing for them, who gets to run? Some teams implement priority-based queuing where high-priority agents run first. Others implement fairness algorithms that ensure every agent gets a chance to run. The choice affects both performance and user satisfaction.
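Priority-based queuing maps naturally onto a heap. A sketch with `heapq`, using the convention that a lower number means higher priority and a monotonic counter to break ties in submission order:

```python
import heapq

class AgentScheduler:
    def __init__(self):
        self.queue = []
        self.counter = 0  # tie-breaker: equal priorities run in submission order

    def submit(self, priority: int, agent_name: str, task: str):
        heapq.heappush(self.queue, (priority, self.counter, agent_name, task))
        self.counter += 1

    def next_to_run(self):
        priority, _, agent_name, task = heapq.heappop(self.queue)
        return agent_name, task

sched = AgentScheduler()
sched.submit(priority=5, agent_name="batch_agent", task="nightly report")
sched.submit(priority=1, agent_name="user_facing_agent", task="answer question")

first = sched.next_to_run()
print(first)  # ('user_facing_agent', 'answer question')
```

Pure priority queues can starve low-priority agents indefinitely; fairness schemes typically age tasks (boost priority the longer they wait) to guarantee everyone eventually runs.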
The concept of agent health is important in multi-agent systems. An agent might be slow, unreliable, or producing poor-quality results. Rather than failing fast, you might want to gradually reduce reliance on an unhealthy agent while still giving it a chance to recover. This is similar to circuit breaker patterns in distributed systems. You continuously monitor agent health and adjust routing accordingly.
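One way to sketch this gradual approach is an exponentially weighted success rate per agent, with routing weight reduced (but never zeroed) for unhealthy agents so they can recover. The smoothing factor and floor value are illustrative choices.

```python
class AgentHealth:
    def __init__(self, alpha: float = 0.3):
        self.alpha = alpha  # how fast the score reacts to new outcomes
        self.scores = {}    # agent_name -> smoothed success rate in [0, 1]

    def record(self, agent_name: str, success: bool):
        prev = self.scores.get(agent_name, 1.0)  # assume healthy until proven otherwise
        self.scores[agent_name] = (1 - self.alpha) * prev + self.alpha * (1.0 if success else 0.0)

    def routing_weight(self, agent_name: str) -> float:
        # Keep a small floor so an unhealthy agent still sees traffic and can recover.
        return max(self.scores.get(agent_name, 1.0), 0.05)

health = AgentHealth()
for ok in [True, False, False, False]:
    health.record("summarizer", ok)

print(round(health.scores["summarizer"], 3))  # 0.343
```

Unlike a binary circuit breaker, the score decays smoothly, so routing shifts away from a degrading agent before it fails outright and shifts back as successes accumulate.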
One challenge that arises in heterogeneous multi-agent systems is that agents might have different assumptions about data formats, error handling, or protocol details. This can lead to subtle bugs where Agent A sends a message it thinks is valid, but Agent B can't parse it. Strict type checking and schema validation help but can feel bureaucratic. Finding the right balance between flexibility and safety is an ongoing challenge.
The orchestration of agents across multiple machines introduces additional complexity. You need to handle network partitions where agents can't communicate. You need to handle machine failures where an agent suddenly disappears. You need to handle load distribution where agents are placed on different machines to balance load. These distributed systems challenges apply directly to multi-agent systems.
The audit trail of a multi-agent system is important for accountability. Who made which decision? Which agent recommended a particular action? These questions are important for legal compliance and incident investigation. Your logging and tracing infrastructure needs to capture this information in a way that's queryable and auditable.