Agent Patterns That Actually Work in Production
Lessons learned from building production AI agents: what works, what doesn't, and why most agent frameworks miss the mark on real-world complexity.
After building a dozen AI agent systems that handle real money, real compliance requirements, and real user frustration, I've learned that most agent frameworks solve the wrong problems.
The blog posts show autonomous agents booking flights and ordering groceries. The reality is messier: agents that need to work within existing systems, handle edge cases gracefully, and maintain audit trails that satisfy both regulators and angry users.
Here are the patterns that actually work.
The Multi-Agent Myth
Common Wisdom: Break complex tasks into multiple specialized agents that collaborate.
Reality Check: Agent-to-agent communication is where things break down.
Most production "multi-agent" systems are actually:
- One orchestrator agent that's really just a state machine
- Multiple specialized functions that happen to use LLM calls
- A lot of error handling code
Every agent boundary is a point of failure. Message passing, state synchronization, and error recovery across agents add complexity faster than they add capability.
What Works Instead: The Single Agent Pattern
class ProductionAgent:
    def __init__(self):
        self.tools = {
            'document_processor': DocumentProcessor(),
            'policy_checker': PolicyEngine(),
            'report_generator': ReportGenerator()
        }
        self.state_machine = StateMachine()

    def execute_workflow(self, task):
        # Single agent with multiple tools
        # Clear state transitions
        # Centralized error handling
        # Audit trail in one place
        pass
Why it works:
- Single point of failure (which you can actually debug)
- Shared context across all operations
- Simpler error recovery
- Easier to audit and test
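To make that concrete, here's a minimal, self-contained sketch of the pattern: one agent, a handful of tools as plain callables, and a dict-based state machine with a single audit trail. The state names, tool stubs, and claim example are illustrative, not from any particular framework.

from enum import Enum, auto


class State(Enum):
    INTAKE = auto()
    POLICY_CHECK = auto()
    REPORT = auto()
    DONE = auto()
    FAILED = auto()


class SingleAgentWorkflow:
    """One agent, several tools, one audit trail."""

    def __init__(self, tools):
        self.tools = tools          # plain callables keyed by name
        self.audit_trail = []       # every transition recorded in one place
        self.state = State.INTAKE

    def execute_workflow(self, task):
        transitions = {
            State.INTAKE: ('intake', State.POLICY_CHECK),
            State.POLICY_CHECK: ('policy', State.REPORT),
            State.REPORT: ('report', State.DONE),
        }
        context = {'task': task}
        while self.state not in (State.DONE, State.FAILED):
            tool_name, next_state = transitions[self.state]
            try:
                context = self.tools[tool_name](context)
                self.audit_trail.append((self.state.name, tool_name, 'ok'))
                self.state = next_state
            except Exception as exc:    # centralized error handling
                self.audit_trail.append((self.state.name, tool_name, f'error: {exc}'))
                self.state = State.FAILED
        return self.state, context


# Usage (tool stubs are placeholders):
agent = SingleAgentWorkflow({
    'intake': lambda ctx: {**ctx, 'documents': ['claim.pdf']},
    'policy': lambda ctx: {**ctx, 'policy_ok': True},
    'report': lambda ctx: {**ctx, 'report': 'one-page summary'},
})
final_state, result = agent.execute_workflow({'claim_id': 42})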
The Planning Fallacy
Common Wisdom: Agents should plan multi-step workflows before execution.
Reality Check: Plans don't survive contact with production systems.
I've seen agents generate beautiful 10-step plans that fail on step 2 because:
- External API changed response format
- User provided incomplete information
- Database timeout occurred
- Regulatory requirement changed mid-process
What Works Instead: The Adaptive Execution Pattern
class AdaptiveAgent:
    def execute(self, goal):
        while not self.is_complete(goal):
            # Assess current situation
            context = self.gather_context()

            # Plan only the next immediate step
            next_action = self.decide_next_action(context, goal)

            # Execute with error handling
            result = self.execute_with_recovery(next_action)

            # Adapt based on result
            if result.failed:
                goal = self.adjust_goal(goal, result.error)

            self.update_state(result)
Key insight: Successful agents are reactive, not predictive. They respond to what actually happens rather than what should happen.
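The AdaptiveAgent above leans on execute_with_recovery and adjust_goal without showing them. Here's one plausible shape for both, written as standalone functions; the specific exception types and the "missing field" check are stand-ins for whatever your tools actually throw.

from dataclasses import dataclass, field


@dataclass
class StepResult:
    failed: bool
    error: str = ''
    output: dict = field(default_factory=dict)


def execute_with_recovery(action, attempts=2):
    """Run one step, catching failures so the outer loop can adapt."""
    last_error = ''
    for attempt in range(attempts):
        try:
            return StepResult(failed=False, output=action())
        except TimeoutError as exc:      # transient: worth one retry
            last_error = str(exc)
        except ValueError as exc:        # bad input: retrying won't help
            return StepResult(failed=True, error=str(exc))
    return StepResult(failed=True, error=last_error)


def adjust_goal(goal, error):
    """Shrink the goal instead of abandoning it, e.g. pause for user input
    rather than failing the whole workflow."""
    if 'missing field' in error:
        return {**goal, 'pending_user_input': error}
    return {**goal, 'degraded': True}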
The Tool Integration Reality
Common Wisdom: Give agents access to APIs and they'll figure out how to use them.
Reality Check: Production systems have authentication, rate limits, error conditions, and undocumented behaviors that LLMs can't reason about effectively.
The Wrapper Pattern That Works
class ReliableToolWrapper:
    def __init__(self, api_client):
        self.client = api_client
        self.circuit_breaker = CircuitBreaker()
        self.retry_policy = ExponentialBackoff()

    def execute_with_context(self, action, context):
        """
        Handles all the production concerns:
        - Authentication refresh
        - Rate limiting
        - Error classification
        - Retry logic
        - Fallback strategies
        - Audit logging
        """
        with self.circuit_breaker:
            return self.retry_policy.execute(
                lambda: self._execute_safely(action, context)
            )

    def _execute_safely(self, action, context):
        # The actual API call with all error handling
        pass
The wrapper does what LLMs can't:
- Handle authentication token refresh
- Implement exponential backoff
- Classify errors (retry vs. fail fast)
- Maintain rate limit budgets
- Provide consistent error messages
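As a sketch of the error-classification and backoff pieces specifically, here's a minimal retry helper. The split between retryable and fatal exception types is illustrative; in practice you'd classify based on your API's actual error codes.

import random
import time

RETRYABLE = (TimeoutError, ConnectionError)   # transient: back off and retry
FATAL = (PermissionError, ValueError)         # wrong request: fail fast


def call_with_backoff(api_call, max_attempts=4, base_delay=0.5):
    """Classify errors and apply exponential backoff with jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return api_call()
        except FATAL:
            raise                             # retrying won't fix a bad request
        except RETRYABLE:
            if attempt == max_attempts:
                raise
            delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.1)
            time.sleep(delay)                 # give the upstream system room to recover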
The State Management Problem
Common Wisdom: Agents should be stateless for scalability.
Reality Check: Real workflows have state that matters: user context, partial results, approval chains, regulatory checkpoints.
The Persistent Context Pattern
class StatefulAgent:
    def __init__(self, workflow_id):
        self.context = WorkflowContext.load(workflow_id)
        self.checkpoint_manager = CheckpointManager()

    def execute_step(self, step):
        # Save state before risky operations
        checkpoint = self.checkpoint_manager.create(self.context)

        try:
            result = self.execute_with_tools(step)
            self.context.update(result)
            self.checkpoint_manager.commit(checkpoint)
        except RecoverableError as e:
            # Rollback to checkpoint
            self.context = self.checkpoint_manager.restore(checkpoint)
            raise RetryWithContext(e, self.context)
        except FatalError as e:
            # Save failure context for human review
            self.context.mark_failed(e)
            self.checkpoint_manager.save_failure_state(self.context)
            raise
Why state persistence matters:
- User can resume interrupted workflows
- Audit requirements need complete history
- Error recovery can restart from checkpoints
- Compliance reviews need full context
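The StatefulAgent above assumes a CheckpointManager exists. A bare-bones stand-in might look like this: in-memory snapshots plus a JSON file per committed checkpoint, assuming the workflow context is a plain JSON-serializable dict. The names and storage layout are illustrative.

import copy
import json
from pathlib import Path


class SimpleCheckpointManager:
    """Keeps an in-memory snapshot plus a JSON file per checkpoint so an
    interrupted workflow can be resumed or audited later."""

    def __init__(self, workflow_id, directory='checkpoints'):
        self.dir = Path(directory) / workflow_id
        self.dir.mkdir(parents=True, exist_ok=True)
        self._snapshots = {}
        self._counter = 0

    def create(self, context):
        self._counter += 1
        self._snapshots[self._counter] = copy.deepcopy(context)
        return self._counter

    def commit(self, checkpoint_id):
        # Persist only after the risky step succeeds
        path = self.dir / f'{checkpoint_id}.json'
        path.write_text(json.dumps(self._snapshots[checkpoint_id]))

    def restore(self, checkpoint_id):
        return copy.deepcopy(self._snapshots[checkpoint_id])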
The Human-in-the-Loop Reality
Common Wisdom: Agents should be fully autonomous.
Reality Check: Production systems need human oversight, approval workflows, and escalation paths.
The most successful agents I've built have clear handoff patterns:
The Escalation Pattern
class HumanIntegrationAgent:
    def __init__(self):
        self.confidence_threshold = 0.8
        self.approval_required_keywords = ['payment', 'delete', 'approve']

    def execute_with_oversight(self, task):
        plan = self.generate_plan(task)

        # Check if human approval needed
        if (plan.confidence < self.confidence_threshold or
                self.requires_approval(plan)):
            approval_request = self.create_approval_request(plan)
            return self.wait_for_human_approval(approval_request)

        # Execute autonomously with monitoring
        return self.execute_with_monitoring(plan)

    def requires_approval(self, plan):
        return any(keyword in plan.description.lower()
                   for keyword in self.approval_required_keywords)
The pattern works because:
- Agents handle routine cases autonomously
- Humans review edge cases and high-stakes decisions
- Clear escalation criteria prevent both micro-management and disasters
- Audit trail shows both agent reasoning and human oversight
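For the handoff itself, here's a minimal sketch of an approval queue: the agent files a request with its reasoning and confidence, a human records the decision, and both land in the same audit log. The request fields are illustrative, not a specific product's API.

import uuid
from datetime import datetime, timezone


class ApprovalQueue:
    """Holds pending approval requests until a reviewer decides; both the
    agent's reasoning and the human decision end up in one audit log."""

    def __init__(self):
        self.pending = {}
        self.audit_log = []

    def create_request(self, plan_description, confidence):
        request_id = str(uuid.uuid4())
        self.pending[request_id] = {
            'description': plan_description,
            'confidence': confidence,
            'created_at': datetime.now(timezone.utc).isoformat(),
        }
        return request_id

    def decide(self, request_id, reviewer, approved, reason=''):
        request = self.pending.pop(request_id)
        self.audit_log.append({**request, 'reviewer': reviewer,
                               'approved': approved, 'reason': reason})
        return approved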
The Monitoring & Observability Gap
Common Wisdom: Agent frameworks will provide built-in monitoring.
Reality Check: You need custom observability for production agent systems.
What to Monitor
class AgentObservability:
    def __init__(self):
        self.metrics = {
            'task_completion_rate': TaskCompletionMetric(),
            'error_rate_by_type': ErrorClassificationMetric(),
            'human_escalation_rate': EscalationMetric(),
            'cost_per_task': CostTrackingMetric(),
            'user_satisfaction': SatisfactionMetric()
        }

    def track_execution(self, agent_execution):
        with self.trace_context():
            # Trace every LLM call with cost
            # Monitor tool execution times
            # Track state transitions
            # Log confidence scores
            # Measure end-to-end latency
            pass
Key metrics for production agents:
- Task Success Rate: Not just "didn't crash" but "achieved user goal"
- Error Classification: Distinguish agent errors from system errors from user errors
- Cost per Task: LLM costs add up fast in production
- Human Escalation Rate: Are agents handling appropriate complexity?
- User Satisfaction: The ultimate measure of agent effectiveness
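Cost per task is the metric teams most often skip, so here's a small, hypothetical tracker that attributes LLM spend to individual tasks. The model names and per-1K-token prices are placeholders; substitute your provider's real rates.

from collections import defaultdict


class TaskCostTracker:
    """Attributes LLM spend to individual tasks so cost-per-task reports
    come straight from the agent's own logs."""

    # Illustrative per-1K-token prices; plug in your provider's real rates.
    PRICES = {'small-model': 0.0005, 'large-model': 0.01}

    def __init__(self):
        self.cost_by_task = defaultdict(float)

    def record_llm_call(self, task_id, model, prompt_tokens, completion_tokens):
        tokens = prompt_tokens + completion_tokens
        cost = tokens / 1000 * self.PRICES[model]
        self.cost_by_task[task_id] += cost
        return cost

    def cost_per_task(self):
        return dict(self.cost_by_task)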
The Security Model Most Miss
Common Wisdom: Run agents in sandboxed environments.
Reality Check: Agents need access to real systems with real permissions, making security complex.
The Principle of Least Privilege for Agents
class SecureAgentContext:
    def __init__(self, user_id, task_type):
        # Dynamic permissions based on task and user
        self.permissions = PermissionManager.get_agent_permissions(
            user_id=user_id,
            task_type=task_type,
            time_limit=timedelta(hours=1)
        )

        # Audit every permission use
        self.audit_logger = AuditLogger(user_id, task_type)

    def execute_with_permissions(self, action):
        if not self.permissions.allows(action):
            self.audit_logger.log_denied_action(action)
            raise PermissionDenied(action)

        self.audit_logger.log_permitted_action(action)
        return self.execute(action)
Security principles that work:
- Agents inherit user permissions, not system permissions
- Time-bounded access tokens
- Comprehensive audit logging
- Explicit deny-by-default policies
- Regular permission reviews
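As one way the permissions object above could implement allows(), here's a deny-by-default, time-bounded grant. The action strings and the one-hour window are illustrative.

from datetime import datetime, timedelta, timezone


class TimeBoundedPermissions:
    """Deny-by-default grant that expires: the agent only gets the actions
    explicitly listed for this user and task, and only for a limited window."""

    def __init__(self, allowed_actions, time_limit=timedelta(hours=1)):
        self.allowed_actions = set(allowed_actions)
        self.expires_at = datetime.now(timezone.utc) + time_limit

    def allows(self, action):
        if datetime.now(timezone.utc) >= self.expires_at:
            return False        # expired grants deny everything
        return action in self.allowed_actions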
The Deployment Patterns
Common Wisdom: Deploy agents like any other service.
Reality Check: Agents have different failure modes and operational needs.
The Circuit Breaker Pattern for LLM Costs
class CostAwareAgent:
    def __init__(self):
        self.cost_tracker = CostTracker()
        self.circuit_breaker = CostCircuitBreaker(
            cost_threshold_per_hour=100.00,
            error_rate_threshold=0.1
        )

    def execute_with_cost_control(self, task):
        with self.circuit_breaker:
            estimated_cost = self.estimate_task_cost(task)

            if not self.cost_tracker.can_afford(estimated_cost):
                raise CostBudgetExceeded(estimated_cost)

            return self.execute(task)
Operational patterns that work:
- Cost circuit breakers prevent runaway LLM bills
- Gradual rollout with synthetic tasks
- A/B testing between agent and human workflows
- Detailed cost attribution per user/department
- Automated rollback on quality degradation
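And for the cost circuit breaker specifically, here's a rough sketch of the tripping logic: block execution once hourly spend or the recent error rate crosses a threshold. The thresholds and one-hour window are placeholders, not a reference implementation.

import time


class CostCircuitBreakerSketch:
    """Opens (blocks execution) when spend in the last hour or the recent
    error rate crosses a threshold."""

    def __init__(self, cost_threshold_per_hour=100.0, error_rate_threshold=0.1):
        self.cost_threshold = cost_threshold_per_hour
        self.error_rate_threshold = error_rate_threshold
        self.events = []    # (timestamp, cost, failed)

    def record(self, cost, failed=False):
        self.events.append((time.time(), cost, failed))

    def is_open(self):
        cutoff = time.time() - 3600
        recent = [e for e in self.events if e[0] >= cutoff]
        if not recent:
            return False
        hourly_cost = sum(cost for _, cost, _ in recent)
        error_rate = sum(1 for *_, failed in recent if failed) / len(recent)
        return hourly_cost > self.cost_threshold or error_rate > self.error_rate_threshold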
What Actually Works in Production
After building agents that handle millions in transactions and pass government audits, here's what I've learned works:
- Single Agent with Multiple Tools beats multi-agent complexity
- Reactive Execution beats elaborate planning
- Explicit Human Handoffs beat full autonomy
- Custom Monitoring beats framework promises
- User Permission Models beat agent permission models
- Cost Controls beat unlimited LLM access
The most successful production agents I've built look boring: they're essentially smart state machines with LLM reasoning, robust error handling, and clear escalation paths.
They don't book your flights autonomously. They do eliminate 80% of routine work while keeping humans in the loop for everything that matters.
Building production AI agents? Focus on reliability patterns first, AI capabilities second. Your users will thank you when their agents actually work on Tuesday morning.
I'm always interested in comparing notes on production agent patterns. Reach out if you're building something real.