Session 1: From Prototype to Production
Build production-ready multi-agent workflows with error handling, monitoring, and cost tracking
Learning Objectives
By the end of this session, you will be able to:
- Understand the critical differences between prototype and production AI agents
- Build multi-agent workflows with LangGraph (researcher → analyst → writer pattern)
- Implement production error handling with retries, timeouts, and graceful degradation
- Track costs and performance metrics for LLM-based workflows
- Design for scale, reliability, and observability in enterprise environments
Market Overview: The $236B Opportunity
The Numbers
$5.9B → $236B market (44% CAGR) by 2030 | 85% enterprise AI adoption expected by 2025 | 52% of organizations stuck in pilot phase | Average $200K-400K annual savings per workflow
The Production Gap Problem
52% Stuck in Pilot Phase
Most organizations build impressive prototypes but fail to deploy them to production. Why? Lack of error handling, no cost controls, missing monitoring, and unreliable performance. This course bridges that gap.
ROI: $200K-400K Annual Savings
Production-grade agent workflows automate repetitive knowledge work: research, analysis, report generation, data processing. Companies report 40-60% reduction in manual labor costs and 3x faster delivery times.
Enterprise Requirements
Production systems need: 99.9% uptime, comprehensive logging, cost tracking, security compliance, graceful failure handling, and real-time monitoring. Prototypes have none of these.
Real-World Impact
Case Study: Financial Research Automation
A hedge fund deployed a multi-agent research workflow (data gatherer → analyst → report writer) that processes 500+ company filings daily. Result: 75% reduction in analyst time, $350K annual savings, and insights delivered 24 hours faster.
Technology Stack
```
┌─────────────────────────────────────┐
│          Monitoring Layer           │
│     (Prometheus, Grafana, Logs)     │
└─────────────────────────────────────┘
                   ↓
┌─────────────────────────────────────┐
│   Orchestration Layer (LangGraph)   │
│  - Workflow State Management        │
│  - Agent Coordination               │
│  - Error Recovery                   │
└─────────────────────────────────────┘
                   ↓
┌─────────────────────────────────────┐
│  Agent Layer (Multi-Agent System)   │
│    Researcher → Analyst → Writer    │
└─────────────────────────────────────┘
                   ↓
┌─────────────────────────────────────┐
│    LLM Layer (Claude 3.5 Sonnet)    │
│  - Reasoning & Decisions            │
│  - Content Generation               │
└─────────────────────────────────────┘
                   ↓
┌─────────────────────────────────────┐
│         Data Sources & APIs         │
│ (Search, Databases, External APIs)  │
└─────────────────────────────────────┘
```
Key Technologies
- Python 3.11+: Async/await for concurrent agent execution
- LangGraph: Production-grade workflow orchestration with state persistence
- Claude 3.5 Sonnet: High-quality reasoning for complex tasks
- Structured Logging: JSON logs for observability and debugging (a minimal formatter sketch follows this list)
- Prometheus/Grafana: Real-time monitoring and alerting
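Structured logging is worth seeing concretely before the exercise. The sketch below is a minimal illustration of one-JSON-line-per-event logging using Python's standard logging module; the class and field names are our own, and the session's structured_logging.py may organize this differently.

```python
# Minimal structured-logging sketch (illustrative names, not the course file).
import json
import logging
import time

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line."""
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": time.time(),
            "level": record.levelname,
            "event": record.getMessage(),
        }
        # Copy optional fields passed via logger.info(..., extra={...})
        for key in ("agent", "duration_s", "status"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)

logger = logging.getLogger("workflow")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# One machine-parsable event per agent step:
logger.info("agent_completed", extra={"agent": "researcher", "duration_s": 3.2, "status": "ok"})
```

Each agent step then emits a single line that log pipelines and monitoring tools can ingest without custom parsing.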
What You'll Build
By the end of this session, you'll have:
- ✓ Multi-agent research workflow (3 coordinated agents)
- ✓ Production error handling (retries, timeouts, circuit breakers)
- ✓ Structured JSON logging for debugging
- ✓ Cost tracking per agent execution
- ✓ Performance monitoring dashboard
Starter Code vs Solution: Side-by-Side Comparison
01-starter-agent-workflow.py
```python
"""
Multi-Agent Workflow Starter Code
Session 1: From Prototype to Production

Complete the TODO sections to build a production-ready
multi-agent research workflow.
"""
import os
import json
import time
from typing import Dict, List, TypedDict
from dataclasses import dataclass

import anthropic
from langgraph.graph import StateGraph, END


# Configuration
class Config:
    ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY")
    MODEL = "claude-3-5-sonnet-20241022"
    MAX_TOKENS = 2048


# Workflow State
class WorkflowState(TypedDict):
    topic: str
    research_data: str
    analysis: str
    final_report: str
    error: str
    metadata: Dict


# Agent Classes
class ResearchAgent:
    def __init__(self, config: Config):
        self.config = config
        self.client = anthropic.Anthropic(
            api_key=config.ANTHROPIC_API_KEY
        )

    def research(self, topic: str) -> str:
        """
        TODO: Implement research agent
        Steps:
        1. Create prompt asking Claude to research topic
        2. Call Claude API
        3. Return research findings
        4. Add error handling (try/except)
        """
        # TODO: Implement research logic
        return "TODO: Implement research"


class AnalystAgent:
    def __init__(self, config: Config):
        self.config = config
        self.client = anthropic.Anthropic(
            api_key=config.ANTHROPIC_API_KEY
        )

    def analyze(self, research_data: str) -> str:
        """
        TODO: Implement analyst agent
        Takes research data and performs analysis
        """
        # TODO: Implement analysis logic
        return "TODO: Implement analysis"


class WriterAgent:
    def __init__(self, config: Config):
        self.config = config
        self.client = anthropic.Anthropic(
            api_key=config.ANTHROPIC_API_KEY
        )

    def write_report(self, analysis: str) -> str:
        """
        TODO: Implement writer agent
        Takes analysis and creates final report
        """
        # TODO: Implement writing logic
        return "TODO: Implement report writing"


# Workflow Nodes
def research_node(state: WorkflowState) -> WorkflowState:
    """
    TODO: Implement research workflow node
    Steps:
    1. Extract topic from state
    2. Call ResearchAgent
    3. Update state with results
    4. Handle errors gracefully
    """
    # TODO: Implement node logic
    return state


def analysis_node(state: WorkflowState) -> WorkflowState:
    """TODO: Implement analysis node"""
    # TODO: Implement node logic
    return state


def writing_node(state: WorkflowState) -> WorkflowState:
    """TODO: Implement writing node"""
    # TODO: Implement node logic
    return state


# Build Workflow Graph
def create_workflow():
    """
    TODO: Build LangGraph workflow
    Flow: research → analysis → writing → END
    """
    workflow = StateGraph(WorkflowState)

    # TODO: Add nodes
    # workflow.add_node("research", research_node)
    # workflow.add_node("analysis", analysis_node)
    # workflow.add_node("writing", writing_node)

    # TODO: Add edges
    # workflow.set_entry_point("research")
    # workflow.add_edge("research", "analysis")
    # workflow.add_edge("analysis", "writing")
    # workflow.add_edge("writing", END)

    return workflow.compile()


# Main Execution
if __name__ == "__main__":
    # Create workflow
    workflow = create_workflow()

    # Test input
    initial_state = {
        "topic": "Impact of AI on software development",
        "research_data": "",
        "analysis": "",
        "final_report": "",
        "error": "",
        "metadata": {}
    }

    # Run workflow
    # TODO: Execute workflow and print results
    print("Workflow execution: TODO")
```
Cost Tracking Dashboard
Monitor and optimize your agent workflow costs in real-time.
cost_tracker.py (excerpt)
```python
def calculate_workflow_costs(
    workflow_runs: int,
    avg_input_tokens: int,
    avg_output_tokens: int,
    agents_per_run: int = 3
):
    """Calculate total workflow costs at scale"""
    # Claude 3.5 Sonnet pricing (per 1M tokens)
    COST_INPUT = 0.003 / 1000   # $3 per 1M input tokens
    COST_OUTPUT = 0.015 / 1000  # $15 per 1M output tokens

    total_input = workflow_runs * agents_per_run * avg_input_tokens
    total_output = workflow_runs * agents_per_run * avg_output_tokens

    cost_input = total_input * COST_INPUT
    cost_output = total_output * COST_OUTPUT
    total_cost = cost_input + cost_output

    return {
        "workflow_runs": workflow_runs,
        "total_input_tokens": total_input,
        "total_output_tokens": total_output,
        "cost_input_usd": round(cost_input, 2),
        "cost_output_usd": round(cost_output, 2),
        "total_cost_usd": round(total_cost, 2),
        "cost_per_run": round(total_cost / workflow_runs, 4)
    }


# Example: 1000 workflow runs
result = calculate_workflow_costs(
    workflow_runs=1000,
    avg_input_tokens=2000,   # per agent
    avg_output_tokens=1500   # per agent
)
print(f"Total cost for 1000 runs: ${result['total_cost_usd']}")
print(f"Cost per run: ${result['cost_per_run']}")
```
Hands-On Exercise: Production Workflow
Prerequisites: Python 3.11+, Anthropic API key, basic understanding of async programming
Exercise Objective
Transform a prototype multi-agent workflow into a production-ready system with error handling, logging, and cost tracking.
Steps
- Error Handling (15 min) - Add try/except blocks, retry logic with exponential backoff, and timeout handling (a retry and timeout sketch follows this list)
- Structured Logging (10 min) - Implement JSON logging for all agent operations and workflow steps
- Cost Tracking (10 min) - Track input/output tokens and calculate costs per agent and per workflow
- Performance Monitoring (10 min) - Add timing metrics and success/failure tracking
- Testing (10 min) - Run 5 workflows with different topics and verify all metrics
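A hedged sketch of step 1 is shown below. The helper name, retry constants, and the decision to re-raise after the final attempt are our own choices; the exception types and backoff formula match the ones referenced later in this session.

```python
# Sketch: retry transient API failures with exponential backoff and a per-call timeout.
import json
import time

import anthropic

MAX_RETRIES = 3
RETRY_DELAY = 2  # seconds; doubled on every failed attempt

def call_with_retries(client: anthropic.Anthropic, **request_kwargs):
    """Call messages.create, retrying transient failures with exponential backoff."""
    for attempt in range(MAX_RETRIES):
        try:
            # timeout keeps a single slow request from hanging the whole workflow
            return client.messages.create(timeout=30, **request_kwargs)
        except (anthropic.APITimeoutError, anthropic.RateLimitError) as exc:
            if attempt == MAX_RETRIES - 1:
                raise  # retries exhausted; let the workflow node decide how to degrade
            wait_time = RETRY_DELAY * (2 ** attempt)  # 2 s, then 4 s
            # Log each retry attempt as structured JSON for later analysis
            print(json.dumps({"event": "retry", "attempt": attempt + 1,
                              "wait_s": wait_time, "error": type(exc).__name__}))
            time.sleep(wait_time)
```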
Success Criteria
- Workflow handles API timeouts gracefully (no crashes)
- Retries 3 times with exponential backoff on failures
- All operations logged in structured JSON format
- Cost tracking accurate to 4 decimal places
- Complete workflow metrics report generated
Test Scenarios
```python
# Test your production workflow with these scenarios:
test_scenarios = [
    {
        "name": "Happy Path",
        "topic": "AI in healthcare diagnostics",
        "expected": "Successful completion with full report"
    },
    {
        "name": "Long Topic",
        "topic": "Comprehensive analysis of blockchain technology adoption in enterprise financial systems including security, scalability, and regulatory compliance",
        "expected": "Handle token limits gracefully"
    },
    {
        "name": "Rate Limit Simulation",
        "topic": "Run 10 workflows in parallel",
        "expected": "Retry logic handles rate limits"
    },
    {
        "name": "Cost Analysis",
        "topic": "Any topic",
        "expected": "Accurate cost breakdown per agent"
    }
]

# Verify metrics
required_metrics = [
    "total_input_tokens",
    "total_output_tokens",
    "total_cost_usd",
    "duration_seconds",
    "status",
    "agent_breakdown"
]
```
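For reference, a metrics report covering required_metrics might look like the example below. The values are illustrative and were computed with the Claude pricing used above ($3/$15 per 1M tokens); the exact report shape is up to you.

```python
# Hypothetical report shape; field names come from required_metrics, values are illustrative.
example_report = {
    "total_input_tokens": 6000,
    "total_output_tokens": 4200,
    "total_cost_usd": 0.081,
    "duration_seconds": 42.3,
    "status": "success",
    "agent_breakdown": {
        "researcher": {"input_tokens": 2100, "output_tokens": 1500, "cost_usd": 0.0288},
        "analyst":    {"input_tokens": 2000, "output_tokens": 1400, "cost_usd": 0.0270},
        "writer":     {"input_tokens": 1900, "output_tokens": 1300, "cost_usd": 0.0252},
    },
}
```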
Common Issues & Solutions
Retry Logic Not Working
Issue: Retries not triggering on errors
Solution: Ensure you're catching the correct exception types (anthropic.APITimeoutError, anthropic.RateLimitError). Use exponential backoff: wait_time = retry_delay * (2 ** attempt). Log each retry attempt for debugging.
Cost Tracking Inaccurate
Issue: Costs don't match API usage
Solution: Verify you're using response.usage.input_tokens and response.usage.output_tokens (not estimating). Check pricing constants match current rates. Accumulate costs across all retries and agents.
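A small helper in this spirit, assuming response is the object returned by client.messages.create and state is the WorkflowState dict from the starter code (the helper name and pricing constants are carried over from cost_tracker.py):

```python
def record_usage(state: dict, response) -> None:
    """Accumulate the exact token counts reported by the API into the workflow state."""
    usage = state.setdefault("metadata", {}).setdefault(
        "usage", {"input_tokens": 0, "output_tokens": 0}
    )
    usage["input_tokens"] += response.usage.input_tokens
    usage["output_tokens"] += response.usage.output_tokens
    # Same pricing as cost_tracker.py: $3 / 1M input tokens, $15 / 1M output tokens
    usage["cost_usd"] = round(
        (usage["input_tokens"] * 0.003 + usage["output_tokens"] * 0.015) / 1000, 4
    )
```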
Workflow Hangs Indefinitely
Issue: No timeout set on API calls
Solution: Always set timeout parameter in client.messages.create(timeout=30). Use asyncio.wait_for() for async calls. Implement circuit breaker pattern for repeated failures.
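If you use the async client, a bound like the following keeps a single call from stalling the event loop. This sketch assumes the SDK's AsyncAnthropic client; the wrapper name is our own.

```python
# Sketch: bounding an async agent call with asyncio.wait_for.
import asyncio

import anthropic

async def bounded_call(client: anthropic.AsyncAnthropic, **request_kwargs):
    # Cancel the request if it has not completed within 30 seconds.
    return await asyncio.wait_for(client.messages.create(**request_kwargs), timeout=30)
```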
Extension Challenges
Bonus: Add Prometheus metrics export, implement workflow state persistence (resume failed workflows), create Grafana dashboard, or add circuit breaker pattern for cascading failures.
Session 1 Quiz: Production AI Workflows
Question 1 of 10
10 points
What percentage of organizations are stuck in the pilot phase with AI agents?
Correct! 52% of organizations struggle to move AI agents from prototype to production, highlighting the critical need for production-ready skills.
Question 2 of 10
10 points
What is the average annual savings from deploying production agent workflows?
Correct! Companies report $200K-400K in annual savings by automating knowledge work with multi-agent workflows.
Question 3 of 10
10 points
Which of the following is NOT a critical production requirement?
Correct! Production systems need error handling, logging, and monitoring. UI aesthetics are secondary to reliability and observability.
Question 4 of 10
10 points
What retry strategy should you use for transient API failures?
Correct! Exponential backoff prevents overwhelming the API during issues and gives time for recovery. Wait time doubles with each retry.
Question 5 of 10
10 points
What is the purpose of structured logging in production systems?
Correct! Structured JSON logs enable automated parsing, searching, and analysis by monitoring tools like ELK stack or Datadog.
Question 6 of 10
10 points
How are Claude 3.5 Sonnet API costs calculated?
Correct! Costs are calculated per token: $3/1M input tokens and $15/1M output tokens for Claude 3.5 Sonnet.
Question 7 of 10
10 points
What is the recommended default timeout for LLM API calls?
Correct! A 30-second timeout is a good balance: long enough for complex requests, yet it prevents calls from hanging indefinitely.
Question 8 of 10
10 points
In the multi-agent workflow pattern, what is the typical flow?
Correct! The sequential pattern (research → analysis → writing) mirrors how humans work and produces higher quality results.
Question 9 of 10
10 points
What should you do when an agent step fails after max retries?
Correct! Graceful failure with detailed logging allows debugging and potential manual recovery without crashing the system.
Question 10 of 10
10 points
Which metric is MOST important for production monitoring?
Correct! Service Level Indicators (SLIs) like success rate, latency, and error rates directly measure user-facing reliability.
Homework Assignment: Production-Ready Workflow
Build a production-grade multi-agent workflow with enterprise-level reliability and monitoring.
Assignment Overview
Transform your Session 1 workflow into a production system that meets enterprise standards for error handling, observability, cost management, and recovery.
Tasks
- Part 1: Retry Logic with Exponential Backoff (25%)
- Implement retry logic for all API calls (3 attempts)
- Use exponential backoff: wait_time = 2 * (2 ** attempt)
- Handle specific exceptions: TimeoutError, RateLimitError, APIError
- Log each retry attempt with structured data
- Part 2: Structured Logging (JSON) (25%)
- Implement JSON-formatted logging for all operations
- Include: timestamp, event type, agent name, duration, status
- Log workflow start, each agent step, and completion
- Create log analysis script to extract metrics
- Part 3: Cost Tracking Dashboard (25%)
- Track input/output tokens for each agent call
- Calculate costs using Claude pricing ($3/$15 per 1M tokens)
- Generate cost report by agent and by workflow
- Create simple visualization (text-based or matplotlib)
- Part 4: Failure Recovery Workflow (25%)
- Design workflow that can recover from agent failures
- Implement state persistence (save/load workflow state); a persistence sketch follows this task list
- Add fallback responses for failed agents
- Test recovery by simulating failures
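A hedged sketch for the Part 4 persistence piece is shown below; the file name and helper functions are our own choices, not a required interface.

```python
# Sketch: persist the workflow state to disk so a failed run can be resumed.
import json
from pathlib import Path

STATE_FILE = Path("workflow_state.json")

def save_state(state: dict) -> None:
    """Persist the current WorkflowState after each successful step."""
    STATE_FILE.write_text(json.dumps(state, indent=2))

def load_state() -> dict | None:
    """Return the last saved state, or None if this is a fresh run."""
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())
    return None

# Resume pattern: skip any step whose output is already present in the saved state.
state = load_state() or {
    "topic": "AI in healthcare diagnostics", "research_data": "",
    "analysis": "", "final_report": "", "error": "", "metadata": {},
}
if not state["research_data"]:
    pass  # run the research node here, then call save_state(state)
```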
Deliverables Checklist
- Python code with complete error handling and retry logic
- Structured JSON logging throughout workflow
- Cost tracking system with breakdown by agent
- State persistence (save/resume workflows)
- Test suite demonstrating failure recovery
- README with architecture notes and usage examples
Grading Rubric
| Component | Points | Criteria |
|---|---|---|
| Retry Logic | 25 | Exponential backoff, handles all error types, logged attempts |
| Structured Logging | 25 | JSON format, comprehensive coverage, analysis script |
| Cost Tracking | 25 | Accurate token counting, cost calculation, visual dashboard |
| Failure Recovery | 25 | State persistence, recovery logic, fallback handling |
Submission Details
Due Date: 1 week from session date
Format: GitHub repository with README and examples
Include: Code, sample logs, cost reports, test results
Bonus Challenges (+10 points each)
- Prometheus Metrics: Export metrics to Prometheus format for Grafana dashboards
- Circuit Breaker: Implement circuit breaker pattern to prevent cascading failures (a minimal sketch follows this list)
- Async Execution: Parallelize independent agent operations using asyncio
- Rate Limiting: Implement token bucket algorithm for API rate limit management
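As a starting point for the circuit breaker bonus, here is a minimal in-process sketch; the thresholds and class name are our own choices.

```python
# Minimal circuit-breaker sketch: stop calling a failing dependency for a cool-down period.
import time

class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after: float = 60.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at: float | None = None

    def allow(self) -> bool:
        # Closed circuit: calls allowed. Open circuit: block until the cool-down expires.
        if self.opened_at is None:
            return True
        if time.time() - self.opened_at >= self.reset_after:
            self.opened_at = None  # half-open: allow one trial call
            self.failures = 0
            return True
        return False

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = time.time()  # trip the breaker

# Usage idea: wrap each agent's API call.
# breaker = CircuitBreaker()
# if breaker.allow():
#     try:
#         ...call the API...
#         breaker.record_success()
#     except anthropic.APIError:
#         breaker.record_failure()
```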
Session Resources
Documentation
Code Examples
- 01-starter-agent-workflow.py
- 01-solution-agent-workflow.py
- cost_tracker.py
- structured_logging.py
Tutorials & Articles
- Video: Production AI Patterns (30 min)
- Article: Enterprise AI Market Analysis 2024
- Guide: Monitoring AI Agents in Production