Learning Objectives

By the end of this session, you will be able to:

  • Understand the critical differences between prototype and production AI agents
  • Build multi-agent workflows with LangGraph (researcher → analyst → writer pattern)
  • Implement production error handling with retries, timeouts, and graceful degradation
  • Track costs and performance metrics for LLM-based workflows
  • Design for scale, reliability, and observability in enterprise environments

Market Overview: The $236B Opportunity

The Numbers
$5.9B → $236B market by 2030 (44% CAGR) | 85% enterprise AI adoption expected by 2025 | 52% of organizations stuck in the pilot phase | $200K-400K average annual savings per workflow

The Production Gap Problem

52% Stuck in Pilot Phase
Most organizations build impressive prototypes but fail to deploy them to production. Why? Lack of error handling, no cost controls, missing monitoring, and unreliable performance. This course bridges that gap.
ROI: $200K-400K Annual Savings
Production-grade agent workflows automate repetitive knowledge work: research, analysis, report generation, data processing. Companies report 40-60% reduction in manual labor costs and 3x faster delivery times.
Enterprise Requirements
Production systems need: 99.9% uptime, comprehensive logging, cost tracking, security compliance, graceful failure handling, and real-time monitoring. Prototypes have none of these.

Real-World Impact

Case Study: Financial Research Automation
A hedge fund deployed a multi-agent research workflow (data gatherer → analyst → report writer) that processes 500+ company filings daily. Result: 75% reduction in analyst time, $350K annual savings, and 24-hour faster insights.

Technology Stack

┌─────────────────────────────────────┐
│  Monitoring Layer                   │
│  (Prometheus, Grafana, Logs)        │
└─────────────────────────────────────┘
           ↓
┌─────────────────────────────────────┐
│  Orchestration Layer (LangGraph)    │
│  - Workflow State Management        │
│  - Agent Coordination               │
│  - Error Recovery                   │
└─────────────────────────────────────┘
           ↓
┌─────────────────────────────────────┐
│  Agent Layer (Multi-Agent System)   │
│  Researcher → Analyst → Writer      │
└─────────────────────────────────────┘
           ↓
┌─────────────────────────────────────┐
│  LLM Layer (Claude 3.5 Sonnet)      │
│  - Reasoning & Decisions            │
│  - Content Generation               │
└─────────────────────────────────────┘
           ↓
┌─────────────────────────────────────┐
│  Data Sources & APIs                │
│  (Search, Databases, External APIs) │
└─────────────────────────────────────┘

Key Technologies

  • Python 3.11+: Async/await for concurrent agent execution
  • LangGraph: Production-grade workflow orchestration with state persistence
  • Claude 3.5 Sonnet: High-quality reasoning for complex tasks
  • Structured Logging: JSON logs for observability and debugging
  • Prometheus/Grafana: Real-time monitoring and alerting

What You'll Build

By the end of this session, you'll have:

  • ✓ Multi-agent research workflow (3 coordinated agents)
  • ✓ Production error handling (retries, timeouts, circuit breakers)
  • ✓ Structured JSON logging for debugging
  • ✓ Cost tracking per agent execution
  • ✓ Performance monitoring dashboard

Starter Code vs Solution: Side-by-Side Comparison

01-starter-agent-workflow.py (Python)
"""
Multi-Agent Workflow Starter Code
Session 1: From Prototype to Production

Complete the TODO sections to build a production-ready
multi-agent research workflow.
"""

import os
import json
import time
from typing import Dict, List, TypedDict
from dataclasses import dataclass

import anthropic
from langgraph.graph import StateGraph, END

# Configuration
class Config:
    ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY")
    MODEL = "claude-3-5-sonnet-20241022"
    MAX_TOKENS = 2048

# Workflow State
class WorkflowState(TypedDict):
    topic: str
    research_data: str
    analysis: str
    final_report: str
    error: str
    metadata: Dict

# Agent Classes
class ResearchAgent:
    def __init__(self, config: Config):
        self.config = config
        self.client = anthropic.Anthropic(
            api_key=config.ANTHROPIC_API_KEY
        )

    def research(self, topic: str) -> str:
        """
        TODO: Implement research agent

        Steps:
        1. Create prompt asking Claude to research topic
        2. Call Claude API
        3. Return research findings
        4. Add error handling (try/except)
        """
        # TODO: Implement research logic
        return "TODO: Implement research"

class AnalystAgent:
    def __init__(self, config: Config):
        self.config = config
        self.client = anthropic.Anthropic(
            api_key=config.ANTHROPIC_API_KEY
        )

    def analyze(self, research_data: str) -> str:
        """
        TODO: Implement analyst agent

        Takes research data and performs analysis
        """
        # TODO: Implement analysis logic
        return "TODO: Implement analysis"

class WriterAgent:
    def __init__(self, config: Config):
        self.config = config
        self.client = anthropic.Anthropic(
            api_key=config.ANTHROPIC_API_KEY
        )

    def write_report(self, analysis: str) -> str:
        """
        TODO: Implement writer agent

        Takes analysis and creates final report
        """
        # TODO: Implement writing logic
        return "TODO: Implement report writing"

# Workflow Nodes
def research_node(state: WorkflowState) -> WorkflowState:
    """
    TODO: Implement research workflow node

    Steps:
    1. Extract topic from state
    2. Call ResearchAgent
    3. Update state with results
    4. Handle errors gracefully
    """
    # TODO: Implement node logic
    return state

def analysis_node(state: WorkflowState) -> WorkflowState:
    """TODO: Implement analysis node"""
    # TODO: Implement node logic
    return state

def writing_node(state: WorkflowState) -> WorkflowState:
    """TODO: Implement writing node"""
    # TODO: Implement node logic
    return state

# Build Workflow Graph
def create_workflow() -> StateGraph:
    """
    TODO: Build LangGraph workflow

    Flow: research → analysis → writing → END
    """
    workflow = StateGraph(WorkflowState)

    # TODO: Add nodes
    # workflow.add_node("research", research_node)
    # workflow.add_node("analysis", analysis_node)
    # workflow.add_node("writing", writing_node)

    # TODO: Add edges
    # workflow.set_entry_point("research")
    # workflow.add_edge("research", "analysis")
    # workflow.add_edge("analysis", "writing")
    # workflow.add_edge("writing", END)

    # Note: compile() validates the graph and will fail until the
    # entry point and nodes above are added.
    return workflow.compile()

# Main Execution
if __name__ == "__main__":
    # Create workflow
    workflow = create_workflow()

    # Test input
    initial_state = {
        "topic": "Impact of AI on software development",
        "research_data": "",
        "analysis": "",
        "final_report": "",
        "error": "",
        "metadata": {}
    }

    # Run workflow
    # TODO: Execute workflow and print results
    print("Workflow execution: TODO")
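For contrast with the TODO stubs above, here is an illustrative sketch of a completed research step. This is not the official solution file: the client and config values are passed in explicitly, the prompt wording is an assumption, and failures come back as an error string so the workflow can degrade gracefully instead of crashing.

```python
def run_research(client, model: str, max_tokens: int, topic: str) -> str:
    """One LLM call for the research step; returns text or an error string."""
    try:
        response = client.messages.create(
            model=model,
            max_tokens=max_tokens,
            timeout=30,  # never let a call hang indefinitely
            messages=[{
                "role": "user",
                "content": f"Research this topic and summarize the key findings: {topic}",
            }],
        )
        return response.content[0].text
    except Exception as exc:  # in practice, catch anthropic-specific errors
        return f"ERROR: research failed: {exc}"
```

Inside `ResearchAgent.research` you would call this with `self.client`, `self.config.MODEL`, and `self.config.MAX_TOKENS`.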

Cost Tracking Dashboard

Monitor and optimize your agent workflow costs in real-time.

cost_tracker.py (excerpt, Python)
def calculate_workflow_costs(
    workflow_runs: int,
    avg_input_tokens: int,
    avg_output_tokens: int,
    agents_per_run: int = 3
):
    """Calculate total workflow costs at scale"""

    # Claude 3.5 Sonnet pricing (per token)
    COST_INPUT = 3.00 / 1_000_000    # $3 per 1M input tokens
    COST_OUTPUT = 15.00 / 1_000_000  # $15 per 1M output tokens

    total_input = workflow_runs * agents_per_run * avg_input_tokens
    total_output = workflow_runs * agents_per_run * avg_output_tokens

    cost_input = total_input * COST_INPUT
    cost_output = total_output * COST_OUTPUT
    total_cost = cost_input + cost_output

    return {
        "workflow_runs": workflow_runs,
        "total_input_tokens": total_input,
        "total_output_tokens": total_output,
        "cost_input_usd": round(cost_input, 2),
        "cost_output_usd": round(cost_output, 2),
        "total_cost_usd": round(total_cost, 2),
        "cost_per_run": round(total_cost / workflow_runs, 4)
    }

# Example: 1000 workflow runs
result = calculate_workflow_costs(
    workflow_runs=1000,
    avg_input_tokens=2000,  # per agent
    avg_output_tokens=1500  # per agent
)

print(f"Total cost for 1000 runs: ${result['total_cost_usd']}")
print(f"Cost per run: ${result['cost_per_run']}")
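The estimator above is useful for capacity planning; in production you should bill from the actual token counts the API reports per response rather than averages. A minimal per-call tracker along those lines (the `CostTracker` class and its field names are illustrative, not part of the course files):

```python
from dataclasses import dataclass, field

PRICE_INPUT_PER_TOKEN = 3.00 / 1_000_000    # $3 per 1M input tokens
PRICE_OUTPUT_PER_TOKEN = 15.00 / 1_000_000  # $15 per 1M output tokens

@dataclass
class CostTracker:
    """Accumulate real token usage per agent and in total."""
    input_tokens: int = 0
    output_tokens: int = 0
    by_agent: dict = field(default_factory=dict)

    def record(self, agent: str, input_tokens: int, output_tokens: int) -> None:
        # Feed this from response.usage.input_tokens / output_tokens
        self.input_tokens += input_tokens
        self.output_tokens += output_tokens
        cost = (input_tokens * PRICE_INPUT_PER_TOKEN
                + output_tokens * PRICE_OUTPUT_PER_TOKEN)
        self.by_agent[agent] = round(self.by_agent.get(agent, 0.0) + cost, 6)

    @property
    def total_cost_usd(self) -> float:
        return round(self.input_tokens * PRICE_INPUT_PER_TOKEN
                     + self.output_tokens * PRICE_OUTPUT_PER_TOKEN, 6)
```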

Hands-On Exercise: Production Workflow

Prerequisites: Python 3.11+, Anthropic API key, basic understanding of async programming

Exercise Objective

Transform a prototype multi-agent workflow into a production-ready system with error handling, logging, and cost tracking.

Steps

  1. Error Handling (15 min) - Add try/except blocks, retry logic with exponential backoff, and timeout handling
  2. Structured Logging (10 min) - Implement JSON logging for all agent operations and workflow steps
  3. Cost Tracking (10 min) - Track input/output tokens and calculate costs per agent and per workflow
  4. Performance Monitoring (10 min) - Add timing metrics and success/failure tracking
  5. Testing (10 min) - Run 5 workflows with different topics and verify all metrics
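Step 1 above can be sketched as a small reusable helper. The retryable exception tuple and delays here are placeholders to adapt to your client (e.g. anthropic.RateLimitError), and the injectable sleep function makes the backoff testable:

```python
import time

def with_retries(fn, max_attempts: int = 3, base_delay: float = 2.0,
                 retryable=(TimeoutError, ConnectionError), sleep=time.sleep):
    """Call fn(), retrying on retryable errors with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retryable:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            wait = base_delay * (2 ** attempt)  # 2s, 4s, 8s, ...
            sleep(wait)
```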

Success Criteria

  • Workflow handles API timeouts gracefully (no crashes)
  • Retries 3 times with exponential backoff on failures
  • All operations logged in structured JSON format
  • Cost tracking accurate to 4 decimal places
  • Complete workflow metrics report generated
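For the structured-JSON criterion, a compact formatter built on the standard logging module is enough. The field names (event, agent, duration_s, status) are an illustrative schema, not a required one:

```python
import json
import logging
import sys
import time

class JsonFormatter(logging.Formatter):
    """Emit each log record as one JSON object per line."""
    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S",
                                       time.gmtime(record.created)),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Merge structured fields passed via logger.info(..., extra={...})
        for key in ("event", "agent", "duration_s", "status"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

logger = logging.getLogger("workflow")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)
```

Usage: `logger.info("agent finished", extra={"event": "agent_end", "agent": "researcher", "duration_s": 1.2, "status": "ok"})`.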

Test Scenarios

# Test your production workflow with these scenarios:

test_scenarios = [
    {
        "name": "Happy Path",
        "topic": "AI in healthcare diagnostics",
        "expected": "Successful completion with full report"
    },
    {
        "name": "Long Topic",
        "topic": "Comprehensive analysis of blockchain technology adoption in enterprise financial systems including security, scalability, and regulatory compliance",
        "expected": "Handle token limits gracefully"
    },
    {
        "name": "Rate Limit Simulation",
        "topic": "Run 10 workflows in parallel",
        "expected": "Retry logic handles rate limits"
    },
    {
        "name": "Cost Analysis",
        "topic": "Any topic",
        "expected": "Accurate cost breakdown per agent"
    }
]

# Verify metrics
required_metrics = [
    "total_input_tokens",
    "total_output_tokens",
    "total_cost_usd",
    "duration_seconds",
    "status",
    "agent_breakdown"
]

Common Issues & Solutions

Retry Logic Not Working
Issue: Retries not triggering on errors
Solution: Ensure you're catching the correct exception types (anthropic.APITimeoutError, anthropic.RateLimitError). Use exponential backoff: wait_time = retry_delay * (2 ** attempt). Log each retry attempt for debugging.
Cost Tracking Inaccurate
Issue: Costs don't match API usage
Solution: Verify you're using response.usage.input_tokens and response.usage.output_tokens (not estimating). Check pricing constants match current rates. Accumulate costs across all retries and agents.
Workflow Hangs Indefinitely
Issue: No timeout set on API calls
Solution: Always set timeout parameter in client.messages.create(timeout=30). Use asyncio.wait_for() for async calls. Implement circuit breaker pattern for repeated failures.
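The circuit-breaker pattern mentioned above can be sketched in a few lines: after threshold consecutive failures the breaker opens and rejects calls until cooldown seconds pass, then allows one trial call. The numbers and the injectable clock are illustrative:

```python
import time

class CircuitBreaker:
    """Stop calling a failing dependency until a cooldown has elapsed."""
    def __init__(self, threshold: int = 3, cooldown: float = 60.0,
                 clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open: skipping call")
            self.opened_at = None  # cooldown elapsed: half-open, try once
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()  # trip the breaker
            raise
        self.failures = 0  # success resets the failure count
        return result
```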

Extension Challenges

Bonus: Add Prometheus metrics export, implement workflow state persistence (resume failed workflows), create Grafana dashboard, or add circuit breaker pattern for cascading failures.

Session 1 Quiz: Production AI Workflows

Time limit: 15 minutes | Passing score: 70%
Question 1 of 10 (10 points)
What percentage of organizations are stuck in the pilot phase with AI agents?
Answer: 52% of organizations struggle to move AI agents from prototype to production, highlighting the critical need for production-ready skills.
Question 2 of 10 (10 points)
What is the average annual savings from deploying production agent workflows?
Answer: Companies report $200K-400K in annual savings by automating knowledge work with multi-agent workflows.
Question 3 of 10 (10 points)
Which of the following is NOT a critical production requirement?
Answer: Production systems need error handling, logging, and monitoring. UI aesthetics are secondary to reliability and observability.
Question 4 of 10 (10 points)
What retry strategy should you use for transient API failures?
Answer: Exponential backoff prevents overwhelming the API during issues and gives it time to recover. The wait time doubles with each retry.
Question 5 of 10 (10 points)
What is the purpose of structured logging in production systems?
Answer: Structured JSON logs enable automated parsing, searching, and analysis by monitoring tools like the ELK stack or Datadog.
Question 6 of 10 (10 points)
How are Claude 3.5 Sonnet API costs calculated?
Answer: Costs are calculated per token: $3/1M input tokens and $15/1M output tokens for Claude 3.5 Sonnet.
Question 7 of 10 (10 points)
What is the recommended default timeout for LLM API calls?
Answer: 30 seconds is a good balance: long enough for complex requests, but it prevents hanging indefinitely.
Question 8 of 10 (10 points)
In the multi-agent workflow pattern, what is the typical flow?
Answer: The sequential pattern (research → analysis → writing) mirrors how humans work and produces higher-quality results.
Question 9 of 10 (10 points)
What should you do when an agent step fails after max retries?
Answer: Graceful failure with detailed logging allows debugging and potential manual recovery without crashing the system.
Question 10 of 10 (10 points)
Which metric is MOST important for production monitoring?
Answer: Service Level Indicators (SLIs) like success rate, latency, and error rates directly measure user-facing reliability.

Homework Assignment: Production-Ready Workflow

Build a production-grade multi-agent workflow with enterprise-level reliability and monitoring.

Assignment Overview

Transform your Session 1 workflow into a production system that meets enterprise standards for error handling, observability, cost management, and recovery.

Tasks

  1. Part 1: Retry Logic with Exponential Backoff (25%)
    • Implement retry logic for all API calls (3 attempts)
    • Use exponential backoff: wait_time = 2 * (2 ** attempt)
    • Handle specific exceptions: TimeoutError, RateLimitError, APIError
    • Log each retry attempt with structured data
  2. Part 2: Structured Logging (JSON) (25%)
    • Implement JSON-formatted logging for all operations
    • Include: timestamp, event type, agent name, duration, status
    • Log workflow start, each agent step, and completion
    • Create log analysis script to extract metrics
  3. Part 3: Cost Tracking Dashboard (25%)
    • Track input/output tokens for each agent call
    • Calculate costs using Claude pricing ($3/$15 per 1M tokens)
    • Generate cost report by agent and by workflow
    • Create simple visualization (text-based or matplotlib)
  4. Part 4: Failure Recovery Workflow (25%)
    • Design workflow that can recover from agent failures
    • Implement state persistence (save/load workflow state)
    • Add fallback responses for failed agents
    • Test recovery by simulating failures
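Part 4's state persistence can be as simple as checkpointing the WorkflowState dict to disk after each agent step and picking the resume point from which fields are already filled. The file name and resume policy here are assumptions, not a prescribed format:

```python
import json
from pathlib import Path
from typing import Optional

def save_state(state: dict, path: str = "workflow_state.json") -> None:
    """Checkpoint the workflow state; call after every completed agent step."""
    Path(path).write_text(json.dumps(state, indent=2))

def load_state(path: str = "workflow_state.json") -> Optional[dict]:
    """Return the last checkpoint, or None when starting fresh."""
    p = Path(path)
    return json.loads(p.read_text()) if p.exists() else None

def resume_point(state: dict) -> str:
    """Pick the first unfinished step based on which fields are filled."""
    if not state.get("research_data"):
        return "research"
    if not state.get("analysis"):
        return "analysis"
    return "writing"
```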

Deliverables Checklist

  • Python code with complete error handling and retry logic
  • Structured JSON logging throughout workflow
  • Cost tracking system with breakdown by agent
  • State persistence (save/resume workflows)
  • Test suite demonstrating failure recovery
  • README with architecture notes and usage examples

Grading Rubric

Component           Points  Criteria
Retry Logic         25      Exponential backoff, handles all error types, logged attempts
Structured Logging  25      JSON format, comprehensive coverage, analysis script
Cost Tracking       25      Accurate token counting, cost calculation, visual dashboard
Failure Recovery    25      State persistence, recovery logic, fallback handling

Submission Details

Due Date: 1 week from session date
Format: GitHub repository with README and examples
Include: Code, sample logs, cost reports, test results

Bonus Challenges (+10 points each)

  • Prometheus Metrics: Export metrics to Prometheus format for Grafana dashboards
  • Circuit Breaker: Implement circuit breaker pattern to prevent cascading failures
  • Async Execution: Parallelize independent agent operations using asyncio
  • Rate Limiting: Implement token bucket algorithm for API rate limit management
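The token bucket in the last bonus can be sketched as follows: the bucket holds up to `capacity` tokens, refills at `rate` tokens per second, and a request proceeds only if it can take a token. The injectable clock is for testability; the numbers are illustrative:

```python
import time

class TokenBucket:
    """Allow bursts up to `capacity`, sustained throughput of `rate`/sec."""
    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.clock = clock
        self.last = clock()

    def try_acquire(self, tokens: float = 1.0) -> bool:
        """Take tokens if available; return False when rate-limited."""
        now = self.clock()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= tokens:
            self.tokens -= tokens
            return True
        return False
```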

Session Resources

Code Examples

  • 01-starter-agent-workflow.py
  • 01-solution-agent-workflow.py
  • cost_tracker.py
  • structured_logging.py
