Learning Objectives

By the end of this session, you will be able to:

  • Understand the critical differences between prototype and production AI agents
  • Build multi-agent workflows with LangGraph (researcher → analyst → writer pattern)
  • Implement production error handling with retries, timeouts, and graceful degradation
  • Track costs and performance metrics for LLM-based workflows
  • Design for scale, reliability, and observability in enterprise environments

Market Overview: The $236B Opportunity

The Numbers
$5.9B → $236B market by 2030 (44% CAGR) | 85% enterprise AI adoption expected by 2025 | 52% of organizations stuck in the pilot phase | $200K-400K average annual savings per workflow

The Production Gap Problem

52% Stuck in Pilot Phase
Most organizations build impressive prototypes but fail to deploy them to production. Why? Lack of error handling, no cost controls, missing monitoring, and unreliable performance. This course bridges that gap.
ROI: $200K-400K Annual Savings
Production-grade agent workflows automate repetitive knowledge work: research, analysis, report generation, data processing. Companies report 40-60% reduction in manual labor costs and 3x faster delivery times.
Enterprise Requirements
Production systems need: 99.9% uptime, comprehensive logging, cost tracking, security compliance, graceful failure handling, and real-time monitoring. Prototypes have none of these.

Real-World Impact

Case Study: Financial Research Automation
A hedge fund deployed a multi-agent research workflow (data gatherer → analyst → report writer) that processes 500+ company filings daily. Result: 75% reduction in analyst time, $350K annual savings, and 24-hour faster insights.

Technology Stack

┌─────────────────────────────────────┐
│  Monitoring Layer                   │
│  (Prometheus, Grafana, Logs)        │
└─────────────────────────────────────┘
           ↓
┌─────────────────────────────────────┐
│  Orchestration Layer (LangGraph)    │
│  - Workflow State Management        │
│  - Agent Coordination               │
│  - Error Recovery                   │
└─────────────────────────────────────┘
           ↓
┌─────────────────────────────────────┐
│  Agent Layer (Multi-Agent System)   │
│  Researcher → Analyst → Writer      │
└─────────────────────────────────────┘
           ↓
┌─────────────────────────────────────┐
│  LLM Layer (Claude 3.5 Sonnet)      │
│  - Reasoning & Decisions            │
│  - Content Generation               │
└─────────────────────────────────────┘
           ↓
┌─────────────────────────────────────┐
│  Data Sources & APIs                │
│  (Search, Databases, External APIs) │
└─────────────────────────────────────┘

Key Technologies

  • Python 3.11+: Async/await for concurrent agent execution
  • LangGraph: Production-grade workflow orchestration with state persistence
  • Claude 3.5 Sonnet: High-quality reasoning for complex tasks
  • Structured Logging: JSON logs for observability and debugging
  • Prometheus/Grafana: Real-time monitoring and alerting

What You'll Build

By the end of this session, you'll have:

  • ✓ Multi-agent research workflow (3 coordinated agents)
  • ✓ Production error handling (retries, timeouts, circuit breakers)
  • ✓ Structured JSON logging for debugging
  • ✓ Cost tracking per agent execution
  • ✓ Performance monitoring dashboard

Starter Code vs Solution: Side-by-Side Comparison

01-starter-agent-workflow.py (Python)
"""
Multi-Agent Workflow Starter Code
Session 1: From Prototype to Production

Complete the TODO sections to build a production-ready
multi-agent research workflow.
"""

import os
import json
import time
from typing import Dict, List, TypedDict
from dataclasses import dataclass

import anthropic
from langgraph.graph import StateGraph, END

# Configuration
class Config:
    ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY")
    MODEL = "claude-3-5-sonnet-20241022"
    MAX_TOKENS = 2048

# Workflow State
class WorkflowState(TypedDict):
    topic: str
    research_data: str
    analysis: str
    final_report: str
    error: str
    metadata: Dict

# Agent Classes
class ResearchAgent:
    def __init__(self, config: Config):
        self.config = config
        self.client = anthropic.Anthropic(
            api_key=config.ANTHROPIC_API_KEY
        )

    def research(self, topic: str) -> str:
        """
        TODO: Implement research agent

        Steps:
        1. Create prompt asking Claude to research topic
        2. Call Claude API
        3. Return research findings
        4. Add error handling (try/except)
        """
        # TODO: Implement research logic
        return "TODO: Implement research"

class AnalystAgent:
    def __init__(self, config: Config):
        self.config = config
        self.client = anthropic.Anthropic(
            api_key=config.ANTHROPIC_API_KEY
        )

    def analyze(self, research_data: str) -> str:
        """
        TODO: Implement analyst agent

        Takes research data and performs analysis
        """
        # TODO: Implement analysis logic
        return "TODO: Implement analysis"

class WriterAgent:
    def __init__(self, config: Config):
        self.config = config
        self.client = anthropic.Anthropic(
            api_key=config.ANTHROPIC_API_KEY
        )

    def write_report(self, analysis: str) -> str:
        """
        TODO: Implement writer agent

        Takes analysis and creates final report
        """
        # TODO: Implement writing logic
        return "TODO: Implement report writing"

# Workflow Nodes
def research_node(state: WorkflowState) -> WorkflowState:
    """
    TODO: Implement research workflow node

    Steps:
    1. Extract topic from state
    2. Call ResearchAgent
    3. Update state with results
    4. Handle errors gracefully
    """
    # TODO: Implement node logic
    return state

def analysis_node(state: WorkflowState) -> WorkflowState:
    """TODO: Implement analysis node"""
    # TODO: Implement node logic
    return state

def writing_node(state: WorkflowState) -> WorkflowState:
    """TODO: Implement writing node"""
    # TODO: Implement node logic
    return state

# Build Workflow Graph
def create_workflow() -> StateGraph:
    """
    TODO: Build LangGraph workflow

    Flow: research → analysis → writing → END
    """
    workflow = StateGraph(WorkflowState)

    # TODO: Add nodes
    # workflow.add_node("research", research_node)
    # workflow.add_node("analysis", analysis_node)
    # workflow.add_node("writing", writing_node)

    # TODO: Add edges
    # workflow.set_entry_point("research")
    # workflow.add_edge("research", "analysis")
    # workflow.add_edge("analysis", "writing")
    # workflow.add_edge("writing", END)

    # Note: compile() validates the graph and will fail until the
    # entry point and nodes above are added.
    return workflow.compile()

# Main Execution
if __name__ == "__main__":
    # Create workflow
    workflow = create_workflow()

    # Test input
    initial_state = {
        "topic": "Impact of AI on software development",
        "research_data": "",
        "analysis": "",
        "final_report": "",
        "error": "",
        "metadata": {}
    }

    # Run workflow
    # TODO: Execute workflow and print results
    print("Workflow execution: TODO")
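For contrast with the TODO stubs above, here is an illustrative sketch of a completed research step. This is not the official solution file: the client and config values are passed in explicitly, the prompt wording is an assumption, and failures come back as an error string so the workflow can degrade gracefully instead of crashing.

```python
def run_research(client, model: str, max_tokens: int, topic: str) -> str:
    """One LLM call for the research step; returns text or an error string."""
    try:
        response = client.messages.create(
            model=model,
            max_tokens=max_tokens,
            timeout=30,  # never let a call hang indefinitely
            messages=[{
                "role": "user",
                "content": f"Research this topic and summarize the key findings: {topic}",
            }],
        )
        return response.content[0].text
    except Exception as exc:  # in practice, catch anthropic-specific errors
        return f"ERROR: research failed: {exc}"
```

Inside `ResearchAgent.research` you would call this with `self.client`, `self.config.MODEL`, and `self.config.MAX_TOKENS`.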

Cost Tracking Dashboard

Monitor and optimize your agent workflow costs in real-time.

cost_tracker.py (excerpt, Python)
def calculate_workflow_costs(
    workflow_runs: int,
    avg_input_tokens: int,
    avg_output_tokens: int,
    agents_per_run: int = 3
):
    """Calculate total workflow costs at scale"""

    # Claude 3.5 Sonnet pricing (per token)
    COST_INPUT = 3.00 / 1_000_000    # $3 per 1M input tokens
    COST_OUTPUT = 15.00 / 1_000_000  # $15 per 1M output tokens

    total_input = workflow_runs * agents_per_run * avg_input_tokens
    total_output = workflow_runs * agents_per_run * avg_output_tokens

    cost_input = total_input * COST_INPUT
    cost_output = total_output * COST_OUTPUT
    total_cost = cost_input + cost_output

    return {
        "workflow_runs": workflow_runs,
        "total_input_tokens": total_input,
        "total_output_tokens": total_output,
        "cost_input_usd": round(cost_input, 2),
        "cost_output_usd": round(cost_output, 2),
        "total_cost_usd": round(total_cost, 2),
        "cost_per_run": round(total_cost / workflow_runs, 4)
    }

# Example: 1000 workflow runs
result = calculate_workflow_costs(
    workflow_runs=1000,
    avg_input_tokens=2000,  # per agent
    avg_output_tokens=1500  # per agent
)

print(f"Total cost for 1000 runs: ${result['total_cost_usd']}")
print(f"Cost per run: ${result['cost_per_run']}")
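The estimator above is useful for capacity planning; in production you should bill from the actual token counts the API reports per response rather than averages. A minimal per-call tracker along those lines (the `CostTracker` class and its field names are illustrative, not part of the course files):

```python
from dataclasses import dataclass, field

PRICE_INPUT_PER_TOKEN = 3.00 / 1_000_000    # $3 per 1M input tokens
PRICE_OUTPUT_PER_TOKEN = 15.00 / 1_000_000  # $15 per 1M output tokens

@dataclass
class CostTracker:
    """Accumulate real token usage per agent and in total."""
    input_tokens: int = 0
    output_tokens: int = 0
    by_agent: dict = field(default_factory=dict)

    def record(self, agent: str, input_tokens: int, output_tokens: int) -> None:
        # Feed this from response.usage.input_tokens / output_tokens
        self.input_tokens += input_tokens
        self.output_tokens += output_tokens
        cost = (input_tokens * PRICE_INPUT_PER_TOKEN
                + output_tokens * PRICE_OUTPUT_PER_TOKEN)
        self.by_agent[agent] = round(self.by_agent.get(agent, 0.0) + cost, 6)

    @property
    def total_cost_usd(self) -> float:
        return round(self.input_tokens * PRICE_INPUT_PER_TOKEN
                     + self.output_tokens * PRICE_OUTPUT_PER_TOKEN, 6)
```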

Hands-On Exercise: Production Workflow

Prerequisites: Python 3.11+, Anthropic API key, basic understanding of async programming

Exercise Objective

Transform a prototype multi-agent workflow into a production-ready system with error handling, logging, and cost tracking.

Steps

  1. Error Handling (15 min) - Add try/except blocks, retry logic with exponential backoff, and timeout handling
  2. Structured Logging (10 min) - Implement JSON logging for all agent operations and workflow steps
  3. Cost Tracking (10 min) - Track input/output tokens and calculate costs per agent and per workflow
  4. Performance Monitoring (10 min) - Add timing metrics and success/failure tracking
  5. Testing (10 min) - Run 5 workflows with different topics and verify all metrics
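Step 1 above can be sketched as a small reusable helper. The retryable exception tuple and delays here are placeholders to adapt to your client (e.g. anthropic.RateLimitError), and the injectable sleep function makes the backoff testable:

```python
import time

def with_retries(fn, max_attempts: int = 3, base_delay: float = 2.0,
                 retryable=(TimeoutError, ConnectionError), sleep=time.sleep):
    """Call fn(), retrying on retryable errors with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retryable:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            wait = base_delay * (2 ** attempt)  # 2s, 4s, 8s, ...
            sleep(wait)
```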

Success Criteria

  • Workflow handles API timeouts gracefully (no crashes)
  • Retries 3 times with exponential backoff on failures
  • All operations logged in structured JSON format
  • Cost tracking accurate to 4 decimal places
  • Complete workflow metrics report generated
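For the structured-JSON criterion, a compact formatter built on the standard logging module is enough. The field names (event, agent, duration_s, status) are an illustrative schema, not a required one:

```python
import json
import logging
import sys
import time

class JsonFormatter(logging.Formatter):
    """Emit each log record as one JSON object per line."""
    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S",
                                       time.gmtime(record.created)),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Merge structured fields passed via logger.info(..., extra={...})
        for key in ("event", "agent", "duration_s", "status"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

logger = logging.getLogger("workflow")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)
```

Usage: `logger.info("agent finished", extra={"event": "agent_end", "agent": "researcher", "duration_s": 1.2, "status": "ok"})`.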

Test Scenarios

# Test your production workflow with these scenarios:

test_scenarios = [
    {
        "name": "Happy Path",
        "topic": "AI in healthcare diagnostics",
        "expected": "Successful completion with full report"
    },
    {
        "name": "Long Topic",
        "topic": "Comprehensive analysis of blockchain technology adoption in enterprise financial systems including security, scalability, and regulatory compliance",
        "expected": "Handle token limits gracefully"
    },
    {
        "name": "Rate Limit Simulation",
        "topic": "Run 10 workflows in parallel",
        "expected": "Retry logic handles rate limits"
    },
    {
        "name": "Cost Analysis",
        "topic": "Any topic",
        "expected": "Accurate cost breakdown per agent"
    }
]

# Verify metrics
required_metrics = [
    "total_input_tokens",
    "total_output_tokens",
    "total_cost_usd",
    "duration_seconds",
    "status",
    "agent_breakdown"
]

Common Issues & Solutions

Retry Logic Not Working
Issue: Retries not triggering on errors
Solution: Ensure you're catching the correct exception types (anthropic.APITimeoutError, anthropic.RateLimitError). Use exponential backoff: wait_time = retry_delay * (2 ** attempt). Log each retry attempt for debugging.
Cost Tracking Inaccurate
Issue: Costs don't match API usage
Solution: Verify you're using response.usage.input_tokens and response.usage.output_tokens (not estimating). Check pricing constants match current rates. Accumulate costs across all retries and agents.
Workflow Hangs Indefinitely
Issue: No timeout set on API calls
Solution: Always set timeout parameter in client.messages.create(timeout=30). Use asyncio.wait_for() for async calls. Implement circuit breaker pattern for repeated failures.
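The circuit-breaker pattern mentioned above can be sketched in a few lines: after threshold consecutive failures the breaker opens and rejects calls until cooldown seconds pass, then allows one trial call. The numbers and the injectable clock are illustrative:

```python
import time

class CircuitBreaker:
    """Stop calling a failing dependency until a cooldown has elapsed."""
    def __init__(self, threshold: int = 3, cooldown: float = 60.0,
                 clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open: skipping call")
            self.opened_at = None  # cooldown elapsed: half-open, try once
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()  # trip the breaker
            raise
        self.failures = 0  # success resets the failure count
        return result
```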

Extension Challenges

Bonus: Add Prometheus metrics export, implement workflow state persistence (resume failed workflows), create Grafana dashboard, or add circuit breaker pattern for cascading failures.

Session 1 Quiz: Production AI Workflows

Time limit: 15 minutes | Passing score: 70%
Question 1 of 10 (10 points)
What percentage of organizations are stuck in the pilot phase with AI agents?
Answer: 52% of organizations struggle to move AI agents from prototype to production, highlighting the critical need for production-ready skills.
Question 2 of 10 (10 points)
What is the average annual savings from deploying production agent workflows?
Answer: Companies report $200K-400K in annual savings by automating knowledge work with multi-agent workflows.
Question 3 of 10 (10 points)
Which of the following is NOT a critical production requirement?
Answer: Production systems need error handling, logging, and monitoring. UI aesthetics are secondary to reliability and observability.
Question 4 of 10 (10 points)
What retry strategy should you use for transient API failures?
Answer: Exponential backoff prevents overwhelming the API during issues and gives it time to recover. The wait time doubles with each retry.
Question 5 of 10 (10 points)
What is the purpose of structured logging in production systems?
Answer: Structured JSON logs enable automated parsing, searching, and analysis by monitoring tools like the ELK stack or Datadog.
Question 6 of 10 (10 points)
How are Claude 3.5 Sonnet API costs calculated?
Answer: Costs are calculated per token: $3/1M input tokens and $15/1M output tokens for Claude 3.5 Sonnet.
Question 7 of 10 (10 points)
What is the recommended default timeout for LLM API calls?
Answer: 30 seconds is a good balance: long enough for complex requests, but it prevents hanging indefinitely.
Question 8 of 10 (10 points)
In the multi-agent workflow pattern, what is the typical flow?
Answer: The sequential pattern (research → analysis → writing) mirrors how humans work and produces higher-quality results.
Question 9 of 10 (10 points)
What should you do when an agent step fails after max retries?
Answer: Graceful failure with detailed logging allows debugging and potential manual recovery without crashing the system.
Question 10 of 10 (10 points)
Which metric is MOST important for production monitoring?
Answer: Service Level Indicators (SLIs) like success rate, latency, and error rates directly measure user-facing reliability.

Homework Assignment: Production-Ready Workflow

Build a production-grade multi-agent workflow with enterprise-level reliability and monitoring.

Assignment Overview

Transform your Session 1 workflow into a production system that meets enterprise standards for error handling, observability, cost management, and recovery.

Tasks

  1. Part 1: Retry Logic with Exponential Backoff (25%)
    • Implement retry logic for all API calls (3 attempts)
    • Use exponential backoff: wait_time = 2 * (2 ** attempt)
    • Handle specific exceptions: TimeoutError, RateLimitError, APIError
    • Log each retry attempt with structured data
  2. Part 2: Structured Logging (JSON) (25%)
    • Implement JSON-formatted logging for all operations
    • Include: timestamp, event type, agent name, duration, status
    • Log workflow start, each agent step, and completion
    • Create log analysis script to extract metrics
  3. Part 3: Cost Tracking Dashboard (25%)
    • Track input/output tokens for each agent call
    • Calculate costs using Claude pricing ($3/$15 per 1M tokens)
    • Generate cost report by agent and by workflow
    • Create simple visualization (text-based or matplotlib)
  4. Part 4: Failure Recovery Workflow (25%)
    • Design workflow that can recover from agent failures
    • Implement state persistence (save/load workflow state)
    • Add fallback responses for failed agents
    • Test recovery by simulating failures
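Part 4's state persistence can be as simple as checkpointing the WorkflowState dict to disk after each agent step and picking the resume point from which fields are already filled. The file name and resume policy here are assumptions, not a prescribed format:

```python
import json
from pathlib import Path
from typing import Optional

def save_state(state: dict, path: str = "workflow_state.json") -> None:
    """Checkpoint the workflow state; call after every completed agent step."""
    Path(path).write_text(json.dumps(state, indent=2))

def load_state(path: str = "workflow_state.json") -> Optional[dict]:
    """Return the last checkpoint, or None when starting fresh."""
    p = Path(path)
    return json.loads(p.read_text()) if p.exists() else None

def resume_point(state: dict) -> str:
    """Pick the first unfinished step based on which fields are filled."""
    if not state.get("research_data"):
        return "research"
    if not state.get("analysis"):
        return "analysis"
    return "writing"
```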

Deliverables Checklist

  • Python code with complete error handling and retry logic
  • Structured JSON logging throughout workflow
  • Cost tracking system with breakdown by agent
  • State persistence (save/resume workflows)
  • Test suite demonstrating failure recovery
  • README with architecture notes and usage examples

Grading Rubric

Component           Points  Criteria
Retry Logic         25      Exponential backoff, handles all error types, logged attempts
Structured Logging  25      JSON format, comprehensive coverage, analysis script
Cost Tracking       25      Accurate token counting, cost calculation, visual dashboard
Failure Recovery    25      State persistence, recovery logic, fallback handling

Submission Details

Due Date: 1 week from session date
Format: GitHub repository with README and examples
Include: Code, sample logs, cost reports, test results

Bonus Challenges (+10 points each)

  • Prometheus Metrics: Export metrics to Prometheus format for Grafana dashboards
  • Circuit Breaker: Implement circuit breaker pattern to prevent cascading failures
  • Async Execution: Parallelize independent agent operations using asyncio
  • Rate Limiting: Implement token bucket algorithm for API rate limit management
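The token bucket in the last bonus can be sketched as follows: the bucket holds up to `capacity` tokens, refills at `rate` tokens per second, and a request proceeds only if it can take a token. The injectable clock is for testability; the numbers are illustrative:

```python
import time

class TokenBucket:
    """Allow bursts up to `capacity`, sustained throughput of `rate`/sec."""
    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.clock = clock
        self.last = clock()

    def try_acquire(self, tokens: float = 1.0) -> bool:
        """Take tokens if available; return False when rate-limited."""
        now = self.clock()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= tokens:
            self.tokens -= tokens
            return True
        return False
```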

Session Resources

Code Examples

  • 01-starter-agent-workflow.py
  • 01-solution-agent-workflow.py
  • cost_tracker.py
  • structured_logging.py
