AI Agents in Practice
Building Real Products with Autonomous Agent Teams
MIT GenAI Global · March 23, 2026
The Personal AI Trajectory
💬
Yesterday
Chat in a browser tab.
You go to it.
Stateless. Forgets everything.
📱
Now
Always-on in your pocket.
Persistent memory.
Knows your context across channels.
🧠
Next
Anticipates before you ask.
Manages routines autonomously.
Acts on your behalf.
Proactive, not reactive.
🌐
Endgame
Manages your entire digital life.
Star Trek: amplifies you.
WALL-E: replaces you.
Same technology. Different design choices.
Now that AI can think — do we let it do all our thinking, or use it like they do at Starfleet Academy to amplify our learning and impact?
The AI is the engineer,
not the engine.
Build deterministic systems with nondeterministic tools.
How It Works
Layer 1 — Personal AI Layer
OpenClaw: messaging, memory, tools, channels
↓
Layer 2 — Coding Orchestration
Claude Code: parallel sub-agents, file I/O, verification
↓
Layer 3 — Complex Orchestration
n8n / custom engines: DAGs, governance, retry logic
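A minimal sketch of what Layer 3 adds: running tasks in dependency order with retry logic. The `run_dag` function, task names, and retry count here are illustrative stand-ins, not the API of n8n or any real engine — production orchestrators add persistence, governance, and parallelism on top of this core loop.

```python
from typing import Callable

def run_dag(tasks: dict[str, Callable[[], None]],
            deps: dict[str, list[str]],
            retries: int = 2) -> list[str]:
    """Run tasks in dependency order, retrying each up to `retries` times."""
    done: list[str] = []
    while len(done) < len(tasks):
        # Pick any task whose dependencies are all complete.
        ready = [n for n in tasks if n not in done
                 and all(d in done for d in deps.get(n, []))]
        if not ready:
            raise RuntimeError("cycle or unsatisfiable dependency")
        name = ready[0]
        for attempt in range(retries + 1):
            try:
                tasks[name]()
                break
            except Exception:
                if attempt == retries:
                    raise  # out of retries: surface the failure loudly
        done.append(name)
    return done
```

The point of the sketch: failures are retried and then *raised*, never swallowed — the opposite of an agent that reports success regardless.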
Spec-Driven Development
The Promise
- Write the perfect spec upfront
- Agents execute flawlessly
- Reproducible — same prompt, same result
- Scalable — run it 1,000 times
The Reality
- You can't spec what you haven't built
- Building reveals what the spec missed
- Human taste can't be encoded in text
- The spec is already outdated by the time you finish writing it
This is waterfall. The spec takes longer than the build.
If you wouldn't let an unsupervised intern do it,
don't let an unsupervised agent do it.
Scope their work. Review their output. Limit their blast radius.
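"Scope their work" can be enforced in code rather than in a prompt: give the agent a tool allowlist so its blast radius is bounded by construction. The `ScopedAgent` class and tool names below are hypothetical, not any framework's API — a sketch of the pattern.

```python
from typing import Any, Callable

class ScopedAgent:
    """Wraps an agent's tool access in an explicit allowlist."""

    def __init__(self, tools: dict[str, Callable[..., Any]], allowed: set[str]):
        self._tools = tools
        self._allowed = allowed

    def call(self, name: str, *args: Any, **kwargs: Any) -> Any:
        # Anything outside the allowlist fails closed, loudly.
        if name not in self._allowed:
            raise PermissionError(f"tool '{name}' outside this agent's scope")
        return self._tools[name](*args, **kwargs)
```

An agent scoped to `{"read_file"}` can read all day; the moment it tries `send_email`, it hits a hard wall instead of a polite reminder in its system prompt.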
The "Yes Man" Problem
- "Sure, I can schedule that for 9 AM!" — it didn't
- "I'll coordinate all 8 agents in parallel!" — 3 of them silently failed
- "I'll monitor that and alert you!" — it forgot 10 minutes later
- "I've verified the output is correct!" — it checked its own work
Separate verifier. Checkpoints. Humans in the loop for anything that matters.
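The separate-verifier pattern above can be sketched in a few lines: the producing agent never grades its own work, and if the verifier never accepts, the loop escalates instead of pretending. `produce` and `verify` stand in for two independent agent calls; the loop shape is an illustrative assumption, not a specific framework's API.

```python
from typing import Callable, Optional

def produce_with_verification(
    produce: Callable[[Optional[str]], str],
    verify: Callable[[str], tuple[bool, str]],
    max_rounds: int = 3,
) -> str:
    """Loop until an independent verifier accepts the output."""
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        output = produce(feedback)        # producer sees prior feedback
        ok, feedback = verify(output)     # different agent, different context
        if ok:
            return output
    # Never silently ship unverified work: hand it to a human.
    raise RuntimeError("verifier never accepted; escalate to a human")
```

The design choice that matters: `verify` is a separate call with its own context, so "I've verified the output is correct!" is no longer the producer checking its own work.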
The Opinionated Framework Problem
- Search: Brave forced as default — we built a workaround
- Coding agents: Codex preferred — even with Claude Code configured
- Image generation: OpenAI assumed — no native alternative
- Model ecosystem: Pulls toward specific providers — friction with Azure/Google/local
Opinionated = fast start, hard to customize. Unopinionated = slow start, full flexibility. Pick one.
Security & Trust
- Your agent can read your messages, files, and calendar. How much of that does it actually need?
- API keys pass through every integration. Each one is an attack surface.
- Agents send emails, post to channels, create infrastructure. You need approval gates on external actions.
- Private context from one session can surface in a group chat. We've seen it happen.
More autonomy = more risk. Design for the worst case.
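An approval gate on external actions can be this simple: read-only actions proceed, anything that leaves the sandbox waits for a human. The action names, the `EXTERNAL_ACTIONS` set, and the `gated` helper are illustrative assumptions, not a real library's interface.

```python
from typing import Callable

# Actions that touch the outside world require human sign-off.
EXTERNAL_ACTIONS = {"send_email", "post_message", "create_infra"}

def gated(action: str, run: Callable[[], str],
          approve: Callable[[str], bool]) -> str:
    """Run `action` only if it is internal or a human approves it."""
    if action in EXTERNAL_ACTIONS and not approve(action):
        return "blocked: awaiting human approval"
    return run()
```

Designing for the worst case means the gate fails closed: if the approval callback errors or returns nothing, the email does not go out.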
Real Numbers
A team meeting to discuss building these three products would take longer than actually building them.
- Each product: built, integrated, and QA-verified by an independent agent
- All three ran in parallel. Total wall time: about 14 minutes.
- Verifier agents caught and fixed real bugs before delivery
Start Small, Scale Smart
Week 1
Crawl
Single agent, one repetitive task
Report generation, data formatting
→
Month 1
Walk
Agent + verifier, human review
Internal tools, dashboards
→
Quarter 1
Run
Agent teams, parallel builds
Multi-surface integration
Pick the most repetitive workflow first — not the hardest one.