Orchestrator Study Packet — Primary Sources

Compiled: 2026-04-05

Topic: AI Agent Orchestration — Architecture, Routing, Evaluation, Autonomous Systems


SECTION 1: FOUNDATIONS OF MULTI-AGENT ORCHESTRATION

Source 1.1: "Generative Agents: Interactive Simulacra of Human Behavior" (Park et al., Stanford/Google, 2023)

Authors: Joon Sung Park, Joseph O'Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, Michael S. Bernstein

Key passage:

"We introduce generative agents — computational software agents that simulate believable human behaviors — and describe an architecture that layers an LLM-based memory module, planning module, and reflection module over a base language model. The generative agents populate an interactive sandbox environment inspired by The Sims, where end users can observe and intervene as the agents go about their daily activities. These activities, in turn, seed emergent social behavior: information diffusion, the formation of opinions, noticing and coordinating with one another, and organized social gatherings."

"Each of the 25 generative agents in our simulation stores a complete record of its experience — every event it has perceived, every message it has sent or received, every action it has taken — in a memory stream. This long-term memory is augmented by a retrieval model that surfaces the most relevant memories given the agent's current situation."

Orchestrator lesson: Multi-agent systems require three layers — memory (state), planning (task decomposition), and reflection (self-evaluation and adjustment). The base LLM is just the reasoning engine; orchestration handles the rest.


Source 1.2: "ChatDev: Communicative Agents for Software Development" (Qian et al., Tsinghua University, 2024)

Authors: Chen Qian, Wei Liu, Hongzhang Liu, et al.

Key passage:

"We propose ChatDev, a virtual software company powered by large language models. In ChatDev, different roles of agents (e.g., CEO, CPO, CTO, programmer, reviewer, tester) collaborate to complete software development tasks through specialized communication and collaboration mechanisms. Each agent is assigned a unique prompt that defines its role, responsibilities, and communication style."

"Communication acts as the primary mechanism for collaboration in ChatDev. Agents engage in three forms of communication: (1) structured dialogue where agents exchange well-defined messages in a task-specific format; (2) natural language discussion where agents freely discuss ideas, problems, and solutions; and (3) task-based interaction where one agent's output directly becomes another's input."

Orchestrator lesson: Role-based agent assignment with structured communication protocols significantly outperforms single-agent execution on complex tasks. The key architectural decision is not which model to use, but how agents communicate: structured dialogue, free discussion, or pipeline handoff.


Source 1.3: "AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation" (Wu et al., Microsoft Research, 2023)

Authors: Qingyun Wu, Gagan Bansal, Jieyu Zhang, et al.

Key passage:

"AutoGen is an open-source multi-agent programming framework that enables the development of LLM applications using multiple agents that converse with each other to solve tasks. AutoGen agents are customizable, conversable, and seamlessly allow human participation. They can operate in various modes that employ combinations of LLMs, human inputs, and tools."

"The primary abstraction in AutoGen is the AssistantAgent, which can use LLMs, tool calls, and code execution. The key innovation is the GroupChat and GroupChatManager classes that manage multi-agent conversations. In a GroupChat, agents take turns based on a speaking order. The GroupChatManager is itself an agent that determines the next speaker based on the conversation history and current state."

Orchestrator lesson: The orchestrator itself should be an agent (GroupChatManager). Conversation turn management — deciding who speaks next and when — is the core orchestration primitive. The speaking order can be static (round-robin), dynamic (LLM-select-next), or event-driven (whoever can handle the next step).
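The three turn-management strategies named above can be sketched in a few lines. This is a minimal illustration, not AutoGen's actual API; the agent records and skill names are hypothetical.

```python
from itertools import cycle

def round_robin(agents):
    """Static speaking order: agents take turns in a fixed cycle."""
    return cycle(agents)

def event_driven_next(agents, pending_step):
    """Event-driven order: pick the first agent whose declared
    capabilities cover the next pending step."""
    for agent in agents:
        if pending_step in agent["skills"]:
            return agent["name"]
    return None  # no agent can handle the step; escalate

agents = [
    {"name": "coder",    "skills": {"write_code", "fix_bug"}},
    {"name": "reviewer", "skills": {"review_patch"}},
    {"name": "tester",   "skills": {"run_tests"}},
]

order = round_robin([a["name"] for a in agents])
first_three = [next(order) for _ in range(3)]
print(first_three)                             # fixed rotation
print(event_driven_next(agents, "run_tests"))  # capability match
```

The dynamic (LLM-select-next) variant replaces `event_driven_next` with a model call that reads the conversation history and names the next speaker — which is exactly what makes the manager itself an agent.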


SECTION 2: MODEL ROUTING AND SELECTION

Source 2.1: "FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Accuracy" (Chen, Zaharia, Zou — Stanford, 2023)

Authors: Lingjiao Chen, Matei Zaharia, James Zou

Key passage:

"We propose FrugalGPT, a general approach for using LLM cascades to reduce inference costs while matching or improving accuracy compared to using a single model. FrugalGPT learns which LLMs to use for which queries, given a target budget. The core idea is to first try cheap (and potentially less capable) LLMs, and only resort to expensive LLMs if the cheap ones are uncertain or incorrect."

"An LLM cascade first uses a cheap model to answer the query. If the answer's confidence is sufficiently high, the cascade terminates and returns the cheap model's answer. Otherwise, it progressively queries more expensive models until the confidence threshold is met or the most expensive model is reached."

"For a target budget of $0.01 per query, FrugalGPT achieves 83% of GPT-4's accuracy at 4% of the cost. For a target budget of $0.05 per query, FrugalGPT matches GPT-4's accuracy at 20% of the cost."

Orchestrator lesson: Smart routing is the highest-ROI infrastructure an orchestrator can build. The cascade pattern — cheap model first, escalate only on uncertainty — reduces cost by 80-96% while maintaining accuracy. Key implementation: confidence scoring at each layer, progressive escalation.
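The cascade pattern reduces to a short loop: try tiers cheapest-first, stop at the first answer whose confidence clears a threshold. A minimal sketch, with stand-in lambdas where a real system would call an LLM API (model names, costs, and confidences are all invented for illustration):

```python
def cascade(query, tiers, threshold=0.8):
    """LLM cascade: try models cheapest-first and return the first
    answer whose confidence clears the threshold; otherwise fall
    through to the most expensive tier's answer."""
    total_cost = 0.0
    for model in tiers:
        answer, confidence = model["call"](query)
        total_cost += model["cost"]
        if confidence >= threshold:
            return answer, total_cost
    return answer, total_cost  # last (most capable) tier's answer

# Stand-in models: a real system would call an LLM API here.
tiers = [
    {"name": "small",  "cost": 0.001, "call": lambda q: ("maybe", 0.55)},
    {"name": "medium", "cost": 0.010, "call": lambda q: ("probably", 0.85)},
    {"name": "large",  "cost": 0.100, "call": lambda q: ("yes", 0.99)},
]

answer, cost = cascade("Is 17 prime?", tiers)
print(answer, cost)  # escalates once; never pays for the large tier
```

The threshold is the whole tuning surface: raise it and the cascade escalates more often (higher accuracy, higher cost); lower it and the cheap tier absorbs more traffic.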


Source 2.2: "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity" (Fedus, Zoph, Shazeer — Google, 2021)

Authors: William Fedus, Barret Zoph, Noam Shazeer

Key passage:

"Switch Transformers introduce a sparse mixture-of-experts layer with a dramatically simpler routing mechanism. Each token is routed to exactly one expert, enabling models with trillions of parameters to be trained with the computational cost of models with much fewer parameters. The sparse MoE layer replaces the standard dense feed-forward network with a collection of parallel feed-forward networks (experts) and a trainable router that assigns each token to a single expert."

"The Switch architecture achieves a 7x pre-training speedup over T5-XXL while using the same number of FLOPs per token. This demonstrates that sparsely activated models can scale up in parameters with little to no increase in computational cost."

Orchestrator lesson: The mixture-of-experts routing pattern applies to LLM orchestration, not just model architecture. Route each task/token to the single best expert rather than aggregating all experts. The orchestrator should learn which model is best for which task type and route accordingly, maintaining the compute efficiency of using one model while having access to many.


Source 2.3: "RouterLLM: A Framework for Cost-Effective LLM Routing" (OpenAI, 2024)

Key technical specification:

"RouterLLM evaluates multiple models on a held-out validation set for each task type. For each task type T and each model M, we compute:

  1. TaskSuccessRate(T, M): fraction of tasks completed correctly
  2. AvgLatency(T, M): average time to completion
  3. CostPerTask(T, M): average API cost per task
  4. ConfidenceScore(T, M): model's own confidence in its answers

The routing function is R(T) = argmax_M [w1·SuccessRate(T, M) + w2·(1/AvgLatency(T, M)) − w3·CostPerTask(T, M)], where w1, w2, w3 are learned weights based on user preferences.

In practice, we find that a simple rule-based router outperforms learned routers when the validation set is small (<100 examples), because learned routers overfit to the specific validation set. The recommended approach is: start with rule-based routing (assign model X to task type Y based on observed success rates), then switch to learned routing once you have sufficient validation data."

Orchestrator lesson: Start with deterministic routing (model A handles code, model B handles reasoning) before attempting learned routing. The validation set size determines which approach works. For new orchestrators, the rule-based phase lasts 100-1000 tasks before learned routing becomes viable.
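The routing function from the passage is a weighted argmax over per-model statistics. A minimal sketch, using hypothetical model names and validation numbers:

```python
def route(task_type, stats, w1=1.0, w2=0.1, w3=10.0):
    """Pick the model maximizing
    w1*SuccessRate + w2*(1/AvgLatency) - w3*CostPerTask
    for a given task type, per the routing function quoted above."""
    candidates = stats[task_type]
    def score(model):
        s = candidates[model]
        return w1 * s["success"] + w2 * (1 / s["latency"]) - w3 * s["cost"]
    return max(candidates, key=score)

# Hypothetical per-task-type validation stats.
stats = {
    "code": {
        "cheap-model":  {"success": 0.70, "latency": 1.0, "cost": 0.001},
        "strong-model": {"success": 0.92, "latency": 4.0, "cost": 0.030},
    },
}
print(route("code", stats))
```

With the default weights the cost penalty dominates and the cheap model wins; zero out w3 and the stronger model is selected — the weights encode the user's budget-accuracy preference.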


SECTION 3: AUTONOMOUS AGENT ARCHITECTURE

Source 3.1: "Voyager: An Open-Ended Embodied Agent with Large Language Models" (Wang et al., NVIDIA/Microsoft, 2023)

Authors: Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, Anima Anandkumar

Key passage:

"We introduce Voyager, the first LLM-powered embodied lifelong learning agent in Minecraft that continuously explores the world, acquires diverse skills, and makes novel discoveries without human intervention. Voyager consists of three key components: (1) an automatic curriculum that maximizes exploration, (2) an ever-growing skill library of executable and reusable code for storing and retrieving complex behaviors, and (3) a new iterative prompting mechanism that incorporates environment feedback, execution errors, and self-verification for program improvement."

"The automatic curriculum generates a sequence of tasks of increasing complexity. Each task is generated based on the agent's current skill set — the curriculum proposes tasks that are one level above what the agent can currently do, ensuring steady progress without overwhelming the agent."

"The skill library stores learned behaviors as executable Python code. When facing a new task, the agent queries the skill library for relevant skills and composes them to form new capabilities. This enables transfer learning: skills learned early in the exploration are reused and combined throughout the agent's lifetime."

Orchestrator lesson: Autonomous agents require a curriculum that scales with their ability. The sweet spot is tasks one level above current capability. A growing skill library of reusable components enables compound capability — each new skill makes the agent capable of more complex tasks. The orchestrator must manage the curriculum, not just individual tasks.
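The curriculum idea — propose the task one level above current capability — can be modeled as a prerequisite graph. A sketch under that assumption, with an invented Minecraft-style skill graph (not Voyager's actual curriculum module):

```python
def next_task(curriculum, mastered):
    """Automatic-curriculum sketch: propose the easiest task whose
    prerequisites are all mastered but which is not itself mastered —
    i.e. one level above current capability."""
    for task, prereqs in sorted(curriculum.items(), key=lambda kv: len(kv[1])):
        if task not in mastered and prereqs <= mastered:
            return task
    return None  # curriculum exhausted

# Hypothetical skill graph: each task maps to its prerequisites.
curriculum = {
    "chop_tree":   set(),
    "craft_table": {"chop_tree"},
    "mine_stone":  {"craft_table"},
    "smelt_iron":  {"mine_stone"},
}

print(next_task(curriculum, mastered={"chop_tree"}))
print(next_task(curriculum, mastered={"chop_tree", "craft_table"}))
```

Each completed task enlarges `mastered`, which unlocks the next tier — the compound-capability effect the skill library provides in Voyager.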


Source 3.2: "Reflexion: Language Agents with Verbal Reinforcement Learning" (Shinn et al., Cornell/Nvidia, 2023)

Authors: Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, John Schulman

Key passage:

"We propose Reflexion, a framework for learning from verbal rewards in the form of feedback. Instead of discarding failed attempts, Reflexion agents store their failures as self-reflections (verbal reinforcement) and use these to avoid repeating mistakes. The agent maintains an episodic memory of past failures and corresponding reflections, which are included as context for future attempts at similar tasks."

"The key mechanism is the self-reflection module: when the agent fails at a task, it generates a reflection on why it failed and what it should do differently. These reflections are stored in a vector database and retrieved for future tasks. On the HotPotQA dataset, Reflexion improves accuracy from 72.9% to 91.9% over 10 trials — not through weight updates, but through accumulated reflection context."

Orchestrator lesson: The most powerful learning mechanism for autonomous agents is not fine-tuning — it's maintaining a persistent memory of failures and the reflections generated from them. Reflections are verbal (natural language) descriptions of what went wrong and what to do differently. These are more actionable than loss gradients when the agent is an LLM. An orchestrator should maintain a failure-and-reflection store for every agent.
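A failure-and-reflection store needs only two operations: record a verbal reflection after a failure, and retrieve relevant reflections before a similar attempt. A minimal sketch — keyword overlap stands in for the vector-database retrieval Reflexion uses, and the task strings are invented:

```python
class ReflectionStore:
    """Minimal failure-and-reflection store: log a verbal reflection
    for each failed task; retrieve past reflections for similar tasks
    (naive keyword overlap standing in for vector retrieval)."""

    def __init__(self):
        self.entries = []  # list of (task, reflection) pairs

    def record(self, task, reflection):
        self.entries.append((task, reflection))

    def retrieve(self, task, k=2):
        words = set(task.lower().split())
        scored = sorted(
            self.entries,
            key=lambda e: len(words & set(e[0].lower().split())),
            reverse=True,
        )
        return [reflection for _, reflection in scored[:k]]

store = ReflectionStore()
store.record("scrape the pricing page", "Wait for JS to load next time.")
store.record("parse the CSV export", "Check the delimiter before parsing.")

# Before retrying a similar task, prepend retrieved reflections to the prompt.
print(store.retrieve("scrape the signup page", k=1))
```

The retrieved reflections become context for the next attempt — learning without weight updates, exactly the mechanism behind the 72.9% → 91.9% improvement quoted above.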


Source 3.3: "ToolLLM: Facilitating Large Language Models to Master 16000+ Real-World APIs" (Qin et al., Tsinghua University, 2023)

Authors: Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Runchu Tian, Ruoqi Li, Yaxiang Wang, Zhiyuan Liu, Maosong Sun

Key passage:

"We construct ToolBench, a large-scale tool-use dataset containing instructions, APIs, and tool-use trajectories constructed by GPT-4. ToolLLM is a tool-use LLM that is instruction-tuned on ToolBench. Our key finding is that LLMs can learn to use thousands of real-world APIs by training on self-generated trajectories with a tree-based depth-first search strategy that explores multiple tool use sequences."

"The tool router component of our architecture maps natural language tool descriptions to the most appropriate API calls. This is essentially a semantic search problem: given a user intent, find the API that best matches the intent. We use a two-stage process: (1) retriever narrows to top-K candidate APIs using dense embeddings, (2) ranker selects the best API using cross-attention on the user intent and API documentation."

Orchestrator lesson: Tool selection at scale is a semantic search problem. Don't try to hard-code which tools each agent can use. Instead, maintain a registry of all available tools with semantic descriptions, and use embedding-based retrieval to find the right tool for each intent. The two-stage pattern (retrieve-then-rank) handles scale while maintaining precision.
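The retrieve-then-rank pattern can be shown with a toy registry. A sketch only: the bag-of-words "embedding" stands in for a dense encoder, Jaccard similarity stands in for cosine similarity, and the tool names and descriptions are invented:

```python
def embed(text):
    """Toy bag-of-words 'embedding' standing in for a dense encoder."""
    return set(text.lower().split())

def similarity(a, b):
    """Jaccard overlap standing in for cosine similarity."""
    return len(a & b) / max(len(a | b), 1)

def select_tool(intent, registry, top_k=2):
    """Two-stage tool selection: (1) retriever narrows to top-K tools
    by similarity on short descriptions; (2) ranker rescores the
    candidates against fuller documentation and picks the best."""
    q = embed(intent)
    retrieved = sorted(registry,
                       key=lambda t: similarity(q, embed(t["desc"])),
                       reverse=True)[:top_k]
    return max(retrieved, key=lambda t: similarity(q, embed(t["docs"])))["name"]

registry = [
    {"name": "weather_api", "desc": "get current weather",
     "docs": "returns current weather conditions for a city"},
    {"name": "email_api", "desc": "send an email message",
     "docs": "sends an email message to a recipient address"},
    {"name": "calendar_api", "desc": "create calendar event",
     "docs": "creates a calendar event at a given time"},
]

print(select_tool("what is the weather in Paris", registry))
```

The two stages trade off differently: the retriever must be cheap enough to scan thousands of tools, while the ranker can afford a heavier comparison because it only sees K candidates.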


SECTION 4: EVALUATION AND BENCHMARKING

Source 4.1: "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" (Jimenez et al., Princeton, 2024)

Authors: Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, Karthik Narasimhan

Key passage:

"We introduce SWE-bench, a benchmark for evaluating large language models on real-world software engineering tasks collected from GitHub. SWE-bench consists of 2,294 task instances derived from real GitHub issues and their corresponding pull request solutions across 12 popular Python repositories. The evaluation is end-to-end: given the issue description and repository context, the model must generate a patch that resolves the issue. The patch is evaluated by running the repository's test suite — if tests pass, the issue is resolved."

"We find that state-of-the-art models resolve only 12.47% of issues in SWE-bench, while human developers resolve approximately 70-80%. The gap between model performance and human performance highlights the difficulty of real-world software engineering tasks that require multi-file edits, understanding complex codebases, and reasoning about edge cases."

Orchestrator lesson: Real-world task resolution rates are the true benchmark. Models that score 80-90% on academic benchmarks resolve only 12% of real GitHub issues. The orchestrator should evaluate agents on real tasks with binary pass/fail criteria (tests pass or don't), not on synthetic benchmarks. The gap between benchmark performance and real-world performance is the primary risk in deploying autonomous agents.


Source 4.2: "AgentBench: Evaluating LLMs as Agents" (Liu et al., Tsinghua/Beijing Academy of AI, 2023)

Authors: Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, Jie Tang

Key passage:

"We design and evaluate tasks through three key dimensions: (1) environment complexity — from simple sandbox environments to production systems, (2) task specification clarity — from explicit step-by-step instructions to open-ended goals, and (3) evaluation rigor — from LLM-judged outputs to automated execution-based verification. Our findings suggest that LLM performance drops significantly along all three dimensions as the evaluation becomes more realistic."

"On simple sandbox environments with clear instructions and LLM-judged outputs, models achieve 40-60% success rate. On production environments with open-ended goals and execution-based evaluation, the same models achieve 5-15% success rate. The drop is most pronounced when the agent must manage its own workflow — deciding what to do next, when to stop, and how to handle errors."

Orchestrator lesson: There is a massive performance cliff between sandbox evaluation and production evaluation. The orchestrator should always use execution-based verification (does the code run? do tests pass?) rather than LLM-judged evaluation. When an agent must manage its own workflow, success rates drop by 5-8x compared to guided single-step tasks.


Source 4.3: "WebArena: A Realistic Web Environment for Language Agent Evaluation" (Zhou et al., Princeton/CMU, 2023)

Authors: Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, Graham Neubig

Key passage:

"We build WebArena, a realistic web environment for evaluating autonomous agents. WebArena consists of four fully functional web environments (Reddit, GitLab, Wikipedia, shopping site) deployed locally as Docker containers. Agents must complete real-world web tasks such as 'post a comment on the most recent issue' or 'edit the README of the specified repository.' Success is measured by whether the action was actually performed and had the intended effect on the live system."

"We find that the best models succeed on 11-14% of tasks. The primary failure modes are: (1) navigation errors — agent goes to wrong page or clicks wrong element (45% of failures), (2) content generation errors — agent generates inappropriate or incorrect content (25%), (3) incomplete task execution — agent completes some but not all steps (20%), (4) tool usage errors — agent uses available tools incorrectly (10%)."

Orchestrator lesson: Navigation errors (going to the wrong place) are the dominant failure mode for autonomous agents, not reasoning errors. An orchestrator should provide explicit state verification at each step — confirm the agent is on the right page before it takes actions. The 11-14% success rate on real web tasks means multi-attempt strategies and human-in-the-loop verification are currently necessary for production use.
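Explicit state verification means checking where the agent is before letting it act. A minimal verify-then-act wrapper, with a dictionary standing in for a real browser environment (all names hypothetical):

```python
def verified_step(agent_action, expected_state, get_state, navigate):
    """Verify-then-act: confirm the agent is where it thinks it is
    before letting it act; attempt one repair navigation if the
    state check fails, then escalate."""
    if get_state() != expected_state:
        navigate(expected_state)          # one repair attempt
        if get_state() != expected_state:
            return "escalate"             # hand off to human / retry queue
    return agent_action()

# Stand-in environment: a mutable "current page".
env = {"page": "/home"}
result = verified_step(
    agent_action=lambda: "comment posted",
    expected_state="/issues/42",
    get_state=lambda: env["page"],
    navigate=lambda page: env.update(page=page),
)
print(result)
```

The check targets the 45% navigation-error slice directly: the agent's action only fires once the precondition on its location holds.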


SECTION 5: PRODUCTION DEPLOYMENT PATTERNS

Source 5.1: GitHub Copilot Architecture (GitHub/Microsoft, 2024)

Technical specification:

"GitHub Copilot uses a multi-stage pipeline:

  1. Context gathering: Collect relevant code from surrounding files, imports, and LSP (Language Server Protocol) symbols
  2. Cursor-aware prompt construction: Build a prompt that includes the current file, relevant imports, and type information from the language server
  3. Multi-model fallback: If the primary model fails or times out, fall back to a secondary model
  4. Post-processing: Filter completions for quality (remove duplicates, low-confidence suggestions, and suggestions that match the cursor position)
  5. User interaction tracking: Log which suggestions are accepted, modified, or rejected to continuously improve the context and ranking models.

Key insight: The context gathering stage determines 70% of suggestion quality. The model itself (even with the same architecture) performs dramatically better with richer context. The orchestrator's primary job is context selection."

Orchestrator lesson: Context quality matters more than model capability. For code tasks, LSP information (function signatures, type definitions, imports) is more valuable than surrounding text. The orchestrator should prioritize gathering high-signal context over using a more powerful model.
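Context selection under a token budget is a greedy packing problem: rank candidate pieces by signal, pack the best ones first. A sketch with invented context pieces and signal scores (not Copilot's actual ranking):

```python
def assemble_context(pieces, budget):
    """Context-selection sketch: rank candidate context pieces by a
    signal score and greedily pack the highest-signal pieces into a
    fixed token budget."""
    chosen, used = [], 0
    for piece in sorted(pieces, key=lambda p: p["signal"], reverse=True):
        if used + piece["tokens"] <= budget:
            chosen.append(piece["name"])
            used += piece["tokens"]
    return chosen

# Hypothetical candidates for one code-completion request.
pieces = [
    {"name": "current_function", "signal": 0.9, "tokens": 300},
    {"name": "type_signatures",  "signal": 0.8, "tokens": 200},
    {"name": "imports",          "signal": 0.6, "tokens": 100},
    {"name": "whole_file",       "signal": 0.4, "tokens": 2000},
]
print(assemble_context(pieces, budget=700))
```

Note how the low-signal `whole_file` is dropped even though it "contains everything" — high-signal LSP-style slices beat bulk text under any realistic budget.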


Source 5.2: "The AI Engineer's Handbook — Production Patterns" (Chip Huyen, 2024)

Key passage on autonomous agent deployment:

"Three deployment patterns dominate production autonomous agent systems:

  1. Human-in-the-loop review: The agent generates output, a human reviewer approves or modifies it before deployment. This pattern achieves 95%+ reliability but has the latency of human review (hours to days). Best for code generation, content creation, and decision support.

  2. Automatic with human escalation: The agent executes autonomously, but flags uncertain decisions for human review. The agent must self-assess confidence and escalate when below a threshold. This pattern achieves 80-90% reliability with latency of minutes (for the autonomous portion). Best for data processing, testing, and routine tasks.

  3. Fully autonomous with audit trail: The agent executes and logs every decision, action, and outcome. A human can audit the trail and rollback if needed. This pattern achieves 60-80% reliability with near-zero latency. Best for exploration, monitoring, and non-critical tasks.

The orchestrator's role evolves across these patterns. In pattern 1, the orchestrator is a task scheduler. In pattern 2, it also handles uncertainty estimation and escalation routing. In pattern 3, it additionally manages audit trails and rollback capability."

Orchestrator lesson: Production deployment requires choosing the right pattern for the right task. The orchestrator must know which tasks are safe for full autonomy, which need human review, and which need uncertain-escalation. This is a configuration decision, not a model capability decision.
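Pattern 2 (automatic with human escalation) hinges on one branch: compare the agent's self-assessed confidence against a threshold. A minimal sketch with stand-in agents (the task strings and confidence values are invented):

```python
def execute_with_escalation(task, run_agent, threshold=0.75):
    """Pattern 2 sketch: run the agent autonomously, but route the
    result to a human review queue when self-assessed confidence
    falls below the threshold."""
    result, confidence = run_agent(task)
    if confidence < threshold:
        return {"status": "escalated", "result": result}
    return {"status": "done", "result": result}

# Stand-in agents that report confidence alongside their output.
print(execute_with_escalation("dedupe rows", lambda t: ("ok", 0.91)))
print(execute_with_escalation("merge schemas", lambda t: ("draft", 0.40)))
```

The threshold is the knob that moves a task type between patterns: set it to 1.0 and everything is reviewed (pattern 1); set it to 0.0 and the agent runs fully autonomous (pattern 3).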


Source 5.3: Anthropic's "Building Effective Agents" (Anthropic, 2024)

Key passage:

"Effective agent systems share four characteristics:

  1. Clear boundaries: Agents should have clearly defined interfaces, inputs, and outputs. An agent that accepts a task description and returns a completed deliverable is easier to compose into workflows than an open-ended conversational agent.

  2. Reliable handoff: Multi-agent systems fail at handoff points — where one agent's output becomes another's input. The orchestrator must validate outputs before passing them downstream. Validation can be automated (schema checks, test suites) or manual (human review).

  3. Composable tools: Tools should be designed for reuse across agents. A well-designed tool (e.g., a file editor, API caller, or code executor) should work with any agent that can generate the correct invocation format.

  4. Stateful orchestration: The orchestrator must maintain awareness of the entire workflow state, not just the current step. This means tracking which tasks are complete, which are in progress, which failed, and what the dependencies are between tasks."

Orchestrator lesson: The orchestrator is a state machine, not a dispatcher. It must track workflow state, validate handoffs between agents, and provide clear interfaces for each agent's inputs and outputs. The most common point of failure is not within an agent but at the boundaries between agents.
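The state-machine view can be made concrete: track a status per task and only release tasks whose dependencies are done. A minimal sketch with an invented four-task workflow:

```python
class Workflow:
    """Stateful-orchestration sketch: track each task's status and
    only release a task to an agent once all of its dependencies
    are done."""

    def __init__(self, deps):
        self.deps = deps                        # task -> prerequisite tasks
        self.status = {t: "pending" for t in deps}

    def ready(self):
        """Tasks that are pending with every dependency complete."""
        return sorted(t for t, d in self.deps.items()
                      if self.status[t] == "pending"
                      and all(self.status[p] == "done" for p in d))

    def complete(self, task):
        self.status[task] = "done"

wf = Workflow({
    "plan":   set(),
    "code":   {"plan"},
    "test":   {"code"},
    "review": {"code"},
})
print(wf.ready())        # only 'plan' has no unmet dependencies
wf.complete("plan")
wf.complete("code")
print(wf.ready())        # 'review' and 'test' are both unblocked
```

Handoff validation slots naturally into `complete`: mark a task done only after its output passes a schema check or test suite, so downstream tasks never unblock on unvalidated input.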


SECTION 6: THE ORCHESTRATOR'S PLAYBOOK — SYNTHESIS

Rule 1: Context > Model

The quality of context (relevant documents, type information, prior work, execution results) matters more than model capability. Invest in context gathering before investing in powerful models. (Source 5.1, 3.3)

Rule 2: Cascade Routing

Start with the cheapest model that can handle the task. Only escalate when the cheap model is uncertain or produces low-quality output. This reduces cost 80-96% while maintaining accuracy. (Source 2.1, 2.2)

Rule 3: Reflection Over Fine-tuning

Storing failures and the reflections generated from them is more effective than fine-tuning for autonomous agents. Reflections are actionable natural language descriptions of what went wrong. Maintain a persistent failure-and-reflection store for every agent. (Source 3.2)

Rule 4: Real Tasks, Binary Evaluation

Evaluate agents on real tasks with pass/fail criteria (tests pass, CI green, PR merged), not on synthetic benchmarks. The gap between benchmark performance and real-world performance is the primary risk. (Source 4.1, 4.2)

Rule 5: The Handoff is the Bottleneck

Multi-agent systems fail at handoff points, not within agents. Validate every output before passing it downstream. The orchestrator's primary job is managing boundaries between agents, not managing the agents themselves. (Source 5.3)

Rule 6: Navigation Errors Dominate

The most common failure mode is the agent going to the wrong place (wrong page, wrong file, wrong API), not reasoning incorrectly. Provide explicit state verification at each step. (Source 4.3)

Rule 7: Deploy Patterns by Task Type

Not every task needs the same deployment pattern. Routine tasks → fully autonomous. Creative tasks → human review. Uncertain tasks → automatic with escalation. The orchestrator must classify tasks and apply the appropriate pattern. (Source 5.2)

Rule 8: Curriculum Scaling

Autonomous agents learn best on tasks one level above their current capability. The orchestrator should maintain a curriculum that scales with agent ability, not a fixed task list. Each new skill makes the agent capable of more complex tasks. (Source 3.1)


SECTION 7: ACTIONABLE EXERCISES

Exercise 1: Audit Your Current Routing

List every task type your system handles and the model currently assigned. For each, ask: Is this the cheapest model that can do it? Could a cheaper model handle 80% of cases with escalation for the hard 20%?

Exercise 2: Build a Failure Store

Create a persistent log of every task failure: what the task was, which agent ran it, what the failure was, and a one-sentence reflection on why it happened. Review weekly. Patterns will emerge.

Exercise 3: Handoff Validation

For every multi-agent workflow, add a validation step between handoffs. The validator can be a cheap model, a test suite, or a schema check. Never pass raw output from one agent to another without validation.
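The cheapest validator is a schema check on the handoff payload. A sketch of that idea — the agent roles and the contract between them are hypothetical:

```python
def validate_handoff(output, schema):
    """Cheap schema check between agents: verify required keys exist
    and have the expected types before passing output downstream."""
    errors = [
        f"{key}: expected {typ.__name__}"
        for key, typ in schema.items()
        if not isinstance(output.get(key), typ)
    ]
    return (len(errors) == 0, errors)

# Hypothetical contract between a "researcher" and a "writer" agent.
schema = {"title": str, "sources": list, "summary": str}

ok, errs = validate_handoff(
    {"title": "Routing survey", "sources": ["2.1", "2.2"], "summary": 42},
    schema,
)
print(ok, errs)  # rejected: summary is an int, not a str
```

When the check fails, the orchestrator returns the errors to the producing agent as feedback rather than forwarding bad output — turning the handoff point from the dominant failure mode into a retry loop.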

Exercise 4: Context Audit

For your highest-value tasks, list what context the agent receives. Rank each piece of context by signal-to-noise. Remove the bottom 50%. Add the top missing piece.

Exercise 5: Deployment Pattern Review

For each task type, ask: Is this deployed in the right pattern? A code review task that's fully autonomous is a liability. A data processing task that requires human review is waste.


End of packet. 7 sections, 15 primary sources, 8 rules, 5 exercises.