• Multi-agent collaboration is being used as a key AI design pattern for complex task management. This approach involves dividing a large task, such as software development, into smaller subtasks and assigning them to specialized agents, such as a software engineer, product manager, designer, and QA engineer. Each agent performs specific functions, potentially built using the same or different Large Language Models (LLMs).
  • The concept uses a programming abstraction similar to multi-threading in software development, where tasks are broken down to be handled more efficiently by different processors or threads.


  • The motivation for using multi-agent systems is threefold:
    1. Proven Effectiveness: Teams have reported positive results using this approach, and studies like those mentioned in the AutoGen paper have demonstrated that multi-agent systems can outperform single-agent systems in complex tasks.
    2. Optimized Task Handling: Despite the advancements in LLMs, such as the ability to process long input contexts (e.g., Gemini 1.5 Pro with 1 million tokens), focusing LLMs on specific, simpler tasks can yield better performance. This method allows developers to specify critical aspects of subtasks, improving the optimization of each component.
    3. Complex Task Decomposition: This design pattern provides a framework for developers to break down complex tasks into manageable subtasks, similar to traditional human resource management in companies. This not only simplifies the development process but also enhances the workflow and interaction among agents, who can have their own memory systems and engage in planning and tool use.


  • Despite their remarkable performance in various domains, quantifying and objectively evaluating LLM-based agents remain challenging. Several benchmarks have been designed to evaluate LLM agents. Some examples include:
    • AgentBench
    • IGLU
    • ClemBench
    • ToolBench
    • GentBench
    • MLAgentBench
  • Apart from task specific metrics, some dimensions in which agents can be evaluated include:
    • Utility: Focuses on task completion effectiveness and efficiency, with success rate and task outcomes being primary metrics.
    • Sociability: Includes language communication proficiency, cooperation, negotiation abilities, and role-playing capability.
    • Values: Ensures adherence to moral and ethical guidelines, honesty, harmlessness, and contextual appropriateness.
    • Ability to Evolve Continually: Considers continual learning, autotelic learning ability, and adaptability to new environments.
    • Adversarial Robustness: LLMs are susceptible to adversarial attacks, impacting their robustness. Traditional techniques like adversarial training are employed, along with human-in-the-loop supervision.
    • Trustworthiness: Calibration problems and biases in training data affect trustworthiness. Efforts are made to guide models to exhibit thought processes or explanations to enhance credibility.

Further Reading

AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation

  • This paper by Wu et al. from Microsoft Research, Pennsylvania State University, University of Washington, and Xidian University, introduces AutoGen, an open-source framework designed to facilitate the development of multi-agent large language model (LLM) applications. The framework allows the creation of customizable, conversable agents that can operate in various modes combining LLMs, human inputs, and tools.
  • AutoGen agents can be easily programmed using both natural language and computer code to define flexible conversation patterns for different applications. The framework supports hierarchical chat, joint chat, and other conversation patterns, enabling agents to converse and cooperate to solve tasks. The agents can hold multiple-turn conversations with other agents or solicit human inputs, enhancing their ability to solve complex tasks.
  • The figure below from the paper illustrates the fact that AutoGen enables diverse LLM-based applications using multi-agent conversations. (Left) AutoGen agents are conversable, customizable, and can be based on LLMs, tools, humans, or even a combination of them. (Top-middle) Agents can converse to solve tasks. (Right) They can form a chat, potentially with humans in the loop. (Bottom-middle) The framework supports flexible conversation patterns.

  • Key technical details include the design of conversable agents and conversation programming. Conversable agents can send and receive messages, maintain internal context, and be configured with various capabilities such as LLMs, human inputs, and tools. These agents can also be extended to include more custom behaviors. Conversation programming involves defining agent roles and capabilities and programming their interactions using a combination of natural and programming languages. This approach simplifies complex workflows into intuitive multi-agent conversations.
  • Implementation details:
    1. Conversable Agents: AutoGen provides a generic design for agents, enabling them to leverage LLMs, human inputs, tools, or a combination. The agents can autonomously hold conversations and solicit human inputs at certain stages. Developers can easily create specialized agents with different roles by configuring built-in capabilities and extending agent backends.
    2. Conversation Programming: AutoGen adopts a conversation programming paradigm to streamline LLM application workflows. This involves defining conversable agents and programming their interactions via conversation-centric computation and control. The framework supports various conversation patterns, including static and dynamic flows, allowing for flexible agent interactions.
    3. Unified Interfaces and Auto-Reply Mechanisms: Agents in AutoGen have unified interfaces for sending, receiving, and generating replies. An auto-reply mechanism enables conversation-driven control, where agents automatically generate and send replies based on received messages unless a termination condition is met. Custom reply functions can also be registered to define specific behavior patterns.
    4. Control Flow: AutoGen allows control over conversations using both natural language and programming languages. Natural language prompts guide LLM-backed agents, while Python code specifies conditions for human input, tool execution, and termination. This flexibility supports diverse multi-agent conversation patterns, including dynamic group chats managed by the GroupChatManager class.
  • The figure below from the paper illustrates how to use AutoGen to program a multi-agent conversation. The top subfigure illustrates the built-in agents provided by AutoGen, which have unified conversation interfaces and can be customized. The middle sub-figure shows an example of using AutoGen to develop a two-agent system with a custom reply function. The bottom sub-figure illustrates the resulting automated agent chat from the two-agent system during program execution.

  • The paper details the framework’s architecture, where agents are defined with specific roles and capabilities, interacting through structured conversations to process tasks efficiently. This approach improves task performance, reduces development effort, and enhances application flexibility. The significant technical aspects include using a unified interface for agent interaction, conversation-centric computation for defining agent behaviors, and conversation-driven control flows that manage the sequence of interactions among agents.
  • Applications demonstrate AutoGen’s capabilities in various domains, such as:
    • Math Problem Solving: AutoGen builds systems for autonomous and human-in-the-loop math problem solving, outperforming other approaches on the MATH dataset.
    • Retrieval-Augmented Code Generation and Question Answering: The framework enhances retrieval-augmented generation systems, improving performance on question-answering tasks through interactive retrieval mechanisms.
    • Decision Making in Text World Environments: AutoGen implements effective interactive decision-making applications using benchmarks like ALFWorld.
    • Multi-Agent Coding: The framework simplifies coding tasks by dividing responsibilities among agents, improving code safety and efficiency.
    • Dynamic Group Chat: AutoGen supports dynamic group chats, enabling collaborative problem-solving without predefined communication orders.
    • Conversational Chess: The framework creates engaging chess games with natural language interfaces, ensuring valid moves through a board agent.
  • The empirical results indicate that AutoGen significantly outperforms existing single-agent and some multi-agent systems in complex task environments by effectively integrating and managing multiple agents’ capabilities. The paper includes a figure illustrating the use of AutoGen to program a multi-agent conversation, showing built-in agents, a two-agent system with a custom reply function, and the resulting automated agent chat.
  • The authors highlight the potential for AutoGen to improve LLM applications by reducing development effort, enhancing performance, and enabling innovative uses of LLMs. Future work will explore optimal multi-agent workflows, agent capabilities, scaling, safety, and human involvement in multi-agent conversations. The open-source library invites contributions from the broader community to further develop and refine AutoGen.



If you found our work useful, please cite it as:

  title   = {Agents},
  author  = {Chadha, Aman and Jain, Vinija},
  journal = {Distilled AI},
  year    = {2020},
  note    = {\url{}}