• Multi-agent collaboration has emerged as a key AI design pattern for managing complex tasks. The approach divides a large task, such as software development, into smaller subtasks and assigns each to a specialized agent, such as a software engineer, product manager, designer, or QA engineer. Each agent performs a specific function and may be built on the same or different Large Language Models (LLMs).
  • The concept uses a programming abstraction similar to multi-threading in software development, where tasks are broken down to be handled more efficiently by different processors or threads.


  • The motivation for using multi-agent systems is threefold:
    1. Proven Effectiveness: Teams have reported positive results using this approach, and studies like those mentioned in the AutoGen paper have demonstrated that multi-agent systems can outperform single-agent systems in complex tasks.
    2. Optimized Task Handling: Despite the advancements in LLMs, such as the ability to process long input contexts (e.g., Gemini 1.5 Pro with 1 million tokens), focusing LLMs on specific, simpler tasks can yield better performance. This method allows developers to specify critical aspects of subtasks, improving the optimization of each component.
    3. Complex Task Decomposition: This design pattern provides a framework for developers to break down complex tasks into manageable subtasks, similar to traditional human resource management in companies. This not only simplifies the development process but also enhances the workflow and interaction among agents, who can have their own memory systems and engage in planning and tool use.
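The decomposition described above can be sketched as a simple sequential pipeline of role-specialized agents. This is a minimal illustration, not any particular framework's API: `call_llm` is a hypothetical stand-in for a real LLM call (stubbed here so the orchestration logic runs on its own), and the role prompts are invented examples.

```python
# Minimal sketch of the multi-agent collaboration pattern:
# each role gets its own system prompt, and the orchestrator
# feeds every agent the previous agent's output.

ROLES = {
    "product_manager": "Turn the user request into a short requirements list.",
    "software_engineer": "Write code that satisfies the requirements.",
    "qa_engineer": "Review the code and report pass/fail.",
}

def call_llm(system_prompt: str, user_message: str) -> str:
    # Stub: a real implementation would call an LLM API here,
    # possibly a different model per role.
    return f"[{system_prompt[:20]}...] handled: {user_message[:40]}"

def run_pipeline(task: str) -> dict:
    """Route one task through each specialized agent in sequence."""
    transcript = {}
    message = task
    for role, system_prompt in ROLES.items():
        message = call_llm(system_prompt, message)
        transcript[role] = message  # keep each agent's output for inspection
    return transcript

result = run_pipeline("Build a CLI that converts CSV to JSON")
for role, output in result.items():
    print(role, "->", output)
```

Real frameworks (e.g., AutoGen) generalize this beyond a fixed sequence to conversational turn-taking between agents, but the core idea — specialized prompts plus message passing — is the same.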


  • Despite their remarkable performance in various domains, quantifying and objectively evaluating LLM-based agents remain challenging. Several benchmarks have been designed to evaluate LLM agents. Some examples include:

    • AgentBench
    • IGLU
    • ClemBench
    • ToolBench
    • GentBench
    • MLAgentBench

  • Apart from task-specific metrics, agents can be evaluated along several dimensions:
    • Utility: Focuses on task-completion effectiveness and efficiency, with success rate and task outcomes being the primary metrics.
    • Sociability: Includes language communication proficiency, cooperation, negotiation abilities, and role-playing capability.
    • Values: Ensures adherence to moral and ethical guidelines, honesty, harmlessness, and contextual appropriateness.
    • Ability to Evolve Continually: Considers continual learning, autotelic learning ability, and adaptability to new environments.
    • Adversarial Robustness: LLMs are susceptible to adversarial attacks, impacting their robustness. Traditional techniques like adversarial training are employed, along with human-in-the-loop supervision.
    • Trustworthiness: Calibration problems and biases in training data affect trustworthiness. Efforts are made to guide models to exhibit thought processes or explanations to enhance credibility.
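As a concrete illustration of the utility dimension, the sketch below computes two of the metrics named above — success rate and an efficiency proxy — over a set of task outcomes. The `TaskOutcome` record and the sample runs are invented for illustration; real harnesses such as AgentBench define their own outcome schemas.

```python
# Hedged sketch: simple utility metrics (success rate, efficiency)
# computed from hypothetical agent task-outcome records.
from dataclasses import dataclass

@dataclass
class TaskOutcome:
    task_id: str
    success: bool
    steps_taken: int

def success_rate(outcomes: list[TaskOutcome]) -> float:
    """Fraction of tasks the agent completed successfully."""
    if not outcomes:
        return 0.0
    return sum(o.success for o in outcomes) / len(outcomes)

def mean_steps(outcomes: list[TaskOutcome]) -> float:
    """Average steps on successful tasks, as an efficiency proxy."""
    completed = [o.steps_taken for o in outcomes if o.success]
    return sum(completed) / len(completed) if completed else float("nan")

runs = [
    TaskOutcome("t1", True, 4),
    TaskOutcome("t2", False, 9),
    TaskOutcome("t3", True, 6),
]
print(success_rate(runs))  # fraction of successful runs
print(mean_steps(runs))    # mean steps among successes
```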

If you found our work useful, please cite it as:

@article{Chadha2020DistilledAgents,
  title   = {Agents},
  author  = {Chadha, Aman and Jain, Vinija},
  journal = {Distilled AI},
  year    = {2020},
  note    = {\url{}}
}