Overview

  • AI agents are autonomous systems designed to perform tasks by making decisions based on their environment and inputs. These decisions are typically made using AI techniques such as machine learning and natural language processing, and can incorporate multiple modalities.
  • AI agents can be both proactive and reactive, meaning they can initiate actions on their own and respond to changes in their environment. Their functionality is often complex and involves a degree of learning or adaptation to new situations.
  • The tasks an agent performs are determined by the agent itself, based on the data it gathers and processes, which makes AI agents valuable tools for efficiency and automation across many sectors.
  • AI agents distinguish themselves from ordinary software by their ability to make rational decisions. They process data received from their environments, whether through physical sensors or digital inputs, and use this information to predict and execute actions that align with set goals. This could range from a chatbot handling customer inquiries to a self-driving car navigating obstacles on the road.

“While there isn’t a widely accepted definition for LLM-powered agents, they can be described as a system that can use an LLM to reason through a problem, create a plan to solve the problem, and execute the plan with the help of a set of tools.” (source)

Core Components of AI Agents

  • The image above (source) simplifies the architecture of a traditional end-to-end agent pipeline.
  • Let’s dive deeper into each component of AI agents to understand their structure and functionality at a more detailed, technical level.

Agent Core (LLM)

  • Decision-Making Engine: Analyzes data from memory and inputs to make informed decisions.
  • Goal Management System: Maintains and updates the goals of the AI agent.
  • Integration Bus: Facilitates communication between memory modules, planning module, and tools.
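  • To make these responsibilities concrete, below is a minimal structural sketch of an agent core; the class and method names (AgentCore, step, recall, plan) are illustrative assumptions rather than a standard API, and the memory, planner, and tool objects stand for the components described in the following subsections.

# Structural skeleton only: memory, planner, and tools are collaborator objects
# (hypothetical interfaces) injected from the modules described below.
class AgentCore:
    def __init__(self, llm, memory, planner, tools):
        self.llm = llm          # decision-making engine: a callable that wraps the LLM
        self.memory = memory    # memory modules (short- and long-term)
        self.planner = planner  # planning module
        self.tools = tools      # integration bus: maps tool names to callables
        self.goals = []         # goal management: active goals for the agent

    def set_goal(self, goal):
        self.goals.append(goal)

    def step(self, user_input):
        # Decision-making: combine the input with recalled context and active goals
        context = self.memory.recall(user_input)
        plan = self.planner.plan(user_input, context, self.goals)
        results = []
        for action in plan:
            # Integration bus: route each planned action to the matching tool
            results.append(self.tools[action["tool"]](**action["arguments"]))
        self.memory.store(user_input, results)
        # Let the LLM turn raw tool results into a user-facing response
        return self.llm(f"Summarize these results for the user: {results}")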

Memory Modules

  • Short-term Memory (STM):
    • Data Structure: Implemented using stacks, queues, or temporary databases for fast access and modification.
    • Volatility: Data in STM is transient and systematically cleared to free up space and processing power.
    • Functionality: Crucial for tasks requiring immediate but temporary recall.
  • Long-term Memory (LTM):
    • Data Storage: More permanent data storage solutions ensure data persistence.
    • Indexing and Retrieval Systems: Sophisticated indexing mechanisms facilitate quick retrieval of relevant information.
    • Learning and Updating Mechanisms: Updates stored data based on new information and learning outcomes.
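  • As a simplified illustration of the two memory tiers, the sketch below models STM as a small bounded buffer and LTM as an indexed key-value store; real agents typically back LTM with a database or vector index, and all names here are hypothetical.

from collections import deque

class MemoryModule:
    """Toy memory module: a bounded short-term buffer plus an indexed long-term store."""

    def __init__(self, stm_capacity=10):
        self.stm = deque(maxlen=stm_capacity)  # transient; old items are evicted automatically
        self.ltm = {}                          # persistent within the process; keyed for fast retrieval

    def store(self, key, value, long_term=False):
        self.stm.append((key, value))          # everything passes through short-term memory
        if long_term:
            self.ltm[key] = value              # promote selected items to long-term memory

    def recall(self, key):
        # Check recent context first, then fall back to the long-term index
        for k, v in reversed(self.stm):
            if k == key:
                return v
        return self.ltm.get(key)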

Tools

  • Executable Workflows: Scripted actions or processes defined in a high-level language for specific tasks.
  • APIs: External and internal APIs for secure and efficient communication and modular design.
  • Middleware: Bridges the agent’s core logic and tools, handling data formatting, error handling, and security checks.
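  • The sketch below shows one possible way to put tools behind a thin middleware layer that validates arguments before dispatching calls; the ToolRegistry design and its method names are assumptions for illustration only.

class ToolRegistry:
    """Hypothetical integration layer: maps tool names to callables with basic checks."""

    def __init__(self):
        self.tools = {}

    def register(self, name, func, required_args=()):
        self.tools[name] = (func, tuple(required_args))

    def call(self, name, **kwargs):
        if name not in self.tools:
            raise KeyError(f"Unknown tool: {name}")
        func, required = self.tools[name]
        # Middleware duties: argument validation (error handling, security checks,
        # and data formatting would slot in here as well)
        missing = [arg for arg in required if arg not in kwargs]
        if missing:
            raise ValueError(f"Missing arguments for {name}: {missing}")
        return func(**kwargs)

# Usage: register an API wrapper and invoke it through the registry
registry = ToolRegistry()
registry.register("weather", lambda city: f"Sunny in {city}", required_args=["city"])
print(registry.call("weather", city="Paris"))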

Example Flow Chart for an LLM Agent: Handling a Customer Inquiry

  • The image above (source) shows an example of an AI agent flow; a minimal code sketch of the same flow follows the numbered steps below.
  1. Customer Interaction
    • Input: “Is the new XYZ smartphone available, and what are its features?”
    • Action: Customer types the query into the e-commerce platform’s chat interface.
  2. Query Reception and Parsing
    • Agent Core Reception: Receive text input.
    • Natural Language Understanding: Parse the text to extract intent and relevant entities.
  3. Intent Classification and Information Retrieval
    • Intent Classification: Classify the query intent.
    • Memory Access: Retrieve stored data on product inventory and specifications.
    • External API Calls: Fetch additional data if not available in memory.
  4. Data Processing and Response Planning
    • Planning Module: Split the query into “check availability” and “retrieve features”.
    • Data Synthesis: Combine information from memory.
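  • The toy sketch below strings these four steps together in plain Python so the flow is easy to follow; the rule-based parser, the in-memory product catalog, and all helper names are invented for illustration, and a real agent would delegate parsing, planning, and synthesis to an LLM.

PRODUCT_CATALOG = {  # stand-in for the memory module / inventory database
    "XYZ smartphone": {"available": True, "features": ["5G", "OLED display", "128 GB storage"]},
}

def parse_query(text):
    # Toy natural-language understanding: extract intent(s) and the product entity
    intents = []
    if "available" in text.lower():
        intents.append("check_availability")
    if "feature" in text.lower():
        intents.append("retrieve_features")
    product = next((name for name in PRODUCT_CATALOG if name.lower() in text.lower()), None)
    return intents, product

def handle_inquiry(text):
    intents, product = parse_query(text)
    record = PRODUCT_CATALOG.get(product)        # memory access (an external API call if missing)
    if record is None:
        return "Sorry, I couldn't find that product."
    parts = []                                   # data synthesis: combine the sub-answers
    if "check_availability" in intents:
        parts.append(f"{product} is {'in stock' if record['available'] else 'out of stock'}")
    if "retrieve_features" in intents:
        parts.append("features include " + ", ".join(record["features"]))
    return "; ".join(parts) + "."

print(handle_inquiry("Is the new XYZ smartphone available, and what are its features?"))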

Use Cases

  • Let’s look at a few agent use cases below:

Data Agent for Data Analysis

  • The image above (source) explains the flow we will use below.
  1. Identify the Use Case:
    • Define specific data analysis tasks, such as querying databases or analyzing financial reports.
  2. Select the Appropriate LLM:
    • Choose an LLM that handles the complexity of data queries and analysis.
  3. Agent Components:
    • Develop the agent with tools for data handling, a memory module for tracking interactions, and a planning module for strategic execution of tasks.
  4. Design the Data Interaction Tools:
    • Implement tools for interacting with databases or other data sources.

Tools Setup

class SQLExecutor:
    def __init__(self, database_url):
        self.database_url = database_url

    def execute_query(self, query):
        # Stub: a real implementation would run the query against database_url
        # and return the result set.
        print(f"Executing SQL query: {query}")
        return "Query results"

class Calculator:
    @staticmethod
    def perform_calculation(data):
        # Stub: a real implementation would compute the required statistics here.
        print(f"Performing calculation on data: {data}")
        return "Calculation results"

Agent Core Logic

class DataAgent:
    def __init__(self, sql_executor, calculator):
        self.sql_executor = sql_executor
        self.calculator = calculator
        self.memory = []

    def analyze_data(self, query, calculation_needed=True):
        results = self.sql_executor.execute_query(query)
        self.memory.append(results)

        if calculation_needed:
            calculation_results = self.calculator.perform_calculation(results)
            self.memory.append(calculation_results)
            return calculation_results
        
        return results

database_url = "your_database_url_here"
sql_executor = SQLExecutor(database_url)
calculator = Calculator()

agent = DataAgent(sql_executor, calculator)
query = "SELECT * FROM sales_data WHERE year = 2021"
print(agent.analyze_data(query))

LLM-Powered API Agent for Task Execution

  1. Choose an LLM:
    • Select a suitable LLM for handling task execution.
  2. Select a Use Case:
    • Define the tasks the agent will execute.
  3. Build the Agent:
    • Develop the components required for the API agent: tools, planning module, and agent core.
  4. Define API Functions:
    • Create classes for each API call to the models.

Python Code Example

# The classes below are stub wrappers around external generative-model APIs;
# a real implementation would replace their bodies with actual API requests.
class ImageGenerator:
    def __init__(self, api_key):
        self.api_key = api_key

    def generate_image(self, description, negative_prompt=""):
        print(f"Generating image with description: {description}")
        return "Image URL or data"

class TextGenerator:
    def __init__(self, api_key):
        self.api_key = api_key

    def generate_text(self, text_prompt):
        print(f"Generating text with prompt: {text_prompt}")
        return "Generated text"

class CodeGenerator:
    def __init__(self, api_key):
        self.api_key = api_key

    def generate_code(self, problem_description):
        print(f"Generating code for: {problem_description}")
        return "Generated code"

Plan-and-Execute Approach

def plan_and_execute(question):
    if 'marketing' in question:
        plan = [
            {
                "function": "ImageGenerator",
                "arguments": {
                    "description": "A bright and clean laundry room with a large bottle of WishyWash detergent, featuring the new UltraClean formula and softener, placed prominently.",
                    "negative_prompt": "No clutter, no other brands, only WishyWash."
                }
            },
            {
                "function": "TextGenerator",
                "arguments": {
                    "text_prompt": "Compose a tweet to promote the new WishyWash detergent with the UltraClean formula and softener at $4.99. Highlight its benefits and competitive pricing."
                }
            },
            {
                "function": "TextGenerator",
                "arguments": {
                    "text_prompt": "Generate ideas for marketing campaigns to increase WishyWash detergent sales, focusing on the new UltraClean formula and softener."
                }
            }
        ]
        return plan
    else:
        # No matching plan for this question; return an empty list so execute_plan runs nothing
        return []

def execute_plan(plan):
    results = []
    for step in plan:
        if step["function"] == "ImageGenerator":
            generator = ImageGenerator(api_key="your_api_key")
            result = generator.generate_image(**step["arguments"])
            results.append(result)
        elif step["function"] == "TextGenerator":
            generator = TextGenerator(api_key="your_api_key")
            result = generator.generate_text(**step["arguments"])
            results.append(result)
        elif step["function"] == "CodeGenerator":
            generator = CodeGenerator(api_key="your_api_key")
            result = generator.generate_code(**step["arguments"])
            results.append(result)
    return results

question = "How can we create a marketing campaign for our new detergent?"
plan = plan_and_execute(question)
results = execute_plan(plan)
for result in results:
    print(result)

Build your own LLM Agent

  • Here’s a detailed walkthrough, with Python code examples, of building a question-answering LLM agent as outlined in the NVIDIA blog:
  1. Set Up the Agent’s Components:
    • Tools: Include tools like a Retrieval-Augmented Generation (RAG) pipeline and mathematical tools necessary for data analysis.
    • Planning Module: A module to decompose complex questions into simpler parts for easier processing.
    • Memory Module: A system to track and remember previous interactions and solutions.
    • Agent Core: The central processing unit of the agent that uses the other components to solve user queries.
  2. Python Code Example for the Memory Module:
    class Ledger:
        def __init__(self):
            self.question_trace = []
            self.answer_trace = []
    
        def add_question(self, question):
            self.question_trace.append(question)
    
        def add_answer(self, answer):
            self.answer_trace.append(answer)
    
  3. Python Code Example for the Agent Core:
    • This part of the code defines how the agent processes questions, interacts with the planning module, and retrieves or computes answers.
      def agent_core(question, context):
        # Assume helper functions LLM (the model call) and RAG_Pipeline (retrieval) are defined
        action = LLM(context + question)

        if action == "Decomposition":
            # Break the question into sub-questions, answer each, and fold the answers into the context
            sub_questions = LLM(question)
            for sub_question in sub_questions:
                context += agent_core(sub_question, context)
            return LLM(context)
        elif action == "Search Tool":
            # Retrieve supporting information, add it to the context, and decide again
            answer = RAG_Pipeline(question)
            context += answer
            return agent_core(question, context)
        elif action == "Generate Final Answer":
            return LLM(context)
        elif action == "<Another Tool>":
            # Execute another specific tool
            pass
      
  4. Execution Flow:
    • The agent receives a question, and based on the context and internal logic, decides if it needs to decompose the question, search for information, or directly generate an answer.
    • The agent can recursively handle sub-questions until a final answer is generated.
  5. Using the Components Together:
    • All the components are used in tandem to manage the flow of data and information processing within the agent. The memory module keeps track of all queries and responses, which aids in contextual understanding for the agent.
  6. Deploying and Testing the Agent:
    • Once all components are integrated, the agent is tested with sample queries to ensure it functions correctly and efficiently handles real-world questions.
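  • Before wiring in a real model, the control flow above can be exercised end to end with stand-in implementations of LLM and RAG_Pipeline; the toy stubs below exist only to drive the agent core and Ledger defined earlier through one search-then-answer cycle.

def RAG_Pipeline(question):
    # Stand-in retriever: pretend the answer was looked up in a document store
    return " [retrieved] Paris is the capital of France."

def LLM(prompt):
    # Toy stand-in for the model: pick an action until the context holds retrieved
    # text, then emit a canned answer when asked to generate from the context alone.
    if "[retrieved]" not in prompt:
        return "Search Tool"
    if prompt.rstrip().endswith("?"):
        return "Generate Final Answer"  # action-selection call (context + question)
    return "Paris"                       # answer-generation call (context only)

ledger = Ledger()
question = "What is the capital of France?"
ledger.add_question(question)
answer = agent_core(question, context="")
ledger.add_answer(answer)
print(answer)  # -> Paris (with these stubs)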

Function/Tool Calling

  • This section focuses on evaluating function- or tool-calling capabilities across scenarios that use Python as well as other programming languages. The evaluation categories test different aspects of model performance, including how well a model handles various types of function calls and whether it can detect when a function is (or is not) relevant. A key question is whether models can invoke functions accurately from user input and discern when a conversational context calls for a function at all.
  • Based on the Berkeley Function-Calling Leaderboard, the evaluation framework can be hierarchically organized by function type (simple, multiple, parallel, parallel multiple) and evaluation method (AST or execution). This categorization helps in comparing model performances on standard function-calling scenarios and assessing their accuracy and efficiency. By structuring the evaluation in this way, it provides a comprehensive view of how well the model performs across different types of function calls and under varying conditions.

Python Evaluation

  • The Python evaluation categories, listed below, assess the model’s ability to handle single and multiple function calls, both sequentially and in parallel. These tests simulate realistic scenarios where the model must interpret user queries, select appropriate functions, and execute them accurately, mimicking real-world applications. By testing these different scenarios, the evaluation can highlight the model’s proficiency in using Python-based function calls under varying degrees of complexity and concurrency.
  1. Simple Function: In this category, the evaluation involves a single, straightforward function call. The user provides a JSON function document, and the model is expected to invoke only one function call. This test examines the model’s ability to handle the most common and basic type of function call correctly (an illustrative test case is sketched after this list).

  2. Parallel Function: This evaluation scenario requires the model to make multiple function calls in parallel in response to a single user query. The model must identify how many function calls are necessary and initiate them simultaneously, regardless of the complexity or length of the user query.

  3. Multiple Function: This category involves scenarios where the user input can be matched to one function call out of two to four available JSON function documentations. The model must accurately select the most appropriate function to call based on the given context.

  4. Parallel Multiple Function: This is a complex evaluation combining both parallel and multiple function categories. The model is presented with multiple function documentations, and each relevant function may need to be invoked zero or more times in parallel.

  • Each Python evaluation category includes both Abstract Syntax Tree (AST) and executable evaluations. Executable evaluations simulate real-world scenarios, utilizing functions inspired by REST APIs and computational tasks to test the model’s stability and accuracy in function call generation.
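  • To make these categories concrete, a hypothetical “simple function” test case might pair one JSON function document with a user query and an acceptable call, as sketched below; the schema mirrors the general shape of such benchmarks rather than any exact file format.

# Hypothetical "simple function" test case: one function doc, one expected call.
function_doc = {
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name"},
            "units": {"type": "string", "enum": ["metric", "imperial"]},
        },
        "required": ["city"],
    },
}

user_query = "What's the weather in Paris in celsius?"

# One acceptable model output for this test case, expressed as a call string:
expected_call = "get_weather(city='Paris', units='metric')"

  • A parallel-function case would instead expect several such calls (e.g., the same weather lookup for multiple cities) from a single query, while a multiple-function case would provide two to four candidate function documents and expect only the relevant one to be invoked.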

Non-Python Evaluation

  • The non-Python evaluation categories, listed below, test the model’s ability to handle diverse scenarios involving conversation, relevance detection, and the use of different programming languages and technologies. These evaluations provide insights into the model’s adaptability to various contexts beyond Python. By including these diverse categories, the evaluation aims to ensure that the model is versatile and capable of handling various use cases, making it applicable in a broad range of applications.
  1. Chatting Capability: This category evaluates the model’s general conversational abilities without invoking functions. The goal is to see if the model can maintain coherent dialogue and recognize when function calls are unnecessary. This is distinct from function relevance detection, which involves determining the suitability of invoking any provided functions.

  2. Function Relevance Detection: This tests whether the model can discern when none of the provided functions are relevant. The ideal outcome is that the model refrains from making any function calls, demonstrating an understanding of when it lacks the required function information or user instruction.

  3. REST API: This evaluation focuses on the model’s ability to generate and execute realistic REST API calls using Python’s requests library. It tests the model’s understanding of GET requests, including path and query parameters, and its ability to generate calls that match real-world API documentation (an illustrative call is shown after this list).

  4. SQL: This category assesses the model’s ability to construct simple SQL queries using custom sql.execute functions. The evaluation is limited to basic SQL operations like SELECT, INSERT, UPDATE, DELETE, and CREATE, testing whether the model can generalize function-calling capabilities beyond Python.

  5. Java + JavaScript: Despite the uniformity in function-calling formats across languages, this evaluation examines how well the model adapts to language-specific types and syntax, such as Java’s HashMap. It includes examples that test the model’s handling of Java and JavaScript, emphasizing the need for language-specific adaptations.
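  • For instance, the REST API category expects the model to turn a request such as “What’s the 3-day forecast for Paris in metric units?” into a call like the one below, built with the requests library; the endpoint and its parameters are fabricated for illustration.

import requests

# Hypothetical documented endpoint: GET /v1/forecast/{city}?days=<n>&units=<units>
response = requests.get(
    "https://api.example-weather.com/v1/forecast/Paris",  # path parameter filled in from the query
    params={"days": 3, "units": "metric"},                # query parameters
    timeout=10,
)
print(response.status_code, response.json())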

Evaluation Metrics

  • Two primary metrics are used to evaluate model performance:
  1. Abstract Syntax Tree (AST) Evaluation: AST evaluation involves parsing the model-generated function calls to check their structure against expected outputs. It verifies the function name, parameter presence, and type correctness. AST evaluation is ideal for cases where execution isn’t feasible due to language constraints or when the result cannot be easily executed.

    • Simple Function AST Evaluation
      • The AST evaluation process focuses on comparing a single model output function against its function doc and possible answers. Here is a flow chart (source) that shows the step-by-step evaluation process.
    • Multiple/Parallel/Parallel-Multiple Functions AST Evaluation
      • The multiple, parallel, or parallel-multiple function AST evaluation process extends the idea in the simple function evaluation to support multiple model outputs and possible answers.
        • The evaluation process first associates each possible answer with its function doc. Then it iterates over the model outputs and calls the simple function evaluation on each function (which takes in one model output, one possible answer, and one function doc).
          • The order of model outputs relative to possible answers is not required. A model output can match with any possible answer.
      • The evaluation employs an all-or-nothing approach: if any possible answer cannot be matched to one of the model outputs, the entire test case fails (a simplified sketch of the per-call AST check appears at the end of this section).
  2. Executable Function Evaluation: This metric assesses the model by executing the function calls it generates and comparing the outputs against expected results. This evaluation is crucial for testing real-world applicability, focusing on whether the function calls run successfully, produce the correct types of responses, and maintain structural consistency in their outputs.

  • The combination of AST and executable evaluations ensures a comprehensive assessment, providing insights into both the syntactic and functional correctness of the model’s output.
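  • As a rough illustration of the AST-based check (not the leaderboard’s actual implementation), Python’s ast module can parse a generated call and compare its function name and keyword arguments against a possible answer:

import ast

def ast_match(model_output: str, expected_name: str, possible_args: dict) -> bool:
    """Return True if the generated call has the expected name and acceptable arguments."""
    try:
        call = ast.parse(model_output, mode="eval").body
    except SyntaxError:
        return False
    if not isinstance(call, ast.Call) or not isinstance(call.func, ast.Name):
        return False
    if call.func.id != expected_name:
        return False
    supplied = {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords}
    # Every supplied argument must be among the acceptable values for that parameter
    return all(value in possible_args.get(name, []) for name, value in supplied.items())

print(ast_match(
    "get_weather(city='Paris', units='metric')",
    "get_weather",
    {"city": ["Paris"], "units": ["metric", "celsius"]},
))  # -> True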

Multi-agent Collaboration

Background

  • Multi-agent collaboration is increasingly used as a key AI design pattern for managing complex tasks. This approach divides large tasks into smaller subtasks assigned to specialized agents, such as software engineers, product managers, designers, and QA engineers. Each agent performs specific functions and can be built using the same or different Large Language Models (LLMs).
  • This concept parallels multi-threading in software development, where tasks are broken down to be handled efficiently by different processors or threads.

Motivation

  • The motivation for using multi-agent systems is threefold:
    1. Proven Effectiveness: Teams have reported positive results using this approach. Studies like those mentioned in the AutoGen paper have shown that multi-agent systems can outperform single-agent systems in complex tasks.
    2. Optimized Task Handling: Despite advancements in LLMs, focusing on specific, simpler tasks can yield better performance. This method allows developers to optimize each component by specifying critical aspects of subtasks.
    3. Complex Task Decomposition: This design pattern provides a framework to break down complex tasks into manageable subtasks, simplifying the development process and enhancing workflow and interaction among agents.

Evaluation

  • Quantifying and objectively evaluating LLM-based agents remains challenging despite their strong performance across domains. Benchmarks designed to evaluate LLM agents typically measure the following dimensions:
    • Utility: Task completion effectiveness and efficiency, measured by success rate and task outcomes.
    • Sociability: Language communication proficiency, cooperation, negotiation abilities, and role-playing capability.
    • Values: Adherence to moral and ethical guidelines, honesty, harmlessness, and contextual appropriateness.
    • Ability to Evolve Continually: Continual learning, autotelic learning ability, and adaptability to new environments.
    • Adversarial Robustness: Susceptibility to adversarial attacks, with techniques like adversarial training and human-in-the-loop supervision employed.
    • Trustworthiness: Calibration problems and biases in training data affect trustworthiness. Efforts are made to guide models to exhibit thought processes or explanations to enhance credibility.

Common Use-cases of LLM Agents

  • Customer Support: Automate and manage customer service interactions, offering 24/7 support.
  • Content Creation: Aid in generating articles, blog posts, and social media content.
  • Education: Act as virtual tutors to aid students and support language learning.
  • Coding Assistance: Offer coding suggestions and debugging help to developers.
  • Healthcare: Provide medical information, interpret medical literature, and offer counseling.
  • Accessibility: Enhance accessibility for individuals with disabilities by vocalizing written text.

Frameworks/Libraries

AutoGen Studio

  • Microsoft Research’s AutoGen Studio is a low-code interface for rapidly prototyping AI agents. It’s built on top of the AutoGen framework and can also be used for debugging and evaluating multi-agent workflows.

Further Reading

AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation

  • This paper by Wu et al. from Microsoft Research, Pennsylvania State University, University of Washington, and Xidian University, introduces AutoGen, an open-source framework designed to facilitate the development of multi-agent large language model (LLM) applications. The framework allows the creation of customizable, conversable agents that can operate in various modes combining LLMs, human inputs, and tools.
  • AutoGen agents can be easily programmed using both natural language and computer code to define flexible conversation patterns for different applications. The framework supports hierarchical chat, joint chat, and other conversation patterns, enabling agents to converse and cooperate to solve tasks. The agents can hold multiple-turn conversations with other agents or solicit human inputs, enhancing their ability to solve complex tasks.

  • Key technical details include the design of conversable agents and conversation programming. Conversable agents can send and receive messages, maintain internal context, and be configured with various capabilities such as LLMs, human inputs, and tools. These agents can also be extended to include more custom behaviors. Conversation programming involves defining agent roles and capabilities and programming their interactions using a combination of natural and programming languages. This approach simplifies complex workflows into intuitive multi-agent conversations.
  • Implementation details:
    1. Conversable Agents: AutoGen provides a generic design for agents, enabling them to leverage LLMs, human inputs, tools, or a combination. The agents can autonomously hold conversations and solicit human inputs at certain stages. Developers can easily create specialized agents with different roles by configuring built-in capabilities and extending agent backends (a minimal usage sketch appears after this paper summary).
    2. Conversation Programming: AutoGen adopts a conversation programming paradigm to streamline LLM application workflows. This involves defining conversable agents and programming their interactions via conversation-centric computation and control. The framework supports various conversation patterns, including static and dynamic flows, allowing for flexible agent interactions.
    3. Unified Interfaces and Auto-Reply Mechanisms: Agents in AutoGen have unified interfaces for sending, receiving, and generating replies. An auto-reply mechanism enables conversation-driven control, where agents automatically generate and send replies based on received messages unless a termination condition is met. Custom reply functions can also be registered to define specific behavior patterns.
    4. Control Flow: AutoGen allows control over conversations using both natural language and programming languages. Natural language prompts guide LLM-backed agents, while Python code specifies conditions for human input, tool execution, and termination. This flexibility supports diverse multi-agent conversation patterns, including dynamic group chats managed by the GroupChatManager class.

  • The framework’s architecture defines agents with specific roles and capabilities, interacting through structured conversations to process tasks efficiently. This approach improves task performance, reduces development effort, and enhances application flexibility. Key technical aspects include using a unified interface for agent interaction, conversation-centric computation for defining agent behaviors, and conversation-driven control flows that manage interactions among agents.
  • Applications demonstrate AutoGen’s capabilities in various domains:
    • Math Problem Solving: AutoGen builds systems for autonomous and human-in-the-loop math problem solving, outperforming other approaches on the MATH dataset.
    • Retrieval-Augmented Code Generation and Question Answering: The framework enhances retrieval-augmented generation systems, improving performance on question-answering tasks through interactive retrieval mechanisms.
    • Decision Making in Text World Environments: AutoGen implements effective interactive decision-making applications using benchmarks like ALFWorld.
    • Multi-Agent Coding: The framework simplifies coding tasks by dividing responsibilities among agents, improving code safety and efficiency.
    • Dynamic Group Chat: AutoGen supports dynamic group chats, enabling collaborative problem-solving without predefined communication orders.
    • Conversational Chess: The framework creates engaging chess games with natural language interfaces, ensuring valid moves through a board agent.
  • The empirical results indicate that AutoGen significantly outperforms existing single-agent and some multi-agent systems in complex task environments by effectively integrating and managing multiple agents’ capabilities. The paper includes a figure illustrating the use of AutoGen to program a multi-agent conversation, showing built-in agents, a two-agent system with a custom reply function, and the resulting automated agent chat.
  • The authors highlight the potential for AutoGen to improve LLM applications by reducing development effort, enhancing performance, and enabling innovative uses of LLMs. Future work will explore optimal multi-agent workflows, agent capabilities, scaling, safety, and human involvement in multi-agent conversations. The open-source library invites contributions from the broader community to further develop and refine AutoGen.
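  • To ground the conversable-agent and conversation-programming ideas above, here is a minimal two-agent example in the style of AutoGen’s documented AssistantAgent/UserProxyAgent pattern; treat it as a sketch, since configuration fields and defaults vary across AutoGen versions.

from autogen import AssistantAgent, UserProxyAgent

# LLM configuration; the model name and API key are placeholders.
llm_config = {"config_list": [{"model": "gpt-4", "api_key": "your_api_key"}]}

# An LLM-backed worker agent and a proxy agent that can execute the code it writes.
assistant = AssistantAgent("assistant", llm_config=llm_config)
user_proxy = UserProxyAgent(
    "user_proxy",
    human_input_mode="NEVER",                      # fully automated; no human in the loop
    code_execution_config={"work_dir": "coding", "use_docker": False},
)

# The auto-reply mechanism drives a multi-turn conversation until termination.
user_proxy.initiate_chat(
    assistant,
    message="Plot the closing prices of NVDA for the last month and save the chart.",
)

  • Here the user_proxy’s auto-reply loop executes any code the assistant writes and feeds the output back, so the two agents converse until the task is completed or a termination condition is met.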

Citation

If you found our work useful, please cite it as:

@article{Chadha2020DistilledAgents,
  title   = {Agents},
  author  = {Chadha, Aman and Jain, Vinija},
  journal = {Distilled AI},
  year    = {2020},
  note    = {\url{https://vinija.ai}}
}