Aman's AI Journal • Primers • Computer Control

Overview
Anthropic’s Computer Use
- Advantages
- Limitations
OpenAI’s Operator
- Advantages
- Limitations
Manus
- Advantages
- Limitations
Comparative Analysis
Open-Source Implementations
Key Technologies
Benchmarks
Citation

Overview

With the ongoing advancements in AI, agents are becoming increasingly capable of interacting with computers in human-like ways. These multimodal (specifically, vision-language agents) are designed to automate tasks across different environments, from controlling entire desktop systems to handling browser-based workflows. Below is an overview of three prominent AI agents specializing in computer use: Anthropic’s “Computer Use,” OpenAI’s “Operator,” and “Manus.”

Anthropic’s Computer Use

Anthropic’s “Computer Use” feature, which utilizes their Claude language model under-the-hood, enables the AI to interact with a computer’s desktop environment in a manner akin to human users. This functionality allows Claude to perform tasks such as moving the cursor, clicking buttons, and typing text, effectively executing complex, multi-step operations across various applications. For example, Claude can plan an outing by searching online and scheduling it on a calendar app, or build a simple promotional website by coding and testing it.

Advantages

Comprehensive System Interaction: Claude’s ability to control the entire desktop environment allows it to interact with multiple applications beyond just the browser, facilitating more integrated workflows.

Limitations

Local Execution: Since Claude operates directly on the user’s local machine, it may not yield control back to the user during task execution, potentially leading to usability concerns.

OpenAI’s Operator

OpenAI’s “Operator” is a browser-based AI agent designed to handle repetitive tasks such as filling out forms, ordering groceries, and even creating memes. Operator can be asked to handle a wide variety of repetitive browser tasks, performing actions like clicking, scrolling, or typing.

Advantages

Cloud-Based Execution: Operator runs in a virtual browser session in the cloud, allowing users to offload tasks without impacting local system resources.
Browser-Centric Tasks: Optimized for web-based interactions, Operator excels in tasks confined to the browser environment, offering a streamlined experience for internet-related automation.

Limitations

Restricted to Browser: Operator’s functionality is limited to the browser and cannot interact with applications beyond it, potentially limiting its utility for tasks requiring broader system access.

Manus

Manus is a general AI agent developed by Monica.im, designed to bridge the gap between human thoughts and actions by autonomously executing tasks across various domains. Unlike traditional AI assistants that provide suggestions, Manus delivers complete results by performing tasks such as writing and executing code, browsing web pages, and operating applications within an independent virtual environment.

Advantages

Autonomous Task Execution: Manus can independently handle complex tasks, including report writing, data analysis, and content generation, reducing the need for constant human intervention.
Multi-Modal Capabilities: Manus processes and generates various types of data, such as text, images, and code, making it versatile in handling diverse tasks.
Advanced Tool Integration: The ability to interact with external applications, including web browsers and code editors, allows Manus to automate workflows effectively.

Limitations

Limited to Browser/Terminal/Code: Unlike Anthropic’s “Computer Use,” which can control the entire desktop, Manus is currently restricted to interacting within browser environments, command-line interfaces, and code execution, limiting its ability to manage full desktop applications../

Comparative Analysis

Each of these AI agents offers distinct features tailored to different user needs. The choice between these AI agents depends on the specific requirements of the task at hand—whether comprehensive system interaction is needed, browser-based automation suffices, or autonomous execution across various domains is desired.
- Scope of Interaction: Anthropic’s Computer Use provides comprehensive system interaction by controlling the desktop environment, suitable for tasks involving multiple applications. OpenAI’s Operator focuses on browser-based tasks, offering a controlled and secure environment for web interactions. Manus combines these approaches by operating within an independent virtual environment, allowing it to perform a wide range of tasks autonomously.
- Execution Environment: Computer Use operates locally, which may raise concerns about resource utilization and user control during task execution. Operator’s cloud-based approach mitigates these issues but confines its capabilities to the browser. Manus’s independent virtual environment offers a balance, enabling extensive task execution without directly impacting the user’s local system.
- Performance Benchmarks: Manus has reportedly achieved state-of-the-art performance in AI benchmarks such as GAIA, outperforming models like OpenAI’s GPT-4 and Microsoft’s AI systems. This suggests a high level of proficiency in handling complex, real-world tasks.

Open-Source Implementations

Several open-source projects have emerged, inspired by these AI automation capabilities:

Browser Use

This project aims to make websites accessible for AI agents by providing tools that facilitate browser automation. It enables AI models to interact with web pages, perform actions like clicking and typing, and extract information, thereby enhancing the AI’s ability to navigate and manipulate web content.

Open Operator

Inspired by OpenAI’s Operator, this template facilitates the development of web agents using Stagehand on Browserbase. It leverages open-source technologies such as Next.js and React to create robust browser automation solutions, enabling developers to build AI agents capable of performing complex web-based tasks.

Proxy Lite

Proxy Lite, by convergence.ai is a 3-billion-parameter VLM designed for web automation tasks. It operates through a structured process of observation, reasoning, and action, allowing it to perform tasks like searching for recipes or analyzing stock information. Despite its relatively small size, Proxy Lite demonstrates high success rates in web automation benchmarks, making it a resource-efficient option for developers.

OmniParser V2

General-purpose LLMs struggle with identifying interactable UI elements and associating intended actions with the correct regions on the screen.
Introduced in OmniParser V2: Turning Any LLM into a Computer Use Agent, OmniParser converts UI screenshots into structured, interpretable elements, allowing LLMs to predict next actions more effectively.
OmniParser V2 improves accuracy in detecting small elements, enhances inference speed, and reduces latency by 60% with optimized image processing.
OmniParser+GPT-4o achieves an average accuracy of 39.6 on the ScreenSpot Pro benchmark, a major improvement over GPT-4o’s 0.8.
OmniTool is a dockerized Windows system that integrates essential tools for GUI automation, enabling seamless use with multiple LLMs.
OmniParser supports OpenAI (4o/o1/o3-mini), DeepSeek (R1), Qwen (2.5VL), and Anthropic (Sonnet) for screen understanding, grounding, and action execution.
Code, Hugging Face

Key Technologies

Collectively, these technologies underpinning these AI automation capabilities contribute to the development of sophisticated AI agents capable of automating a wide array of tasks, from web browsing to desktop application control, thereby enhancing productivity and user experience.

Browserbase

Browserbase is a platform designed to facilitate the development of AI agents capable of interacting with web browsers in a human-like manner. It provides core browser automation and interaction capabilities, allowing developers to create agents that can navigate websites, fill out forms, and perform other web-based tasks autonomously. By offering a robust framework for browser automation, Browserbase simplifies the process of building and deploying AI agents that require seamless web interaction.

Playwright

Playwright is an open-source automation library developed by Microsoft for browser testing and web scraping. It allows developers to script and automate interactions with web browsers, supporting multiple browser engines such as Chromium, Firefox, and WebKit. Playwright enables reliable end-to-end testing and automation, ensuring that web applications function correctly across different platforms and devices. Its capabilities are essential for developers aiming to create robust and cross-compatible web applications.

Stagehand

Stagehand is a Software Development Kit (SDK) built on top of Playwright, offering a higher-level API for precise Document Object Model (DOM) manipulation and state management. It enhances debugging capabilities and provides AI fail-safes, making browser automation more reliable and efficient. Stagehand enables developers to script complex interactions with web pages, ensuring that AI agents can handle dynamic content and respond to changes in the web environment effectively.

Benchmarks

LLM-based agents are increasingly being developed for autonomous tasks such as web navigation, tool use, and real-world problem-solving. However, their evaluation requires specialized benchmarks that assess not only static model performance but also interactive capabilities, adaptability, and long-term decision-making.
These benchmarks evaluate agents that operate across multiple modalities, such as vision, action, and interactive simulations, beyond pure text-based reasoning.

OSWorld

Description: OSWorld evaluates multimodal agents that support task setup, execution-based evaluation, and interactive learning across operating systems. It can serve as a unified environment for evaluating open-ended computer tasks that involve arbitrary applications.
Dataset Attributes: Designed to assess adaptability to new conditions and autonomous decision-making in an unpredictable setting.
Reference: “OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments”

WebArena

Description: WebArena is a benchmark that assesses LLM agents’ ability to interact with web interfaces, perform searches, and complete form-filling tasks in realistic browser-based environments.
Dataset Attributes: Uses high-fidelity web simulations to test goal-directed web navigation and interaction capabilities.
Reference: “WebArena: A Realistic Web Environment for Building Autonomous Agents”

WebVoyager

Description: WebVoyager evaluates the autonomous exploration abilities of LLM agents, requiring them to browse, extract, and process information from diverse websites.
Dataset Attributes: Tests adaptability to unfamiliar web structures, goal-directed browsing, and information retrieval efficiency.
Reference: “WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models”

General AI Agent Benchmark (GAIA)

Description: GAIA assesses the robustness and problem-solving skills of AI agents across multiple domains, including gaming, online environments, and decision-making tasks.
Dataset Attributes: Emphasizes general intelligence, multi-step reasoning, and adaptability across different test environments.
Reference: “GAIA: A General AI Agent Benchmark”

Citation

If you found our work useful, please cite it as:

@article{Chadha2020DistilledComputerControl,
  title   = {Computer Control},
  author  = {Chadha, Aman and Jain, Vinija},
  journal = {Distilled AI},
  year    = {2020},
  note    = {\url{https://vinija.ai}}
}