Open Computer Use: AI Agents with Hands-On Desktop Control

October 17 · Published in AI Tools

Open Computer Use lets AI agents operate computers much as humans do. It handles browser automation, terminal commands, and desktop application control, so the system can carry out complex tasks autonomously. Because it supports multiple AI providers and lets you use your own API keys, you keep control over your data and your costs. Users receive real-time feedback throughout the process, including progress bars, visual tool logs, and live screenshots from the virtual machine. A multi-agent system breaks broad requests down into smaller, logical steps. For security, everything runs inside isolated, disposable Docker containers. Common applications include data research, automated testing, content creation, server operations, e-commerce monitoring, and business intelligence.

What the AI Actually Does

• Human-like Web Browsing: Searches, clicks, fills out forms, and scrapes data.
• System Management: Executes terminal commands, edits files, and manages directories.
• UI Automation: Controls desktop applications directly through the user interface.
• Task Coordination: Orchestrates multiple agents to divide and complete complex assignments.
• Live Execution Streaming: Provides real-time visual and text feedback.
• Self-Hosted Flexibility: Fully open-source and ready for local deployment.

Open Computer Use provides "computer use" capabilities similar to those found in Anthropic’s Claude, but with the benefit of being open and extensible. This allows developers to customize the system to meet specific requirements.

Browser Agent

• Employs a search-first strategy via the Google Search API.
• Uses smart navigation and automatic form-filling to eliminate manual effort.
• Identifies page elements precisely for accurate clicking and interaction.
• Supports multiple tabs for parallel workflows.
• Captures page context so the AI understands visual and textual elements.
• Generates screenshots for visual verification of actions.
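The search-first step can be sketched against Google's Custom Search JSON API. This is an assumption: the article says only "Google Search API", so the endpoint, the `key`/`cx`/`q`/`num` parameters, and the helper names `build_search_url` and `pick_start_urls` are illustrative, not taken from the project's code. The sketch builds the request URL and reads links from an already-parsed response, so it makes no network calls.

```python
from urllib.parse import urlencode

# Endpoint of Google's Custom Search JSON API (assumed to be the search API in use).
SEARCH_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_search_url(query: str, api_key: str, engine_id: str, num: int = 5) -> str:
    """Build the GET URL for a web search; no network request is made here."""
    if not 1 <= num <= 10:
        # The API returns at most 10 results per request.
        raise ValueError("num must be between 1 and 10")
    params = {"key": api_key, "cx": engine_id, "q": query, "num": num}
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


def pick_start_urls(response: dict, limit: int = 3) -> list:
    """Return the links of the top-ranked results from a parsed JSON response."""
    return [item["link"] for item in response.get("items", [])[:limit]]
```

A browser agent would fetch that URL, parse the JSON body, and open the returned links in separate tabs before interacting with the pages.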

Terminal Agent

• Executes commands within isolated environments.
• Performs file operations, including reading, writing, editing, and deleting.
• Maintains full directory control.
• Runs Python, Node.js, and bash scripts.
• Handles package installation and runtime configuration.
• Streams live output for monitoring.
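A minimal sketch of how a terminal agent might run a command and stream its output line by line, using only Python's `subprocess` and `threading` modules. The function name `run_command`, the 30-second default timeout, and the kill-on-timeout policy are assumptions for illustration; the project's actual isolation comes from running inside the Docker VM, which this sketch does not reproduce.

```python
import subprocess
import sys
import threading


def run_command(argv, cwd=None, timeout=30.0, on_line=print):
    """Run argv, pass each output line to on_line as it arrives, and return the exit code."""
    proc = subprocess.Popen(
        argv,
        cwd=cwd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into the same stream as stdout
        text=True,
    )
    # Kill the process if it is still running when the timeout expires.
    timer = threading.Timer(timeout, proc.kill)
    timer.start()
    try:
        for line in proc.stdout:  # yields each line once the child has flushed it
            on_line(line.rstrip("\n"))
        return proc.wait()  # negative on POSIX if the process was killed by a signal
    finally:
        timer.cancel()
        proc.stdout.close()


if __name__ == "__main__":
    run_command([sys.executable, "--version"])
```

Merging stderr into stdout keeps error messages in order with regular output, which is what a live log view needs. The timer kills only the direct child, so commands that spawn long-lived background processes would also need process-group handling.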

Desktop Agent

• Uses computer vision to identify UI elements.
• Performs mouse movements, keyboard inputs, and application launches.
• Manages window resizing, focusing, and arrangement.
• Interprets screenshots to make context-aware decisions.
• Extracts text from images using OCR.
• Functions natively on Linux desktop systems.
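One way to turn a desktop agent's decisions into concrete input events on an X11 desktop such as XFCE is to map each action to an `xdotool` command. Whether Open Computer Use relies on `xdotool` is an assumption, and the action classes and `to_xdotool` are hypothetical names. The sketch only builds argument lists; executing them (for example with `subprocess.run`) requires `xdotool` to be installed inside the VM.

```python
import shlex
from dataclasses import dataclass
from typing import Union


@dataclass
class MoveMouse:
    x: int
    y: int


@dataclass
class Click:
    button: int = 1  # X11 button numbers: 1 = left, 2 = middle, 3 = right


@dataclass
class TypeText:
    text: str


@dataclass
class KeyPress:
    combo: str  # xdotool key names, e.g. "ctrl+l" or "Return"


Action = Union[MoveMouse, Click, TypeText, KeyPress]


def to_xdotool(action: Action) -> list:
    """Translate one hypothetical agent action into an xdotool argument list."""
    if isinstance(action, MoveMouse):
        return ["xdotool", "mousemove", str(action.x), str(action.y)]
    if isinstance(action, Click):
        return ["xdotool", "click", str(action.button)]
    if isinstance(action, TypeText):
        return ["xdotool", "type", "--delay", "12", action.text]
    if isinstance(action, KeyPress):
        return ["xdotool", "key", action.combo]
    raise TypeError(f"unsupported action: {action!r}")


if __name__ == "__main__":
    # Example plan (coordinates are arbitrary): click into a browser window,
    # focus the address bar, type a URL, and press Enter.
    plan = [MoveMouse(640, 40), Click(), KeyPress("ctrl+l"), TypeText("example.com"), KeyPress("Return")]
    for step in plan:
        print(shlex.join(to_xdotool(step)))
```

Keeping actions as plain data makes them easy to log and replay, which fits the article's emphasis on visual tool logs.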

Multi-Agent System

• An AI planner decomposes goals into actionable steps.
• Tasks are executed in sequence, maintaining context between each phase.
• Specialized agents are deployed to handle specific skill sets.
• Automatic retries are triggered upon encountering errors to ensure task completion.
• Requests user confirmation when necessary.
• Compiles detailed post-task reports for review.
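The plan, execute, and retry loop described above can be sketched as follows. The planner's output is hard-coded here as a list of `Step` objects (in the real system an AI model produces it), and `Step`, `RunReport`, `execute_plan`, and the agent callables are hypothetical names. Each step's result goes into a shared context dictionary so later steps can use it, failed steps are retried with exponential backoff, and execution stops at the first step that still fails after its retries.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Step:
    description: str
    agent: str  # which specialized agent handles it, e.g. "browser" or "terminal"


@dataclass
class RunReport:
    completed: list = field(default_factory=list)  # (step, attempts used)
    failed: list = field(default_factory=list)     # (step, last error)


def execute_plan(steps, agents, max_retries=2, backoff=0.5):
    """Run steps in order, sharing a context dict and retrying failures with backoff."""
    context = {}
    report = RunReport()
    for step in steps:
        handler = agents[step.agent]
        for attempt in range(max_retries + 1):
            try:
                # Each handler sees the results of all earlier steps.
                context[step.description] = handler(step, context)
                report.completed.append((step.description, attempt + 1))
                break
            except Exception as exc:
                if attempt == max_retries:
                    report.failed.append((step.description, repr(exc)))
                    return context, report  # later steps likely depend on this one
                time.sleep(backoff * 2 ** attempt)  # 0.5 s, 1 s, 2 s, ...
    return context, report
```

Stopping at the first permanent failure is a deliberate choice: later steps usually depend on earlier results, so continuing would produce a misleading report. The `RunReport` is the raw material for the post-task summary.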

Three-Layer Architecture

• Frontend (Next.js 15): A comprehensive dashboard featuring the chat UI, model selection, and VM controls for interaction and observation.
• Backend API (FastAPI): The core of the multi-agent system, housing the planner and the various agents. It manages WebSocket VM control, database services, and billing.
• Docker VM (Ubuntu 22.04 + XFCE): A pre-configured environment with Chrome, a terminal, and various desktop utilities. It includes a WebSocket proxy on port 8080 and VNC on port 5900.

Set Up Open Computer Use in Five Steps

Requirements include Node.js 20+ (with npm), Python 3.10+ (with pip), Docker and Docker Compose, a Supabase account (free tier is sufficient), and API keys from providers like OpenAI or Anthropic.

1. Clone the Repository

Run git clone https://github.com/LLmHub-dev/open-computer-use.git, then cd open-computer-use.

2. Configure Supabase

• Create a new Supabase project. Once it is initialized, navigate to Project Settings → API to retrieve your keys.
• Apply the database schema. Paste the contents of supabase/schema.sql into the Supabase SQL Editor and run it. Alternatively, use the Supabase CLI or manual psql commands.
• The schema includes tables for users, authentication, chats, agents, billing, and projects.
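Before or after running the schema in the SQL Editor, it can help to list which tables the file actually defines. The following is a rough, regex-based check rather than a SQL parser (it would also match "create table" inside SQL comments); `table_names` is a hypothetical helper, and the exact table names in `supabase/schema.sql` may differ from the topics listed above.

```python
import re
from pathlib import Path

# Matches "create table [if not exists] [schema.]name", with optional double quotes.
CREATE_TABLE = re.compile(
    r'create\s+table\s+(?:if\s+not\s+exists\s+)?((?:"?\w+"?\.)?"?\w+"?)',
    re.IGNORECASE,
)


def table_names(sql: str) -> list:
    """Return table names in the order they are created, with quotes removed."""
    return [m.group(1).replace('"', "") for m in CREATE_TABLE.finditer(sql)]


if __name__ == "__main__":
    schema = Path("supabase/schema.sql")
    if schema.exists():
        for name in table_names(schema.read_text(encoding="utf-8")):
            print(name)
```

Comparing this list with the tables shown in the Supabase Table Editor is a quick way to confirm the whole script ran.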

3. Set Environment Variables

• Frontend: Copy .env.example to .env and enter the required details.
• Backend: Copy backend/.env.example to backend/.env and fill in the fields.
• Ensure you include Supabase keys, an encryption key, a CSRF secret, a Google Search API key, and at least one AI provider key. Optional fields include Azure Container Instances and Stripe keys.
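A small pre-flight check can catch a missing value before the backend starts. The variable names below are hypothetical placeholders, since the article does not list them; copy the real names from `backend/.env.example`. `secrets.token_urlsafe` produces a random string suitable for a CSRF secret; if the backend expects the encryption key in a specific format (for example a Fernet key), generate it as the example file describes instead.

```python
import secrets
from pathlib import Path

# Hypothetical variable names; the real ones are in .env.example and backend/.env.example.
REQUIRED = ["SUPABASE_URL", "SUPABASE_ANON_KEY", "ENCRYPTION_KEY", "CSRF_SECRET", "GOOGLE_SEARCH_API_KEY"]
PROVIDERS = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY"]  # at least one must be set


def parse_env(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blanks and comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("export "):
            line = line[len("export "):]
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("\"'")  # drop surrounding quotes
    return values


def missing_settings(env: dict) -> list:
    """List required keys that are absent or empty."""
    missing = [key for key in REQUIRED if not env.get(key)]
    if not any(env.get(key) for key in PROVIDERS):
        missing.append("one of " + ", ".join(PROVIDERS))
    return missing


def new_secret(nbytes: int = 32) -> str:
    """Random URL-safe string, e.g. for a CSRF secret."""
    return secrets.token_urlsafe(nbytes)


if __name__ == "__main__":
    env_file = Path("backend/.env")
    if env_file.exists():
        print(missing_settings(parse_env(env_file.read_text(encoding="utf-8"))) or "all set")
    print("example secret:", new_secret())
```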

4. Install Dependencies and Start Services

• Frontend: Run npm install.
• Backend: Navigate to the backend directory, create and activate a virtual environment, then run pip install -r requirements.txt.
• Launch: The easiest method is via Docker: docker-compose up --build. The frontend will be available at http://localhost:3000 and the backend at http://localhost:8001. For manual starts, use npm run dev for the frontend and python main.py for the backend. If you require the AI desktop, run docker-compose -f docker-compose.ai-desktop.yml up --build.
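After `docker-compose up --build`, the containers can take a while to become ready. This sketch polls the ports named in the article until each accepts a TCP connection or a deadline passes. An open port only shows that something is listening, not that the service is healthy, and ports 8080 and 5900 are reachable from the host only if the Compose file publishes them; `port_open` and `wait_for` are hypothetical helpers.

```python
import socket
import time

# Ports taken from the article: frontend, backend, VM WebSocket proxy, VM VNC.
SERVICES = {"frontend": 3000, "backend": 8001, "vm-websocket": 8080, "vm-vnc": 5900}


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for(services: dict, host: str = "localhost", deadline_s: float = 120.0,
             interval_s: float = 2.0) -> dict:
    """Poll until every service port is open or the deadline passes; return per-service status."""
    status = {name: False for name in services}
    end = time.monotonic() + deadline_s
    while True:
        for name, port in services.items():
            if not status[name]:
                status[name] = port_open(host, port)
        if all(status.values()) or time.monotonic() >= end:
            return status
        time.sleep(interval_s)
```

For example, calling `wait_for({"frontend": 3000, "backend": 8001})` right after the launch command returns a status per service, and any `False` entry points to the container whose logs are worth checking first.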

5. Start Your First Agent Session

Navigate to http://localhost:3000, sign up or log in via Supabase, and start a new chat. Enter a command such as "Find the latest AI news and summarize the top three articles" to see the agent in action.

Where Open Computer Use Fits

The platform lets you bring your own API keys and switch providers mid-conversation. Supported models include OpenAI (GPT-4, GPT-4 Turbo, GPT-3.5), Anthropic (Claude 3.5 Sonnet, Claude 3 Opus), Google (Gemini Pro, Gemini 1.5), Azure OpenAI, xAI (Grok), Mistral AI (Mistral Large, Mixtral), Perplexity, and OpenRouter (access to over 100 models). All keys are stored encrypted, and you maintain full control over your spending.
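Switching providers mid-conversation works naturally if the message history is stored in a provider-neutral shape and the provider and model are chosen per request. The `Conversation` class below is a hypothetical illustration of that design, not the project's data model; it uses the role/content message format common to the OpenAI and Anthropic chat APIs and leaves out the actual API calls and key decryption.

```python
from dataclasses import dataclass, field


@dataclass
class Conversation:
    """Provider-neutral chat state; the provider and model can change between turns."""
    provider: str
    model: str
    messages: list = field(default_factory=list)  # [{"role": ..., "content": ...}]
    switches: list = field(default_factory=list)  # previous (provider, model) pairs

    def add(self, role: str, content: str) -> None:
        self.messages.append({"role": role, "content": content})

    def switch(self, provider: str, model: str) -> None:
        """Change provider and model; the message history is kept as-is."""
        self.switches.append((self.provider, self.model))
        self.provider, self.model = provider, model

    def request(self) -> dict:
        """Payload for the next call; copies messages so callers cannot mutate the history."""
        return {"provider": self.provider, "model": self.model,
                "messages": [dict(m) for m in self.messages]}
```

Recording each switch also gives the billing side a simple trail of which provider served which part of a conversation.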

You can monitor the agent's work in real time through progress indicators, tool-call visuals, live VM screenshots, and detailed logs. The AI autonomously breaks down requests, delegates subtasks to specialized agents, and maintains context throughout the workflow.

Every session is isolated within a fresh Docker container. These environments are temporary: nothing is persisted to disk unless you explicitly request it. You can also restrict network access, set resource limits, and monitor usage for enhanced security.
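The isolation and resource controls mentioned above map onto standard `docker run` options: `--rm` discards the container when it stops, `--memory`, `--cpus`, and `--pids-limit` cap resources, and `--network=none` cuts off network access. The sketch below assembles such a command; the image name, the default limits, and the `sandbox_command` helper are assumptions, and the project may configure its containers differently (for example through its Compose files or Azure Container Instances).

```python
import shlex
from typing import Optional, Sequence


def sandbox_command(
    image: str,
    memory: str = "2g",
    cpus: float = 2.0,
    pids: int = 512,
    allow_network: bool = True,
    ports: Sequence[int] = (),
    name: Optional[str] = None,
) -> list:
    """Build a `docker run` argument list for a disposable, resource-limited container."""
    if ports and not allow_network:
        raise ValueError("published ports are unreachable with --network=none")
    cmd = [
        "docker", "run",
        "--rm",                  # delete the container and its writable layer on exit
        "--detach",
        f"--memory={memory}",    # hard memory cap
        f"--cpus={cpus}",        # CPU quota
        f"--pids-limit={pids}",  # guard against fork bombs
    ]
    if not allow_network:
        cmd.append("--network=none")  # no network interfaces except loopback
    for port in ports:
        # Publish only on the host's loopback interface, not on all interfaces.
        cmd += ["-p", f"127.0.0.1:{port}:{port}"]
    if name:
        cmd.append(f"--name={name}")
    cmd.append(image)
    return cmd


if __name__ == "__main__":
    # "desktop-image" is a placeholder, not the project's actual image name.
    print(shlex.join(sandbox_command("desktop-image", ports=[8080, 5900])))
```

Binding published ports to 127.0.0.1 keeps the VNC and WebSocket endpoints off the external network, so only the local backend and browser can reach them.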

Core Use Cases

Research & Data: Web scraping, data extraction, competitor analysis, market research, and academic paper gathering.

Testing & QA: Automated UI testing, cross-browser verification, end-to-end test generation, and regression testing.

Content Creation: Capturing screenshots, recording tutorials, and generating product demos.

DevOps & Automation: Server configuration, deployment script execution, log analysis, and system monitoring.

E-commerce: Price tracking, product research, order management, and inventory monitoring.

Business Intelligence: Automated report generation, dashboard monitoring, data workflows, and KPI tracking.