Agent S: An Open Agentic Framework that Uses Computers Like a Human

A significant development for AI agents: they are learning to use GUIs.

Nov 03, 2024


Agent S uses a computer the way a human does to solve diverse desktop tasks on different systems, breaking them down into bite-sized actions and remembering what works.

Experience-augmented hierarchical planning enables Agent S to handle diverse GUI tasks with improved performance.

Original Problem 🎯:

Automating complex computer tasks presents challenges in acquiring domain-specific knowledge, planning over long task horizons, and handling dynamic interfaces.

Solution in this Paper 🛠️:

• Experience-augmented hierarchical planning:

Manager module for task decomposition
Worker modules for subtask execution
Self-evaluator for experience summarization

• Agent-Computer Interface (ACI):

Dual-input strategy for visual understanding and element grounding
Bounded action space of language-based primitives

• Continual memory update mechanism for ongoing learning

Key Insights from this Paper 💡:

• Combining external knowledge and internal experience enhances task planning

• Structured interface improves MLLM reasoning for GUI control

• Hierarchical planning supports long-horizon workflows

• Continual learning enables adaptation to new tasks and environments

Results 📊:

• OSWorld benchmark: 20.58% success rate (83.6% relative improvement over baseline)

• Consistent improvements across five computer task categories

• WindowsAgentArena: 18.2% success rate (36.8% relative improvement over baseline, without explicit adaptation to Windows)

• Ablation studies confirm effectiveness of individual components
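To make the parenthetical figures concrete, here is a small arithmetic sketch. It assumes both percentages in parentheses are relative improvements over a baseline, and it backs out the baseline success rate each one implies. The function name `implied_baseline` is made up for this illustration.

```python
def implied_baseline(success_rate: float, relative_improvement: float) -> float:
    """Back out a baseline rate from a reported rate and its relative gain.

    Both arguments are percentages, e.g. 20.58 and 83.6.
    """
    return success_rate / (1.0 + relative_improvement / 100.0)


# Figures taken from the results list above.
osworld_baseline = implied_baseline(20.58, 83.6)
windows_baseline = implied_baseline(18.2, 36.8)

print(f"OSWorld implied baseline: {osworld_baseline:.2f}%")
print(f"WindowsAgentArena implied baseline: {windows_baseline:.2f}%")
```

Under that reading, the OSWorld baseline comes out near 11.2% and the WindowsAgentArena baseline near 13.3%, so the absolute gains are roughly 9 and 5 percentage points respectively.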

🤖 Agent S addresses three main challenges in automating computer tasks:

Acquiring domain-specific knowledge for diverse applications
Planning over long task horizons
Handling dynamic, non-uniform interfaces

🧠 Agent S implements experience-augmented hierarchical planning with three components:

A Manager module that decomposes complex tasks into subtasks using web knowledge and narrative memory
Worker modules that execute subtasks using episodic memory and trajectory reflection
A self-evaluator that summarizes experiences as textual rewards, updating narrative and episodic memories
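The three roles above can be sketched as a toy loop. This is not the Agent S implementation: every name here (`Memory`, `manager_plan`, `worker_execute`, `self_evaluate`, `run_task`) is hypothetical, the Manager splits on the word "then" instead of prompting an MLLM with web knowledge, and the Worker only pretends to act. The point is just the data flow: plan, execute, summarize, then reuse the summaries on the next run.

```python
from dataclasses import dataclass, field


@dataclass
class Memory:
    """Toy stand-in: narrative memory is keyed by task, episodic by subtask."""
    narrative: dict[str, str] = field(default_factory=dict)
    episodic: dict[str, str] = field(default_factory=dict)


def manager_plan(task: str, memory: Memory) -> list[str]:
    """Decompose a task into subtasks (assumption: subtasks are joined by ' then ')."""
    prior = memory.narrative.get(task)
    if prior is not None:
        # A real Manager would fold this summary into its planning prompt.
        print(f"[manager] recalled: {prior}")
    return [part.strip() for part in task.split(" then ") if part.strip()]


def worker_execute(subtask: str, memory: Memory) -> list[str]:
    """Return a pretend trajectory, prefixed by any recalled subtask experience."""
    trajectory = []
    prior = memory.episodic.get(subtask)
    if prior is not None:
        trajectory.append(f"recall: {prior}")
    trajectory.append(f"act: {subtask}")
    return trajectory


def self_evaluate(task: str, subtasks: list[str],
                  trajectories: list[list[str]], memory: Memory) -> None:
    """Summarize the run as text and write it back to both memories."""
    for subtask, trajectory in zip(subtasks, trajectories):
        memory.episodic[subtask] = f"finished in {len(trajectory)} step(s)"
    memory.narrative[task] = f"solved with {len(subtasks)} subtask(s)"


def run_task(task: str, memory: Memory) -> list[list[str]]:
    subtasks = manager_plan(task, memory)
    trajectories = [worker_execute(s, memory) for s in subtasks]
    self_evaluate(task, subtasks, trajectories, memory)
    return trajectories


memory = Memory()
first = run_task("open the editor then type a greeting", memory)
second = run_task("open the editor then type a greeting", memory)
print(first)
print(second)
```

On the second run, each trajectory starts with a `recall:` entry, because the self-evaluation step from the first run has already populated both memories.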

Agent-Computer Interface (ACI) improves agent performance

The ACI is an abstraction layer that:

Uses a dual-input strategy: visual input for understanding environmental changes and an image-augmented accessibility tree for precise element grounding
Defines a bounded action space of language-based primitives conducive to Multimodal Large Language Model reasoning
Generates environment transitions at the right temporal resolution for observing immediate feedback
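A bounded action space can be pictured as a small, closed set of primitives that are checked against the current accessibility tree before anything runs. The sketch below is illustrative only: the primitive names, the `Element` and `Action` types, and the `validate` rule are assumptions made for this example, not the action set defined in the paper.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Primitive(Enum):
    """Hypothetical closed set of language-based primitives."""
    CLICK = "click"
    TYPE = "type"
    HOTKEY = "hotkey"
    DONE = "done"


@dataclass(frozen=True)
class Element:
    """One node of a (simplified) accessibility tree, tagged with an ID."""
    element_id: int
    role: str
    label: str


@dataclass(frozen=True)
class Action:
    primitive: Primitive
    element_id: Optional[int] = None
    text: str = ""


def validate(action: Action, tree: dict[int, Element]) -> bool:
    """Accept an action only if it is grounded in the current tree."""
    if action.primitive in (Primitive.CLICK, Primitive.TYPE):
        if action.element_id not in tree:
            return False  # target element does not exist on screen
    if action.primitive in (Primitive.TYPE, Primitive.HOTKEY):
        return bool(action.text)  # these primitives need a text payload
    return True


elements = [Element(0, "button", "Save"), Element(1, "entry", "File name")]
tree = {e.element_id: e for e in elements}

ok = validate(Action(Primitive.TYPE, element_id=1, text="report.txt"), tree)
bad = validate(Action(Primitive.CLICK, element_id=7), tree)
print(ok, bad)
```

Because `Primitive` is an enum, a name outside the set (for example `Primitive("swipe")`) raises `ValueError` instead of reaching the environment, which is the sense in which the space is bounded. Referring to elements by ID rather than by raw pixel coordinates is what the tagged accessibility tree makes possible.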
