Autonomous Browsing using Large Action Models

Wednesday 14:30 in Hassium

Large Action Models (LAMs) were first introduced by Rabbit with the launch of their R1 device, aiming to create end-to-end trained models that automatically translate human instructions into actions. Since then, the definition of LAMs has evolved to encompass Large Language Models (LLMs) utilized in multi-agent settings. Notable examples include Anthropic's "Computer Use" feature in their Claude model and Google's Project Mariner. These projects allow LLMs to operate a web browser or computer in a human-like manner by viewing the screen, moving the cursor, clicking buttons, and typing text, thereby fulfilling the original promise of LAMs by effectively translating human instructions into automated actions.

We present an innovative application of LAMs that automates the job application process using AI. Our system autonomously navigates unfamiliar website structures, fills out forms, handles document uploads, and manages cookie banners without human intervention. This level of automation streamlines the application process for job seekers while ensuring accurate and timely submissions.

To achieve this, we leveraged the LaVague framework, which employs a modular, agent-based approach:

Coordinator Agent: A central agent powered by a multimodal model coordinates the entire process. It has access to website visuals, user data (e.g., personal details, CV information), previous instructions, and the overall objective. Based on this information, it delegates tasks to specialized agents.
Navigation Control Agent: For simple website navigation, this agent utilizes a browser driver such as Selenium to directly interact (e.g., scroll) with the webpage.
Knowledge Agent: When additional information is required, this agent performs knowledge-intensive tasks using an LLM. Examples include researching specific details or restructuring CV data.
Navigation Engine Agent: For complex website interactions like inputting values or uploading files, this agent generates custom code for the browser driver. Using an LLM with access to the HTML code, it creates the necessary commands.

These agents work iteratively, performing tasks step by step until either the objective is achieved, or a maximum number of steps is reached.

By building a custom solution around the LaVague framework tailored specifically for the job application process, we successfully automated the entire workflow. In our presentation, we discuss our overall architecture, the challenges encountered during development and share valuable lessons learned for practical adoption.

Large Action Models like these highlight the transformative potential of AI in automating intricate tasks, bridging the gap between understanding human intentions and executing them in dynamic, real-world scenarios.

Nico Kreiling

Arne Grobrügge

Arne Grobrügge, M. Sc. Wirtschaftsinformatiker mit Schwerpunkt Maschinelles Lernen und Informationssicherheit, arbeitet als Data Scientist bei der scieneers GmbH. Im Rahmen von diversen Kundenprojekten entwickelt und überwacht er den Einsatz von Sprachmodellen und Mulit-Agenten Systemen in Unternehmen, um innovative und wertschöpfende Lösungen zu schaffen.