**Tianci Xue$^1$, Weijian Qi$^{1*}$, Tianneng Shi$^{2*}$, Chan Hee Song$^1$, Boyu Gou$^1$, Dawn Song$^2$,**

**Huan Sun$^{1†}$, Yu Su$^{1†}$ ($^*$: Equal Contribution, $^{†}$: Equal advising)**

$^1$The Ohio State University, $^2$University of California, Berkeley

🏆 Leaderboard | Paper | Data | Code


Figure 1. Frontier web agents show a drastic drop in task success rate when evaluated on our Online-Mind2Web benchmark compared with their previously self-reported results on WebVoyager. Surprisingly, many recent agents, except Operator, do not outperform the simple SeeAct agent (Zheng et al., 2024) released at the beginning of 2024. Claude Computer Use is based on Claude 3.5 to be comparable with the previously reported WebVoyager results.


1 Introduction

If you were in Vancouver in December 2024 attending NeurIPS, you'd have noticed one theme echoing through the beautiful Vancouver Convention Center throughout the conference: 2025 will be the year of agents! It surely felt like a perfect storm was brewing for agents: agentic workflows had already proved valuable in many use cases; multimodal LLMs, the foundation for more autonomous agents, had been rapidly improving; Anthropic had released the Claude Computer Use agent just two months earlier; open-source efforts like Browser Use had just come out claiming a whopping 89% success rate on web agent tasks; and everyone knew OpenAI had been building its agents at full steam (which later became Operator and Deep Research). It seemed that highly capable and practical agents might indeed be just months away.

As a research group that has been working on LLM-based agents (or what we call language agents) longer than perhaps most—from Mind2Web to SeeAct to UGround and WebDreamer—we resonate strongly with the excitement in the community. However, we are also well aware that many fundamental research gaps remain on the path to fully autonomous agents, and that current agents are probably not as competent as the reported benchmark numbers suggest. As a scientific field, we must guard against over-optimism, especially when the supporting data may be insufficient or biased, because it leads to short-sightedness, unrealistic expectations, and irrational decisions. That motivated us to conduct a comprehensive and rigorous assessment of the current state of web agents, a popular type of agent where many expect to see the first commercial successes.

TL;DR for the rest of the blog:

2 New Online-Mind2Web Benchmark

2.1 Why Introduce a New Benchmark?

To accurately assess the competency of web agents, we need to evaluate them on realistic tasks across a wide range of real-world websites, under a setting that approximates as closely as possible how real users would use such agents. However, existing benchmarks fall short in several ways: