WebArena

WebArena is a realistic and reproducible web environment designed to facilitate the development of autonomous agents capable of executing tasks. A GPT-4-based agent achieves an end-to-end task success rate of only 14.41%, significantly lower than the human performance of 78.24%.
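To make the metric concrete, here is a minimal sketch of how an end-to-end task success rate could be computed from per-task outcomes, and how large the gap between the two quoted figures is. The `TaskOutcome` type, the `successRate` helper, and the sample data are assumptions made for illustration; they are not part of WebArena or the benchthing API.

```typescript
// Hypothetical shape for one task attempt; not part of any WebArena or benchthing API.
interface TaskOutcome {
  taskId: string;
  success: boolean; // true only if the whole task was completed end to end
}

// End-to-end success rate as a percentage: successful tasks / attempted tasks.
function successRate(outcomes: TaskOutcome[]): number {
  if (outcomes.length === 0) {
    throw new Error('successRate needs at least one outcome');
  }
  const passed = outcomes.filter((o) => o.success).length;
  return (passed / outcomes.length) * 100;
}

// Illustrative data only, not real benchmark results.
const sample: TaskOutcome[] = [
  { taskId: '1', success: true },
  { taskId: '2', success: false },
  { taskId: '3', success: false },
  { taskId: '4', success: false },
];

console.log(successRate(sample).toFixed(2)); // prints "25.00"

// Gap between the two figures quoted above, in percentage points.
const gap = 78.24 - 14.41;
console.log(gap.toFixed(2)); // prints "63.83"
```

A task counts as a success only when it is completed end to end, so partial progress contributes nothing to the rate.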

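import { Bench } from 'benchthing';

// Create a Bench instance for the WebArena benchmark.
const bench = new Bench('webarena');

// Run task '1' against your agents. yourAgent1 and yourAgent2 are
// placeholders for agents you define yourself. Top-level await
// requires an ES module context.
await bench.run({
  benchmark: 'webarena',
  taskId: '1',
  agents: [yourAgent1, yourAgent2],
});

// Fetch the result for task '1'.
const result = await bench.getResult('1');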
Sign up to get access to the WebArena benchmark API