WebArena is a realistic and reproducible web environment designed to facilitate the development of autonomous agents capable of executing tasks. A GPT-4-based agent achieves an end-to-end task success rate of only 14.41%, significantly lower than the human performance of 78.24%.
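import { Bench } from 'benchthing';

// Create a benchmark client for WebArena.
const bench = new Bench('webarena');

// Run task '1'. yourAgent1 and yourAgent2 are placeholders for agents you define.
await bench.run({
  benchmark: 'webarena',
  taskId: '1',
  agents: [yourAgent1, yourAgent2],
});

// Fetch the result for task '1'.
const result = await bench.getResult('1');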
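
The success rates quoted above are end-to-end figures: the share of attempted tasks that an agent completes. A minimal sketch of that arithmetic follows, assuming each task yields a simple pass/fail outcome. The `TaskOutcome` type and both helper functions are hypothetical names introduced for illustration; they are not part of the benchthing API, and the shape of `result` returned by `getResult` is not assumed here.

```typescript
// Hypothetical shape of one task outcome; NOT part of the benchthing API.
interface TaskOutcome {
  taskId: string;
  success: boolean;
}

// End-to-end success rate as a percentage: tasks completed / tasks attempted.
function successRate(outcomes: TaskOutcome[]): number {
  if (outcomes.length === 0) {
    throw new Error('successRate needs at least one outcome');
  }
  const passed = outcomes.filter((o) => o.success).length;
  return (passed / outcomes.length) * 100;
}

// Gap between two rates in percentage points, rounded to two decimals.
function gapInPoints(humanRate: number, agentRate: number): number {
  return Number((humanRate - agentRate).toFixed(2));
}

// Made-up outcomes, for illustration only.
const sample: TaskOutcome[] = [
  { taskId: '1', success: true },
  { taskId: '2', success: false },
  { taskId: '3', success: false },
  { taskId: '4', success: false },
];

console.log(successRate(sample)); // 25
console.log(gapInPoints(78.24, 14.41)); // 63.83
```

With the figures quoted above, the gap between human performance and the GPT-4-based agent is 63.83 percentage points.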