Overview

Agentbench

Agentbench is an evaluation framework for assessing the capabilities of Large Language Models (LLMs) acting as autonomous agents. It presents complex tasks and scenarios that test a model's ability to understand, plan, and execute actions in diverse environments.

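For example, a task run can be started through the benchthing client and its result fetched once the run completes:
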
import { Bench } from 'benchthing';

// Create a client scoped to the Agentbench benchmark.
const bench = new Bench('agentbench');

// Start a run of task 1 with the agents under test.
await bench.run({
  benchmark: 'agentbench',
  taskId: '1',
  agents: yourAgents,
});

// Fetch the result for the same task once the run has finished.
const result = await bench.getResult('1');
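
The example assumes an ES module context where top-level await is available; yourAgents is a placeholder for the agent implementations you want to evaluate, and the taskId passed to getResult matches the one used for the run.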

Sign up to get access to the Agentbench benchmark API.