A systematic benchmark for evaluating LLMs on task automation, using a Tool Graph to represent decomposed tasks. Includes a multi-faceted evaluation methodology (TaskEval) that scores performance across three stages: task decomposition, tool selection, and parameter prediction. The dataset combines automated construction with human verification to ensure high-quality evaluation data. Accepted to NeurIPS 2024.
from benchthing import Bench

# Create a client for the TaskBench benchmark.
bench = Bench("taskbench")

# Run a single TaskBench task against your task-automation agent.
bench.run(
    benchmark="taskbench",
    task_id="1",
    agents=yourTaskAutomationAgent
)

# Fetch the evaluation result for that task once the run completes.
result = bench.get_result("1")
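TaskEval scores each stage separately, so it is useful to look at the result per stage rather than as a single number. The snippet below is a minimal sketch of how such per-stage metrics might be inspected; the structure of `result` and the metric names and values shown here are illustrative assumptions, not the documented benchthing schema.

# Hypothetical shape of a TaskBench result; keys and values below are
# placeholders for illustration, not real benchmark output.
result = {
    "task_decomposition": {"rougeL": 0.52},
    "tool_selection": {"node_f1": 0.71, "edge_f1": 0.58},
    "parameter_prediction": {"name_f1": 0.65, "value_f1": 0.60},
}

# Print a per-stage summary so weaknesses in a specific stage stand out.
for stage, metrics in result.items():
    summary = ", ".join(f"{name}={score:.2f}" for name, score in metrics.items())
    print(f"{stage}: {summary}")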