A systematic benchmark for evaluating LLMs on task automation, using a Tool Graph to represent decomposed tasks. Includes a multi-faceted evaluation methodology (TaskEval) that scores performance across three stages: task decomposition, tool selection, and parameter prediction. The dataset combines automated construction with human verification to ensure high-quality evaluation data. Accepted to NeurIPS 2024.
from benchthing import Bench

# Create a client for the TaskBench benchmark.
bench = Bench("taskbench")

# Run a single TaskBench task against your task-automation agent.
bench.run(
    benchmark="taskbench",
    task_id="1",
    agents=yourTaskAutomationAgent
)

# Fetch the evaluation result for that task once the run completes.
result = bench.get_result("1")
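TaskEval scores each stage separately, so it is useful to look at the result per stage rather than as a single number. The snippet below is a minimal sketch of how such per-stage metrics might be inspected; the structure of `result` and the metric names and values shown here are illustrative assumptions, not the documented benchthing schema.

# Hypothetical shape of a TaskBench result; keys and values below are
# placeholders for illustration, not real benchmark output.
result = {
    "task_decomposition": {"rougeL": 0.52},
    "tool_selection": {"node_f1": 0.71, "edge_f1": 0.58},
    "parameter_prediction": {"name_f1": 0.65, "value_f1": 0.60},
}

# Print a per-stage summary so weaknesses in a specific stage stand out.
for stage, metrics in result.items():
    summary = ", ".join(f"{name}={score:.2f}" for name, score in metrics.items())
    print(f"{stage}: {summary}")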