A benchmark evolved from ToolBench that introduces a virtual API server and a stable evaluation system. Features a caching layer and API simulators to cope with changing or unavailable real APIs, plus stable evaluation metrics using GPT-4 as an automatic evaluator. Covers instruction following, tool selection, and multi-step reasoning.
from benchthing import Bench

bench = Bench("stable-tool-bench")

# Run the benchmark task against the models to be compared
bench.run(
    benchmark="stable-tool-bench",
    task_id="1",
    models=["gpt-4-turbo", "gpt-3.5-turbo"],
)

# Retrieve the evaluation result for the task
result = bench.get_result("1")
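To illustrate the caching-plus-simulation idea described above, here is a minimal sketch of how a virtual API server might serve tool calls: cached responses are returned first for stability, a live API is tried on a cache miss, and an LLM-based simulator acts as the fallback when the real API fails. The names VirtualAPIServer, call_real_api, and simulate_api_response are illustrative assumptions, not part of StableToolBench's or benchthing's actual API.

import json
from typing import Callable, Dict


class VirtualAPIServer:
    """Serve tool calls from a cache, the live API, or an LLM simulator."""

    def __init__(
        self,
        call_real_api: Callable[[str, dict], dict],
        simulate_api_response: Callable[[str, dict], dict],
    ):
        # Hypothetical callables standing in for the real API client and
        # the LLM-based simulator; both take an API name and parameters.
        self.cache: Dict[str, dict] = {}
        self.call_real_api = call_real_api
        self.simulate_api_response = simulate_api_response

    def call(self, api_name: str, params: dict) -> dict:
        # Cache hit: return the stored response so evaluation stays stable
        # even if the upstream API later changes or goes offline.
        key = json.dumps({"api": api_name, "params": params}, sort_keys=True)
        if key in self.cache:
            return self.cache[key]

        # Cache miss: try the real API first, then fall back to the simulator.
        try:
            response = self.call_real_api(api_name, params)
        except Exception:
            response = self.simulate_api_response(api_name, params)

        self.cache[key] = response
        return response

The cache key is built from the API name and parameters so that repeated identical calls across evaluation runs hit the same stored response, which is the property that makes scores reproducible over time.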