A benchmark evolved from ToolBench that introduces a virtual API server and a stable evaluation system. Features a caching layer and API simulators to cope with changing or unavailable real APIs, plus stable evaluation metrics using GPT-4 as an automatic evaluator. Covers instruction following, tool selection, and multi-step reasoning.
from benchthing import Bench

bench = Bench("stable-tool-bench")

# Run the benchmark task against the models to be compared
bench.run(
    benchmark="stable-tool-bench",
    task_id="1",
    models=["gpt-4-turbo", "gpt-3.5-turbo"],
)

# Retrieve the evaluation result for the task
result = bench.get_result("1")
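To illustrate the caching-plus-simulation idea described above, here is a minimal sketch of how a virtual API server might serve tool calls: cached responses are returned first for stability, a live API is tried on a cache miss, and an LLM-based simulator acts as the fallback when the real API fails. The names VirtualAPIServer, call_real_api, and simulate_api_response are illustrative assumptions, not part of StableToolBench's or benchthing's actual API.

import json
from typing import Callable, Dict


class VirtualAPIServer:
    """Serve tool calls from a cache, the live API, or an LLM simulator."""

    def __init__(
        self,
        call_real_api: Callable[[str, dict], dict],
        simulate_api_response: Callable[[str, dict], dict],
    ):
        # Hypothetical callables standing in for the real API client and
        # the LLM-based simulator; both take an API name and parameters.
        self.cache: Dict[str, dict] = {}
        self.call_real_api = call_real_api
        self.simulate_api_response = simulate_api_response

    def call(self, api_name: str, params: dict) -> dict:
        # Cache hit: return the stored response so evaluation stays stable
        # even if the upstream API later changes or goes offline.
        key = json.dumps({"api": api_name, "params": params}, sort_keys=True)
        if key in self.cache:
            return self.cache[key]

        # Cache miss: try the real API first, then fall back to the simulator.
        try:
            response = self.call_real_api(api_name, params)
        except Exception:
            response = self.simulate_api_response(api_name, params)

        self.cache[key] = response
        return response

The cache key is built from the API name and parameters so that repeated identical calls across evaluation runs hit the same stored response, which is the property that makes scores reproducible over time.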