The Berkeley Function Calling Leaderboard (BFCL) evaluates how effectively LLMs invoke functions and tools. BFCL V3 adds multi-turn and multi-step scenarios, testing models on complex interactions that require state tracking, implicit actions, and error recovery. Its 1000+ test cases span diverse categories, including Python functions, REST APIs, SQL queries, and domain-specific tools.
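For context, a BFCL-style test case pairs a natural-language request with one or more function schemas the model may call. The sketch below is illustrative only; the field names and values are assumptions, not the dataset's exact format.

# Illustrative sketch of a BFCL-style test case: a user question plus a
# JSON-schema-like function definition. Field names here are assumptions,
# not necessarily the dataset's exact keys.
test_case = {
    "question": "What's the weather in Berlin right now?",
    "function": [
        {
            "name": "get_weather",
            "description": "Fetch the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"}
                },
                "required": ["city"],
            },
        }
    ],
}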
from benchthing import Bench

# Placeholder: replace with the model identifiers you want to evaluate.
your_language_models = ["model-a", "model-b"]

bench = Bench("bfcl")

# Submit a BFCL run; task_id links the run to its result.
bench.run(
    benchmark="bfcl",
    task_id="1",
    models=your_language_models,
)

# Retrieve the result for the task submitted above.
result = bench.get_result("1")
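Here result holds the outcome for the task submitted as "1"; inspecting it (for example with print(result)) should show the evaluated models' scores and outputs, though the exact fields depend on the benchthing release you are using.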