NexusBench

NexusBench is a benchmark suite for evaluating LLMs on real-world, enterprise-level function calling and agent scenarios. It includes specialized benchmarks for IT ticket systems, security tools (NVD, VirusTotal), and complex multi-step interactions, and has been used to compare models such as Athene-V2 against GPT-4 on practical tool-use cases.
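To make "function calling" evaluation concrete, here is a minimal sketch of how a multi-step tool-use task could be scored by comparing an agent's predicted calls against a reference sequence. This is an illustration only, not NexusBench's actual scoring method, which is not described here. The `ToolCall` type, the `score_task` function, and the tool names `search_cve` and `create_ticket` are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolCall:
    """One function call: a tool name plus keyword arguments (hypothetical type)."""
    name: str
    arguments: dict[str, Any] = field(default_factory=dict)


def call_matches(predicted: ToolCall, expected: ToolCall) -> bool:
    """A call counts as correct only if both the name and all arguments match exactly."""
    return predicted.name == expected.name and predicted.arguments == expected.arguments


def score_task(predicted: list[ToolCall], expected: list[ToolCall]) -> float:
    """Fraction of positions where the predicted call matches the expected one.

    Dividing by the longer sequence penalizes both missing and extra calls.
    """
    if not expected:
        return 1.0 if not predicted else 0.0
    hits = sum(call_matches(p, e) for p, e in zip(predicted, expected))
    return hits / max(len(predicted), len(expected))


# Hypothetical two-step task: look up a vulnerability, then file a ticket.
expected = [
    ToolCall("search_cve", {"keyword": "openssl"}),
    ToolCall("create_ticket", {"priority": "high"}),
]
# The agent picks the right tools but gets one argument wrong in step two.
predicted = [
    ToolCall("search_cve", {"keyword": "openssl"}),
    ToolCall("create_ticket", {"priority": "low"}),
]

score = score_task(predicted, expected)
print(score)
```

Exact matching is the strictest choice; a real benchmark might instead normalize arguments or execute the calls and compare outcomes.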

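from benchthing import Bench

# Select the NexusBench benchmark.
bench = Bench("nexus-bench")

# Run task "1" against your agent.
# yourAgentModel is a placeholder: define your own agent or model before this call.
bench.run(
    benchmark="nexus-bench",
    task_id="1",
    agents=yourAgentModel,
)

# Retrieve the result for task "1".
result = bench.get_result("1")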

Sign up to get access to the NexusBench benchmark API.