A benchmark designed to test language models' ability to reason over facts scattered across extremely long documents. Covers 20 reasoning tasks, including fact chaining, simple induction, deduction, counting, and handling lists and sets. Each sample hides bAbI task sentences inside PG19 book text, producing contexts of up to 10 million tokens.
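For intuition, here is a minimal sketch of that construction: the task facts (the "needle") are scattered through unrelated book text (the "haystack"), with the question appended at the end. Everything below is illustrative only; build_babilong_sample is a hypothetical helper, the filler sentences stand in for PG19 passages, and word counts approximate tokens, whereas the real BABILong generator controls context length far more carefully.

import random

def build_babilong_sample(facts, question, filler_text, target_words=2000):
    """Scatter task facts through unrelated background text (illustration only)."""
    filler_sentences = [s.strip() for s in filler_text.split(".") if s.strip()]

    # Pad with background sentences until roughly the target length is reached.
    haystack = []
    words = 0
    while words < target_words and filler_sentences:
        sentence = random.choice(filler_sentences) + "."
        haystack.append(sentence)
        words += len(sentence.split())

    # Drop each fact at a random position so the model has to search for it.
    for fact in facts:
        haystack.insert(random.randrange(len(haystack) + 1), fact)

    # The question comes last, as in the bAbI format.
    return " ".join(haystack + [question])

# Toy facts in the style of bAbI task 1, with stand-in filler text.
sample = build_babilong_sample(
    facts=["Mary moved to the bathroom.", "John went to the hallway."],
    question="Where is Mary?",
    filler_text="The old house stood at the end of a long gravel drive. "
                "Rain had been falling since morning.",
)

To run the benchmark itself through benchthing: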
from benchthing import Bench

# Run BABILong task 1 against several models.
bench = Bench("babilong")
bench.run(
    benchmark="babilong",
    task_id="1",
    models=['gpt-4', 'claude-2', 'mistral-7b', 'yourLongContextModel'],
)

# Retrieve the scores once the run for task 1 has finished.
result = bench.get_result("1")
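The structure of the returned result depends on benchthing; assuming it is a dict-like mapping from model name to task accuracy (an assumption, not part of any documented API), a quick summary could be printed like this:

# Assumption: `result` maps each model name to its accuracy on task 1.
for model_name, accuracy in result.items():
    print(f"{model_name}: {accuracy:.2%}")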