InfiniteBench is the first comprehensive benchmark whose average data length exceeds 100K tokens. It comprises 12 tasks across diverse domains (novels, code, math, etc.) in both English and Chinese. The tasks combine real-world scenarios with synthetic constructs designed to test specific long-context capabilities such as information retrieval, state preservation, and sequential processing.
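from benchthing import Bench

# Create a client for the InfiniteBench benchmark.
bench = Bench("infinitebench")

# Start a run with task_id "1" for the listed models.
# 'yourLongContextModel' is a placeholder for your own model.
bench.run(
    benchmark="infinitebench",
    task_id="1",
    models=['gpt-4', 'claude-2', 'yarn-mistral-7b', 'yourLongContextModel']
)

# Fetch the result using the same task_id.
result = bench.get_result("1")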
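
To give a feel for the synthetic retrieval constructs mentioned in the description, here is a minimal sketch of a passkey-style task: a short "needle" is hidden at a random position inside repetitive filler text, and the model is asked to recall it. This is an illustration only, not the benchmark's actual generator. The filler text, the prompt wording, and the helper names `build_passkey_prompt` and `score` are all assumptions made for this example.

```python
import random

# Hypothetical filler text; the real benchmark's filler may differ.
FILLER = (
    "The grass is green. The sky is blue. The sun is yellow. "
    "Here we go. There and back again. "
)


def build_passkey_prompt(n_filler: int, seed: int = 0) -> tuple[str, str]:
    """Return (prompt, passkey) with the passkey hidden among filler blocks."""
    rng = random.Random(seed)
    passkey = str(rng.randint(10000, 99999))  # always five digits
    needle = f"The pass key is {passkey}. Remember it. "
    # Pick how many filler blocks come before the needle.
    position = rng.randint(0, n_filler)
    context = FILLER * position + needle + FILLER * (n_filler - position)
    return context + "What is the pass key?", passkey


def score(model_output: str, passkey: str) -> bool:
    """Count the answer as correct if the passkey appears in the output."""
    return passkey in model_output


if __name__ == "__main__":
    prompt, passkey = build_passkey_prompt(n_filler=2000, seed=42)
    print(f"{len(prompt)} characters, passkey hidden inside")
    print(score(f"The pass key is {passkey}.", passkey))
```

Raising `n_filler` lengthens the context, so the same construct can probe retrieval at different lengths. Substring matching is a deliberately simple scoring choice for this sketch.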