A large-scale evaluation suite of 505 realistic multimodal tasks comprising over 8,000 samples annotated by 16 expert annotators. Unlike benchmarks built around standard multiple-choice questions, MEGA-Bench supports diverse output formats, including numbers, phrases, code, LaTeX, coordinates, JSON, and free-form text, evaluated with over 40 different metrics.
from benchthing import Bench

# Initialize a client for the MEGA-Bench benchmark
bench = Bench("mega-bench")

# Run a single MEGA-Bench task against two multimodal models
bench.run(
    benchmark="mega-bench",
    task_id="1",
    models=["multimodal-model-1", "multimodal-model-2"],
)

# Retrieve the evaluation result for the task once the run completes
result = bench.get_result("1")