TriviaQA is a realistic text-based question answering dataset with 950K question-answer pairs drawn from 662K documents collected from Wikipedia and the web. It is more challenging than standard QA benchmarks because answers may not be directly obtainable by span prediction and the contexts are very long. The dataset includes both human-verified and machine-generated QA subsets.
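For a sense of the underlying data, the snippet below loads one TriviaQA example with the Hugging Face datasets library and prints its question and answer fields; the hub dataset ID "trivia_qa" and the "rc.nocontext" configuration are assumptions here and are separate from benchthing.

# Sketch, assuming the TriviaQA copy on the Hugging Face Hub ("trivia_qa",
# "rc.nocontext" configuration); not part of the benchthing API.
from datasets import load_dataset

dataset = load_dataset("trivia_qa", "rc.nocontext", split="validation")
example = dataset[0]

print(example["question"])           # trivia question as free text
print(example["answer"]["value"])    # canonical answer string
print(example["answer"]["aliases"])  # accepted answer variants

Running the benchmark itself with benchthing: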
from benchthing import Bench

bench = Bench("trivia-qa")

# Kick off the TriviaQA benchmark run for your model(s)
bench.run(
    benchmark="trivia-qa",
    task_id="1",
    models=yourQuestionAnsweringModel
)

# Fetch the result for the task once the run has finished
result = bench.get_result("1")
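The models argument above is a placeholder, and the exact model interface benchthing expects is not specified here. As an illustration only, a simple extractive QA model could be built with the Hugging Face transformers pipeline; the checkpoint name and the wrapper signature below are assumptions for the sketch, not benchthing's contract.

# Hypothetical sketch of a QA callable; whether benchthing accepts a plain
# callable with this (question, context) -> answer signature is an assumption.
from transformers import pipeline

qa_pipeline = pipeline(
    "question-answering",
    model="distilbert-base-cased-distilled-squad",
)

def yourQuestionAnsweringModel(question: str, context: str) -> str:
    # Return the highest-scoring answer span predicted over the given context.
    prediction = qa_pipeline(question=question, context=context)
    return prediction["answer"]

Note that a purely extractive model like this is a weak baseline here, since TriviaQA answers may not appear verbatim as a short span in the very long contexts.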