Overview

HaluBench

A benchmark of 15,000 context-question-answer triplets annotated for hallucinations, sourced from real-world domains including finance and medicine. Built from examples in FinanceBench, PubmedQA, CovidQA, HaluEval, DROP, and RAGTruth, it is designed to evaluate a model's ability to detect hallucinations in challenging scenarios.
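
Each triplet pairs a passage of context and a question with an answer labeled as faithful or hallucinated with respect to that context. As a rough sketch of what one annotated example might look like (the field names and label values here are illustrative, not HaluBench's actual schema):

# Illustrative only: field names and label values are assumptions, not the benchmark's schema
example_triplet = {
    "context": "Aspirin is commonly used to reduce fever and relieve mild pain.",
    "question": "What is aspirin commonly used for?",
    "answer": "Aspirin is commonly used to treat bacterial infections.",
    "label": "hallucinated",  # the answer is not supported by the context
}

To run HaluBench against your own models through the benchthing client: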

from benchthing import Bench

# Placeholder: replace with the identifiers of the language models you want to evaluate
your_language_models = ["model-a", "model-b"]

bench = Bench("halu-bench")

# Start a HaluBench run for the chosen models under task ID "1"
bench.run(
    benchmark="halu-bench",
    task_id="1",
    models=your_language_models,
)

# Fetch the result for task "1"
result = bench.get_result("1")

Sign up to get access to the HaluBench benchmark API.