Holistic Evaluation of Language Models (HELM) is a framework from Stanford CRFM for evaluating language models across a broad set of scenarios. It spans 42 scenarios, 16 of which form the core set, and assesses models along seven metrics: accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency.
```python
from benchthing import Bench

bench = Bench("helm")

# Placeholder: substitute the identifiers of the models you want to evaluate.
your_language_models = ["model-a", "model-b"]

# Queue a HELM run for the given models under task id "1".
bench.run(
    benchmark="helm",
    task_id="1",
    models=your_language_models,
)

# Retrieve the result for the task queued above.
result = bench.get_result("1")
```