Holistic Evaluation of Language Models (HELM) is a framework from Stanford CRFM for evaluating language models across a broad set of scenarios. It spans 42 scenarios, 16 of which form the core set, and assesses models along seven metrics: accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency.
```python
from benchthing import Bench

bench = Bench("helm")

# Placeholder: substitute the identifiers of the models you want to evaluate.
your_language_models = ["model-a", "model-b"]

# Queue a HELM run for the given models under task id "1".
bench.run(
    benchmark="helm",
    task_id="1",
    models=your_language_models,
)

# Retrieve the result for the task queued above.
result = bench.get_result("1")
```