Benchmarkthing

Hassle-free execution of AI system evaluations

Run your AI evals and benchmarks in the cloud.
Save weeks of development with just a few lines of code.

import { Bench } from 'benchthing';

// Create a client for the benchmark you want to run
const bench = new Bench('webarena');

// Run task 1 of WebArena with your own agents
await bench.run({
  benchmark: 'webarena',
  taskId: '1',
  agents: yourAgents, // your agent implementations
});

// Retrieve the result for task 1
const result = await bench.getResult('1');

Use Cases

Largest library of benchmarks

Utilize the largest library of benchmarks for comprehensive evaluations.

Extend existing benchmarks

Easily extend and customize existing benchmarks to fit your specific needs.

Create your own evals

Design and implement your own system evaluations with flexibility and ease; a minimal sketch of what a custom eval involves follows below.
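To make the custom-eval idea concrete, here is a minimal sketch of what an eval reduces to: a set of tasks plus a scoring function applied to an agent's outputs. The names below (Task, tasks, score, runEval) and the invoice example are illustrative assumptions, not part of the Benchmarkthing API.

// Illustrative sketch only: a custom eval boils down to tasks plus a scorer.
// None of these names come from the Benchmarkthing API.
type Task = { id: string; input: string; expected: string };

const tasks: Task[] = [
  { id: '1', input: 'Invoice #123, total $42.00', expected: '42.00' },
  { id: '2', input: 'Invoice #124, total $7.50', expected: '7.50' },
];

// Score 1 if the agent's answer contains the expected value, otherwise 0.
const score = (output: string, expected: string): number =>
  output.includes(expected) ? 1 : 0;

// Run every task through an agent and return the mean score.
async function runEval(agent: (input: string) => Promise<string>): Promise<number> {
  let total = 0;
  for (const task of tasks) {
    total += score(await agent(task.input), task.expected);
  }
  return total / tasks.length;
}

An eval defined in this shape could then be handed off for hosted execution, in the same spirit as the quick-start snippet above.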

What Our Users Say

"If Benchmarkthing existed before, it would have saved me weeks of setting up miscellaneous sub-tasks in VLMs. I'm excited about using it to benchmark other Computer Vision tasks."

Tianpei Gu

Research Scientist at TikTok

"With Benchmarkthing's endpoint, I was able to focus on developing web agents instead of setting up configs and environments for the task execution."

Yitao Liu

NLP Researcher at Princeton

"Using Benchmarkthing is like having Codecov but for our Retrieval-Augmented Generation (RAG) workflows. It makes them a lot more reliable."

Gus Ye

Senior AI Engineer, Founder of Memobase.io

Popular Benchmarks

WebArena

A realistic web environment for developing autonomous agents. The best GPT-4 agent achieves a 14.41% end-to-end success rate, versus 78.24% for humans.

SWE-bench

A benchmark that tests whether language models can resolve real-world GitHub issues drawn from popular Python repositories.

AgentBench

A comprehensive benchmark for evaluating LLMs as agents (ICLR'24).

Tau (τ)-Bench

A benchmark for evaluating AI agents in dynamic, tool-using interactions with simulated users in real-world domains.

BIRD-SQL

A pioneering, cross-domain dataset for evaluating text-to-SQL models on large-scale databases.

LegalBench

A collaboratively built benchmark for measuring legal reasoning in large language models.

STS (Semantic Textual Similarity)

A benchmark for evaluating how well models judge the degree of semantic similarity between pairs of sentences.

GLUE

A multi-task benchmark for evaluating natural language understanding across tasks such as sentiment analysis and textual entailment.

MS MARCO

A large-scale dataset for benchmarking information retrieval systems.