Benchmarkthing

Hassle-free execution of AI system evaluations

Run your AI evals and benchmarks in the cloud.
Save weeks of development with just a few lines of code.

import { Bench } from 'benchthing';

// Create a client for the benchmark you want to run
const bench = new Bench('webarena');

// Run task 1 of WebArena with your own agents
await bench.run({
  benchmark: 'webarena',
  taskId: '1',
  agents: yourAgents, // your agent implementations
});

// Retrieve the result for task 1
const result = await bench.getResult('1');

Use Cases

Largest library of benchmarks

Utilize the largest library of benchmarks for comprehensive evaluations.

Extend existing benchmarks

Easily extend and customize existing benchmarks to fit your specific needs.

Create your own evals

Design and implement your own system evaluations with flexibility and ease; a minimal sketch of what a custom eval involves follows below.
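To make the custom-eval idea concrete, here is a minimal sketch of what an eval reduces to: a set of tasks plus a scoring function applied to an agent's outputs. The names below (Task, tasks, score, runEval) and the invoice example are illustrative assumptions, not part of the Benchmarkthing API.

// Illustrative sketch only: a custom eval boils down to tasks plus a scorer.
// None of these names come from the Benchmarkthing API.
type Task = { id: string; input: string; expected: string };

const tasks: Task[] = [
  { id: '1', input: 'Invoice #123, total $42.00', expected: '42.00' },
  { id: '2', input: 'Invoice #124, total $7.50', expected: '7.50' },
];

// Score 1 if the agent's answer contains the expected value, otherwise 0.
const score = (output: string, expected: string): number =>
  output.includes(expected) ? 1 : 0;

// Run every task through an agent and return the mean score.
async function runEval(agent: (input: string) => Promise<string>): Promise<number> {
  let total = 0;
  for (const task of tasks) {
    total += score(await agent(task.input), task.expected);
  }
  return total / tasks.length;
}

An eval defined in this shape could then be handed off for hosted execution, in the same spirit as the quick-start snippet above.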

What Our Users Say

"If Benchmarkthing existed before, it would have saved me weeks of setting up miscellaneous sub-tasks in VLMs. I'm excited about using it to benchmark other Computer Vision tasks."

Tianpei Gu

Research Scientist at TikTok

"With Benchmarkthing's endpoint, I was able to focus on developing web agents instead of setting up configs and environments for the task execution."

Yitao Liu

NLP Researcher at Princeton

"Using Benchmarkthing is like having Codecov but for our Retrieval-Augmented Generation (RAG) workflows. It makes them a lot more reliable."

Gus Ye

Senior AI Engineer, Founder of Memobase.io

Popular Benchmarks

WebArena

A realistic web environment for developing autonomous agents. The best GPT-4 agent achieves a 14.41% end-to-end success rate, versus 78.24% for humans.

SWE-bench

A benchmark that tests whether language models can resolve real-world GitHub issues drawn from popular Python repositories.

AgentBench

A comprehensive benchmark for evaluating LLMs as agents (ICLR'24).

Tau (τ)-Bench

A benchmark for evaluating AI agents in dynamic, tool-using interactions with simulated users in real-world domains.

BIRD-SQL

A pioneering, cross-domain dataset for evaluating text-to-SQL models on large-scale databases.

LegalBench

A collaboratively built benchmark for measuring legal reasoning in large language models.

STS (Semantic Textual Similarity)

A benchmark for evaluating how well models judge the degree of semantic similarity between pairs of sentences.

GLUE

A multi-task benchmark for evaluating natural language understanding across tasks such as sentiment analysis and textual entailment.

MS MARCO

A large-scale dataset for benchmarking information retrieval systems.