
Evals as an API

Run out-of-the-box evals and benchmarks in the cloud. Save weeks of setup and development by running evals on our platform.

Node.js · Python · HTTP
from benchmarkthing import Bench

# Point the client at the WebArena benchmark
bench = Bench("webarena")

# Kick off task "1" with your agents
bench.run(
    benchmark="webarena",
    task_id="1",
    agents=your_agents,
)

# Fetch the result for task "1" once the run completes
result = bench.get_result("1")

Backed By

Jeff Dean
Chief Scientist, Google
Arash Ferdowsi
Founder/CTO of Dropbox
+ more
Raised $1M+

Use Cases

Largest library of benchmarks

Utilize the largest library of benchmarks for comprehensive evaluations.


Extend existing benchmarks

Easily extend and customize existing benchmarks to fit your specific needs.
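As a rough illustration of what extending a benchmark's task set could look like, here is a minimal self-contained Python sketch. The `task_id`/`intent` fields and the `extend_tasks` helper are assumptions made for this example, not the platform's actual API.

```python
# Hypothetical sketch: extending a benchmark's task list with custom
# tasks before running it. Field names are illustrative assumptions.

base_tasks = [
    {"task_id": "1", "intent": "Find the price of item X"},
    {"task_id": "2", "intent": "Log in and check order status"},
]

def extend_tasks(tasks, extra):
    """Return a new task list with custom tasks appended,
    checking that task_ids stay unique."""
    combined = tasks + extra
    ids = [t["task_id"] for t in combined]
    assert len(ids) == len(set(ids)), "duplicate task_id"
    return combined

custom = [{"task_id": "3", "intent": "Compare two products"}]
all_tasks = extend_tasks(base_tasks, custom)
print(len(all_tasks))  # → 3
```

The uniqueness check mirrors what a hosted platform would likely enforce on upload: task IDs must not collide with the base benchmark's.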


Create your own evals

Design and implement your own system evaluations with flexibility and ease.
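At its core, a custom eval pairs a task with a grading function. Here is a minimal sketch of one possible grader, assuming a simple substring-match scoring scheme; the names are illustrative, not the platform's API.

```python
# Hypothetical sketch of a custom eval grader: scores an agent's answer
# 1.0 if it contains the expected substring (case-insensitive), else 0.0.

def exact_substring_grader(expected: str):
    def grade(answer: str) -> float:
        return 1.0 if expected.lower() in answer.lower() else 0.0
    return grade

grader = exact_substring_grader("42")
print(grader("The answer is 42."))  # → 1.0
print(grader("no idea"))            # → 0.0
```

Real evals usually compose several such graders (exact match, fuzzy match, model-graded) over a task set, but the pattern of task plus scoring function stays the same.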


What Our Users Say


Tianpei Gu

Research Scientist at TikTok

"If Benchmarkthing existed before, it would have saved me weeks of setting up miscellaneous sub-tasks in VLMs. I'm excited about using it to benchmark other Computer Vision tasks."


Yitao Liu

NLP Researcher at Princeton

"With Benchmarkthing's endpoint, I was able to focus on developing web agents instead of setting up configs and environments for the task execution."


Gus Ye

Founder of Memobase.io

"Using Benchmarkthing is like having Codecov but for our Retrieval-Augmented Generation (RAG) workflows. It makes them a lot more reliable."

Popular Benchmarks

Agent
Code
General
Embedding
Performance
Vision
Long Context