Run out-of-the-box evals or benchmarks in the cloud. Save weeks of setup and development work by running evals on our platform.
from benchmarkthing import Bench

bench = Bench("webarena")

bench.run(
    benchmark="webarena",
    task_id="1",
    agents=your_agents
)

result = bench.get_result("1")
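The same two calls extend naturally to a batch of tasks. The loop below is a minimal sketch that reuses only the bench.run and bench.get_result calls shown above; the specific task IDs and the choice to collect results into a plain dict are illustrative, not prescribed by the platform.

# Continuing from the snippet above: run several WebArena tasks and
# collect their results. The task IDs here are illustrative.
task_ids = ["1", "2", "3"]

for task_id in task_ids:
    bench.run(
        benchmark="webarena",
        task_id=task_id,
        agents=your_agents
    )

# One result per task, keyed by task ID.
results = {task_id: bench.get_result(task_id) for task_id in task_ids}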
Utilize the largest library of benchmarks for comprehensive evaluations.
Easily extend and customize existing benchmarks to fit your specific needs.
Design and implement your own system evaluations with flexibility and ease.
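A custom evaluation of the kind described above ultimately comes down to a scoring function over tasks. The sketch below shows one minimal shape such a check can take in plain Python; treating the agent as a callable and the prompt/expected fields are assumptions made for illustration and are not part of the Benchmarkthing API shown above.

# A hypothetical custom evaluation in plain Python. The callable agent and
# the prompt/expected task fields are illustrative, not Benchmarkthing API.
def evaluate_exact_match(agent, tasks):
    """Score an agent by exact-match accuracy over a list of tasks."""
    passed = 0
    for task in tasks:
        answer = agent(task["prompt"])  # agent is assumed to be a callable
        if answer.strip().lower() == task["expected"].strip().lower():
            passed += 1
    return {"accuracy": passed / len(tasks)}

# Example usage with a trivial stand-in agent.
sample_tasks = [{"prompt": "2 + 2 = ?", "expected": "4"}]
print(evaluate_exact_match(lambda prompt: "4", sample_tasks))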
Research Scientist at TikTok
"If Benchmarkthing existed before, it would have saved me weeks of setting up miscellaneous sub-tasks in VLMs. I'm excited about using it to benchmark other Computer Vision tasks."
NLP Researcher at Princeton
"With Benchmarkthing's endpoint, I was able to focus on developing web agents instead of setting up configs and environments for the task execution."
Founder of Memobase.io
"Using Benchmarkthing is like having Codecov but for our Retrieval-Augmented Generation (RAG) workflows. It makes them a lot more reliable."