Leaderboard Integrity and Timeouts

📢 Update: 2025-09-10

OB-1 has responded to our request to resubmit their scores with the correct timeouts and is once again the highest scoring agent. We have added them back to the leaderboard. We're very grateful for their quick response and full cooperation.

The Terminal-Bench leaderboard showcases the ability of agents and LLMs to resolve real-world tasks in containerized environments. It gives agent developers a source of credibility when their agent performs well, and can help inform users about which agent may be worth adopting.

A leaderboard is only useful to the extent that it evaluates every entry under the same conditions. For our leaderboard, this means that entrants must solve the same tasks under the time constraints we specify. We provide a command at the top of the leaderboard webpage to ensure entries follow these constraints.

Last week, we received a submission from OpenBlock Labs that seemed to qualify for the top spot on our leaderboard. Upon closer inspection, we realized that this submission modified timeouts in a way our team does not permit.* Timeouts affect task difficulty. Just as a student may perform better on an exam when given more time, an agent may perform better on a task when given more time.

We believe that OpenBlock Labs acted in good faith and did not realize that modifying timeouts altered the tasks in a way that prevented comparison with other leaderboard submissions.

We have updated our documentation to clarify these timeout constraints, created a checklist for our submission approval process, and replaced their leaderboard score with a footnote until they resubmit. Going forward, we intend to add an optional validation process for agents that provide reproducible submissions.

We remain excited about OpenBlock's agent OB-1. We refer readers to their blog post on the innovations that contributed to their score, and encourage readers to try out their agent for themselves. We look forward to seeing how their agent performs when given the same time constraints as other agents.

*Specifically, OpenBlock Labs set the timeout to a fixed 30 minutes for every task in the dataset. Most tasks in our dataset have a 5-minute time limit, so this change increased the timeout of 75 tasks in our dataset, left 2 unchanged, and reduced the timeout of 3 tasks.
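
For readers who want to see how a single fixed timeout can raise some limits and lower others, the comparison in the footnote can be sketched in a few lines of Python. This is only an illustration and not the tooling we use: the function name, the task names, and the default timeouts below are hypothetical placeholders, and only the 5-minute and 30-minute values come from the text above.

```python
# Illustrative sketch only: task names and defaults are hypothetical,
# not the real Terminal-Bench task configuration.

FIXED_OVERRIDE_SEC = 30 * 60  # the fixed 30-minute timeout described in the footnote


def classify_override(default_timeouts, override_sec=FIXED_OVERRIDE_SEC):
    """Count how a fixed override changes each task's default timeout."""
    counts = {"increased": 0, "unchanged": 0, "reduced": 0}
    for default_sec in default_timeouts.values():
        if override_sec > default_sec:
            counts["increased"] += 1   # agent gets more time than intended
        elif override_sec == default_sec:
            counts["unchanged"] += 1   # override matches the task's own limit
        else:
            counts["reduced"] += 1     # agent gets less time than intended
    return counts


# Hypothetical per-task default timeouts, in seconds.
example_defaults = {
    "hypothetical-task-a": 5 * 60,
    "hypothetical-task-b": 5 * 60,
    "hypothetical-task-c": 30 * 60,
    "hypothetical-task-d": 60 * 60,
}

print(classify_override(example_defaults))
# {'increased': 2, 'unchanged': 1, 'reduced': 1}
```

Any task whose default limit is shorter than 30 minutes counts as increased, which is why a dataset made up mostly of 5-minute tasks is affected so heavily by a fixed 30-minute override.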