The Terminal-Bench community has identified multiple instances of cheating* and reward hacking**. Given these occurrences, we are taking steps to make the Terminal-Bench 2.0 leaderboard more reliable.
- ATIF trajectories are required for all passing trials
- Reward hacking (e.g. finding solutions on the internet) will result in a reward of 0 for the affected trial
- Cheating will result in a submission being taken down immediately. Our team will determine whether the organization can resubmit on a case-by-case basis.
Reward hacking occurs when a model exploits a loophole to resolve a task without demonstrating the capability the task was intended to measure. Most often, the loophole is an agent finding the solution on the internet, an unfortunate possibility for a publicly available, open-internet benchmark.
To detect reward hacking, we will run an agent judge over all passing trials in a submission. Submitters can challenge any reward-hacking claim the judge makes. We will open-source our judge so submitters can validate their submissions before uploading.
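As a rough illustration of how these rules combine at scoring time, here is a minimal sketch. It is not the leaderboard's actual implementation: the `Trial` fields, the 0/1 reward scale, and the function names are all hypothetical, and the judge's verdict is assumed to be already attached to each trial.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Trial:
    # All field names are hypothetical; the real submission schema may differ.
    task_id: str
    reward: float                      # assumed scale: 1.0 = pass, 0.0 = fail
    has_trajectory: bool               # was an ATIF trajectory uploaded?
    flagged_reward_hack: bool = False  # verdict from the agent judge


def missing_trajectories(trials: List[Trial]) -> List[str]:
    """Return task ids of passing trials that lack a trajectory."""
    return [t.task_id for t in trials if t.reward > 0 and not t.has_trajectory]


def rescore(trials: List[Trial]) -> List[Trial]:
    """Set the reward to 0 for passing trials flagged as reward hacking."""
    return [
        replace(t, reward=0.0) if t.reward > 0 and t.flagged_reward_hack else t
        for t in trials
    ]


def mean_reward(trials: List[Trial]) -> float:
    """Average reward over all trials (0.0 for an empty submission)."""
    return sum(t.reward for t in trials) / len(trials) if trials else 0.0
```

In this sketch a missing trajectory is reported separately rather than rescored, since the policy above states the requirement but not the penalty.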
We consider cheating to be any case where the submitter alters the benchmark in a way that gives their agent an advantage or provides task-specific information to their agent. We have static checks to detect misconfiguration; however, we also rely on the community to help us detect cheating retroactively.
We appreciate the community's help in ensuring the integrity of the leaderboard and our submitters' cooperation with these policies going forward. We will roll out the new submission process in the coming days.
The Terminal-Bench Team
*Instances of cheating
- OB-1 from OpenBlock modified timeouts in their original Terminal-Bench 1.0 submission. They claim they were not aware that modifying timeouts was cheating, and resubmitted shortly after.
- OB-1 from OpenBlock stored encrypted solutions in their agent binary. They were removed from the leaderboard and have since issued a public apology.
- Pilot from QuantFlow uploaded the tests/ folder from each task as part of their agent setup. They claim it was an accidental artifact of how they ported their CI code into their official submission. They were removed from the leaderboard and invited to resubmit.
**Instances of reward hacking
- ForgeCode's agent begins by constructing an AGENTS.md file. In multiple instances, their agent curls the solution from the internet and includes it in its AGENTS.md. We have rescored those trials to 0.