Sun Apr 19 2026 · News

Leaderboard Integrity Update

New policies to address cheating and reward hacking on the Terminal-Bench leaderboard.

The Terminal-Bench community has identified multiple instances of cheating* and reward hacking**. Given these occurrences, we are taking steps to make the Terminal-Bench 2.0 leaderboard more reliable.

  • ATIF trajectories are required for all passing trials.
  • Reward hacking (e.g. finding solutions on the internet) will result in a reward of 0 for the trial.
  • Cheating will result in a submission being taken down immediately. Our team will determine whether the organization can resubmit on a case-by-case basis.

Reward hacking occurs when a model exploits a loophole to resolve a task without demonstrating the capability the task was intended to measure. Most often, the loophole is that agents can access solutions on the internet, an unfortunate possibility for a publicly available, open-internet benchmark.

To detect reward hacking, we will run an agent judge over all passing trials in a submission. Submitters can challenge reward-hacking claims made against their trials. We will open-source our judge so submitters can validate their submissions before uploading.
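As a rough illustration of the rescoring policy described above, the sketch below runs a judge over passing trials only and sets the reward of any flagged trial to 0. Everything in it is an assumption made for illustration: the `Trial` fields, the `rescore` helper, and the keyword-based `toy_judge` are hypothetical and do not reflect the actual submission schema or the agent judge, which has not been released yet.

```python
from dataclasses import dataclass, replace
from typing import Callable, Iterable


@dataclass(frozen=True)
class Trial:
    # Hypothetical fields, not the real submission schema.
    task_id: str
    reward: float      # 1.0 for a passing trial, 0.0 otherwise
    trajectory: str    # stand-in for the trial's recorded trajectory


def rescore(trials: Iterable[Trial], judge: Callable[[Trial], bool]) -> list[Trial]:
    """Run `judge` over passing trials only; flagged trials get a reward of 0."""
    rescored = []
    for trial in trials:
        if trial.reward > 0 and judge(trial):
            # Return a modified copy so the original records stay untouched.
            trial = replace(trial, reward=0.0)
        rescored.append(trial)
    return rescored


def toy_judge(trial: Trial) -> bool:
    """Placeholder only: the real judge is an agent, not a keyword match."""
    return "curl" in trial.trajectory and "solution" in trial.trajectory
```

Failing trials are never passed to the judge in this sketch, which matches the stated scope of "all passing trials"; how challenges are handled is left out because the post does not describe that process.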

We consider cheating to be any case where the submitter alters the benchmark in a way that gives their agent an advantage or provides task-specific information to their agent. We have static checks to detect misconfiguration; however, we also rely on the community to help us detect cheating retroactively.

We appreciate the community's help in ensuring the integrity of the leaderboard and our submitters' cooperation with these policies going forward. We will roll out the new submission process in the coming days.

The Terminal-Bench Team

*Instances of cheating
**Instances of reward hacking
  • ForgeCode's agent begins by constructing an AGENTS.md file. In multiple instances, their agent curls the solution from the internet and includes it in its AGENTS.md. We have rescored those trials to 0.
Special thanks to Adam Stein and Davis Brown for detecting Pilot's cheating and ForgeCode's occasional reward hacking, and to the Ante team for detecting OpenBlock's cheating.