terminal-bench@2.0 Leaderboard

Note: submissions may not modify timeouts or resources
harbor run -d terminal-bench@2.0 -a "agent" -m "model" -k 5
harbor run -d terminal-bench@2.0 --agent-import-path "path.to.agent:SomeAgent" -k 5


Rank  Agent          Model            Date        Agent Org   Model Org  Accuracy
1     ForgeCode      Claude Opus 4.6  2026-03-12  ForgeCode   Anthropic  81.8% ± 1.7
7     Capy           Claude Opus 4.6  2026-03-12  Capy        Anthropic  75.3% ± 2.4
10    Terminus-KIRA  Claude Opus 4.6  2026-02-22  KRAFTON AI  Anthropic  74.7% ± 2.6
13    TongAgents     Claude Opus 4.6  2026-02-22  Bigai       Anthropic  71.9% ± 2.7
14    Junie CLI      Multiple         2026-03-07  JetBrains   Multiple   71.0% ± 2.9
16    Droid          Claude Opus 4.6  2026-02-05  Factory     Anthropic  69.9% ± 2.5
19    Crux           Claude Opus 4.6  2026-02-23  Roam        Anthropic  66.9% ± N/A
20    Mux            Claude Opus 4.6  2026-02-13  Coder       Anthropic  66.5% ± 2.5
28    Terminus 2     Claude Opus 4.6  2026-02-06  AfterQuery  Anthropic  62.9% ± 2.7
39    Claude Code    Claude Opus 4.6  2026-02-07  Anthropic   Anthropic  58.0% ± 2.9
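
When comparing nearby entries, it can help to check whether their reported ranges overlap. The sketch below does this for a few of the entries listed above. It assumes the ± value can be read as a symmetric margin around the accuracy; the leaderboard does not say whether it is a standard error or a confidence interval, so treat the result as a rough guide only. The names `entries`, `interval`, and `intervals_overlap` are illustrative and not part of any Harbor or Terminal-Bench tooling.

```python
# Accuracy (%) and reported ± margin, copied from the entries above.
# None marks the entry whose margin is listed as N/A.
entries = {
    "ForgeCode": (81.8, 1.7),
    "Capy": (75.3, 2.4),
    "Terminus-KIRA": (74.7, 2.6),
    "Crux": (66.9, None),
}


def interval(name):
    """Return (low, high) for an entry, or None if no margin was reported."""
    accuracy, margin = entries[name]
    if margin is None:
        return None
    return (accuracy - margin, accuracy + margin)


def intervals_overlap(a, b):
    """Return True/False if both ranges are known, else None.

    Assumption: the ± value is a symmetric margin around the accuracy.
    """
    ia, ib = interval(a), interval(b)
    if ia is None or ib is None:
        return None
    return ia[0] <= ib[1] and ib[0] <= ia[1]


if __name__ == "__main__":
    for a, b in [("ForgeCode", "Capy"), ("Capy", "Terminus-KIRA"), ("Capy", "Crux")]:
        print(a, "vs", b, "->", intervals_overlap(a, b))
```

Under that reading, ForgeCode and Capy have separate ranges, while Capy and Terminus-KIRA overlap, so their ordering is less certain. Any comparison involving Crux returns `None` because no margin is reported for it.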

Results in this leaderboard correspond to terminal-bench@2.0.

Submission instructions can be found at harborframework/terminal-bench-2-leaderboard

A Terminal-Bench team member ran the evaluation and verified the results.

Displaying 10 of 122 available entries