terminal-bench@2.0 Leaderboard

Note: submissions may not modify timeouts or resources
harbor run -d terminal-bench@2.0 -a "agent" -m "model" -k 5
harbor run -d terminal-bench@2.0 --agent-import-path "path.to.agent:SomeAgent" -k 5

Showing 13 entries (verified only)
| Rank | Agent | Model | Date | Agent Org | Model Org | Accuracy |
|------|-------|-------|------|-----------|-----------|----------|
| 1 | vix | Claude Opus 4.7 | 2026-05-15 | vix | Anthropic | 90.2% ± 2.1 |
| 6 | Polaris | Multiple | 2026-05-14 | PolarisOps | Multiple | 82.2% ± 2.8 |
| 9 | WOZCODE | Claude Opus 4.7 | 2026-05-14 | WOZCODE | Anthropic | 80.2% ± 2.1 |
| 13 | Meta-Harness | Claude Opus 4.6 | 2026-05-14 | Stanford IRIS | Anthropic | 76.4% ± 2.4 |
| 16 | Capy | Claude Opus 4.6 | 2026-03-12 | Capy | Anthropic | 75.3% ± 2.4 |
| 19 | Terminus-KIRA | Claude Opus 4.6 | 2026-02-22 | KRAFTON AI | Anthropic | 74.7% ± 2.6 |
| 22 | TongAgents | Claude Opus 4.6 | 2026-02-22 | Bigai | Anthropic | 71.9% ± 2.7 |
| 24 | Junie CLI | Multiple | 2026-03-07 | JetBrains | Multiple | 71.0% ± 2.9 |
| 25 | Droid | Claude Opus 4.6 | 2026-02-05 | Factory | Anthropic | 69.9% ± 2.5 |
| 28 | Crux | Claude Opus 4.6 | 2026-02-23 | Roam | Anthropic | 66.9% ± N/A |
| 30 | Mux | Claude Opus 4.6 | 2026-02-13 | Coder | Anthropic | 66.5% ± 2.5 |
| 38 | Terminus 2 | Claude Opus 4.6 | 2026-02-06 | Terminal-Bench | Anthropic | 62.9% ± 2.7 |
| 51 | Claude Code | Claude Opus 4.6 | 2026-02-07 | Anthropic | Anthropic | 58.0% ± 2.9 |
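When comparing neighbouring entries, it can help to check whether their reported ranges overlap. The sketch below is illustrative only: it assumes the "±" value is a symmetric margin in percentage points around the accuracy (the page does not say how it is computed), and the names `Entry`, `interval`, and `overlaps` are hypothetical helpers, not part of harbor or Terminal-Bench.

```python
from typing import NamedTuple, Optional, Tuple


class Entry(NamedTuple):
    agent: str
    accuracy: float           # accuracy in percent, as listed
    margin: Optional[float]   # the "±" value; None when listed as N/A


def interval(e: Entry) -> Optional[Tuple[float, float]]:
    """Return (low, high), ASSUMING the margin is symmetric around the accuracy."""
    if e.margin is None:
        return None
    return (e.accuracy - e.margin, e.accuracy + e.margin)


def overlaps(a: Entry, b: Entry) -> Optional[bool]:
    """True if the two ranges intersect; None if either margin is unavailable."""
    ia, ib = interval(a), interval(b)
    if ia is None or ib is None:
        return None
    return ia[0] <= ib[1] and ib[0] <= ia[1]


# Values copied from the leaderboard rows above.
vix = Entry("vix", 90.2, 2.1)
polaris = Entry("Polaris", 82.2, 2.8)
capy = Entry("Capy", 75.3, 2.4)
kira = Entry("Terminus-KIRA", 74.7, 2.6)
crux = Entry("Crux", 66.9, None)

print(overlaps(vix, polaris))  # False: the ranges do not intersect
print(overlaps(capy, kira))    # True: the ranges intersect
print(overlaps(crux, capy))    # None: Crux has no reported margin
```

Under that assumption, overlapping ranges suggest the ordering between two entries may not be meaningful, while an entry with "N/A" cannot be compared this way at all.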

Results in this leaderboard correspond to terminal-bench@2.0.

Submission instructions can be found at harborframework/terminal-bench-2-leaderboard

Verified: a Terminal-Bench team member ran the evaluation and verified the results.

Displaying 13 of 143 available entries