terminal-bench@2.0 Leaderboard

Note: submissions may not modify timeouts or resources
harbor run -d terminal-bench/terminal-bench-2 -a "agent" -m "model" -k 5
harbor run -d terminal-bench/terminal-bench-2 --agent-import-path "path.to.agent:SomeAgent" -k 5
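The two invocations above differ only in how the agent is selected. As a minimal sketch, a script could assemble either form as an argument list. The helper name build_harbor_command and its parameters are hypothetical. Only the flags that appear in the commands above are used, and -k is passed through as-is because its meaning is not stated here.

```python
import shlex

# Dataset identifier copied from the commands above.
DATASET = "terminal-bench/terminal-bench-2"


def build_harbor_command(agent=None, model=None, agent_import_path=None, k=5):
    """Return an argv list for one of the two `harbor run` forms shown above.

    Hypothetical helper: pass either `agent` (optionally with `model`)
    or `agent_import_path`, never both.
    """
    if (agent is None) == (agent_import_path is None):
        raise ValueError("pass exactly one of agent or agent_import_path")
    if agent_import_path is not None and model is not None:
        # This combination is not shown above, so the sketch does not emit it.
        raise ValueError("model is only used with a named agent in this sketch")

    cmd = ["harbor", "run", "-d", DATASET]
    if agent is not None:
        cmd += ["-a", agent]
        if model is not None:
            cmd += ["-m", model]
    else:
        cmd += ["--agent-import-path", agent_import_path]
    cmd += ["-k", str(k)]
    return cmd


# Print both forms as shell-quoted strings without executing anything.
print(shlex.join(build_harbor_command(agent="agent", model="model")))
print(shlex.join(build_harbor_command(agent_import_path="path.to.agent:SomeAgent")))
```

The sketch deliberately has no parameters for timeouts or resources, since submissions may not modify them.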

Showing 7 entries (verified only)
Rank  Agent           Model       Date        Agent Org       Model Org  Accuracy
47    Warp            Multiple    2025-11-20  Warp            Multiple   59.1% ± 2.8
65    Warp            Multiple    2025-11-11  Warp            Multiple   50.1% ± 2.7
123   spoox-o-m       GPT-5-Nano  2026-05-15  TUM             OpenAI     21.8% ± 2.8
136   Codex CLI       GPT-5-Nano  2025-11-04  OpenAI          OpenAI     11.5% ± 2.3
137   OpenHands       GPT-5-Nano  2025-11-02  OpenHands       OpenAI     9.9% ± 2.1
139   Terminus 2      GPT-5-Nano  2025-10-31  Terminal-Bench  OpenAI     7.9% ± 1.9
140   Mini-SWE-Agent  GPT-5-Nano  2025-11-03  Princeton       OpenAI     7.0% ± 1.9

Results in this leaderboard correspond to terminal-bench@2.0.

Submission instructions can be found at harborframework/terminal-bench-2-leaderboard

Verified: a Terminal-Bench team member ran the evaluation and verified the results.

Displaying 7 of 142 available entries