terminal-bench@2.0 Leaderboard
harbor run -d terminal-bench@2.0 -a "agent" -m "model" -k 5

harbor run -d terminal-bench@2.0 --agent-import-path "path.to.agent:SomeAgent" -k 5

Showing 143 entries
| Rank | Agent | Model | Date | Agent Org | Model Org | Accuracy | |
|---|---|---|---|---|---|---|---|
1 | vix | Claude Opus 4.7 | 2026-05-15 | vix | Anthropic | 90.2%± 2.1 | |
2 | JJAgent | Multiple | 2026-05-15 | JJ | Multiple | 87.1%± 1.3 | |
3 | NexAU-AHE | GPT-5.5 | 2026-05-14 | china-qijizhifeng | OpenAI | 84.7%± 2.1 | |
4 | LemonHarness | Multiple | 2026-05-14 | LR AILab of Lenovo CTO Org | Multiple | 84.5%± 2.6 | |
5 | Capy | GPT-5.5 | 2026-05-14 | Capy | OpenAI | 83.1%± 2.1 | |
6 | Polaris | Multiple | 2026-05-14 | PolarisOps | Multiple | 82.2%± 2.8 | |
7 | Codex CLI | GPT-5.5 | 2026-04-23 | OpenAI | OpenAI | 82.0%± 2.2 | |
8 | TongAgents | Gemini 3.1 Pro | 2026-03-13 | BIGAI | | 80.2%± 2.6 | |
9 | WOZCODE | Claude Opus 4.7 | 2026-05-14 | WOZCODE | Anthropic | 80.2%± 2.1 | |
10 | LemonHarness | Multiple | 2026-05-14 | LR AILab of Lenovo CTO Org | Multiple | 79.9%± 3.0 | |
11 | SageAgent | GPT-5.3-Codex | 2026-03-13 | OpenSage | OpenAI | 78.4%± 2.2 | |
12 | Droid | GPT-5.3-Codex | 2026-02-24 | Factory | OpenAI | 77.3%± 2.2 | |
13 | Meta-Harness | Claude Opus 4.6 | 2026-05-14 | Stanford IRIS | Anthropic | 76.4%± 2.4 | |
14 | CodeBrain-1.5 | GPT-5.3-Codex | 2026-02-10 | Feeling AI | OpenAI | 75.8%± 2.0 | |
15 | Codelia | GPT-5.3-Codex | 2026-05-14 | kousw | OpenAI | 75.7%± 2.2 | |
16 | Capy | Claude Opus 4.6 | 2026-03-12 | Capy | Anthropic | 75.3%± 2.4 | |
17 | Simple Codex | GPT-5.3-Codex | 2026-02-06 | OpenAI | OpenAI | 75.1%± 2.4 | |
18 | Terminus-KIRA | Gemini 3.1 Pro | 2026-02-23 | KRAFTON AI | | 74.8%± 2.6 | |
19 | Terminus-KIRA | Claude Opus 4.6 | 2026-02-22 | KRAFTON AI | Anthropic | 74.7%± 2.6 | |
20 | Mux | GPT-5.3-Codex | 2026-03-06 | Coder | OpenAI | 74.6%± 2.5 | |
21 | MAYA-V2 | Claude 4.6 Opus | 2026-03-12 | ADYA | Anthropic | 72.1%± 2.2 | |
22 | TongAgents | Claude Opus 4.6 | 2026-02-22 | BIGAI | Anthropic | 71.9%± 2.7 | |
23 | spoox-o-m | GPT-5.3-Codex | 2026-05-15 | TUM | OpenAI | 71.5%± 2.5 | |
24 | Junie CLI | Multiple | 2026-03-07 | JetBrains | Multiple | 71.0%± 2.9 | |
25 | Droid | Claude Opus 4.6 | 2026-02-05 | Factory | Anthropic | 69.9%± 2.5 | |
26 | Ante | Gemini 3 Pro | 2026-01-06 | Antigma Labs | | 69.4%± 2.1 | |
27 | IndusAGI Coding Agent | GPT-5.3-Codex | 2026-03-18 | Varun Israni (SoloVpx) | OpenAI | 69.1%± 2.3 | |
28 | Crux | Claude Opus 4.6 | 2026-02-23 | Roam | Anthropic | 66.9%± N/A | |
29 | Deep Agents | GPT-5.2-Codex | 2026-02-12 | LangChain | OpenAI | 66.5%± 3.1 | |
30 | Mux | Claude Opus 4.6 | 2026-02-13 | Coder | Anthropic | 66.5%± 2.5 | |
31 | clnkr | GPT-5.5 | 2026-05-14 | clnkr | OpenAI | 66.2%± 2.5 | |
32 | SageAgent | Gemini 3 Pro | 2026-02-23 | OpenSage | | 65.2%± 2.1 | |
33 | Droid | GPT-5.2 | 2025-12-24 | Factory | OpenAI | 64.9%± 2.8 | |
34 | Terminus 2 | GPT-5.3-Codex | 2026-02-05 | Terminal-Bench | OpenAI | 64.7%± 2.7 | |
35 | Junie CLI | Gemini 3 Flash | 2025-12-23 | JetBrains | | 64.3%± 2.8 | |
36 | Droid | Claude Opus 4.5 | 2025-12-11 | Factory | Anthropic | 63.1%± 2.7 | |
37 | Codex CLI | GPT-5.2 | 2025-12-18 | OpenAI | OpenAI | 62.9%± 3.0 | |
38 | Terminus 2 | Claude Opus 4.6 | 2026-02-06 | Terminal-Bench | Anthropic | 62.9%± 2.7 | |
39 | CodeBrain-1.5 | Gemini 3 Pro | 2026-02-05 | Feeling AI | | 62.2%± 2.6 | |
40 | II-Agent | Gemini 3 Pro | 2025-12-23 | Intelligent Internet | | 61.8%± 2.8 | |
41 | hookele | GPT-5.1-Codex-Mini | 2026-05-14 | Dmitry Barakhov | OpenAI | 61.6%± 1.9 | |
42 | Warp | Multiple | 2025-12-12 | Warp | Multiple | 61.2%± 3.0 | |
43 | Droid | Gemini 3 Pro | 2025-12-24 | Factory | | 61.1%± 2.8 | |
44 | Mux | GPT-5.2 | 2026-01-17 | Coder | OpenAI | 60.7%± N/A | |
45 | Codex CLI | GPT-5.1-Codex-Max | 2025-11-24 | OpenAI | OpenAI | 60.4%± 2.7 | |
46 | Gemini CLI | Gemini 3.1 Pro | 2026-05-14 | | | 60.1%± N/A | |
47 | Letta Code | Claude Opus 4.5 | 2025-12-17 | Letta | Anthropic | 59.1%± 2.4 | |
48 | Warp | Multiple | 2025-11-20 | Warp | Multiple | 59.1%± 2.8 | |
49 | Abacus AI Desktop | Multiple | 2025-12-11 | Abacus.AI | Multiple | 58.4%± 2.8 | |
50 | Mux | Claude Opus 4.5 | 2026-01-17 | Coder | Anthropic | 58.4%± N/A | |
51 | Claude Code | Claude Opus 4.6 | 2026-02-07 | Anthropic | Anthropic | 58.0%± 2.9 | |
52 | Terminus 2 | Claude Opus 4.5 | 2025-11-22 | Terminal-Bench | Anthropic | 57.8%± 2.5 | |
53 | Crux | GPT-5.1-Codex | 2025-11-16 | Roam | OpenAI | 57.8%± 2.9 | |
54 | Grok CLI | Grok 4.20 Reasoning | 2026-04-02 | Superagent | xAI | 57.3%± N/A | |
55 | Terminus 2 | Gemini 3 Pro | 2025-11-21 | Terminal-Bench | | 56.9%± 2.5 | |
56 | Letta Code | Gemini 3 Pro | 2025-12-17 | Letta | | 56.0%± 3.0 | |
57 | Goose | Claude Opus 4.5 | 2025-12-11 | Block | Anthropic | 54.3%± 2.6 | |
58 | Terminus 2 | GPT-5.2 | 2025-12-12 | Terminal-Bench | OpenAI | 54.0%± 2.9 | |
59 | Letta Code | GPT-5.1-Codex | 2025-12-17 | Letta | OpenAI | 53.5%± 2.8 | |
60 | Simplai Agent | Claude Sonnet 4.6 | 2026-05-14 | SimplAI | Anthropic | 53.4%± 2.8 | |
61 | Terminus 2 | GLM 5 | 2026-02-23 | Terminal-Bench | Z-AI | 52.4%± 2.6 | |
62 | Claude Code | Claude Opus 4.5 | 2025-12-18 | Anthropic | Anthropic | 52.1%± 2.5 | |
63 | OpenHands | Claude Opus 4.5 | 2026-01-04 | OpenHands | Anthropic | 51.9%± 2.9 | |
64 | Terminus 2 | Gemini 3 Flash | 2026-01-07 | Terminal-Bench | | 51.7%± 3.1 | |
65 | OpenCode | Claude Opus 4.5 | 2026-01-12 | Anomaly Innovations | Anthropic | 51.7%± N/A | |
66 | Warp | Multiple | 2025-11-11 | Warp | Multiple | 50.1%± 2.7 | |
67 | Codex CLI | GPT-5 | 2025-11-04 | OpenAI | OpenAI | 49.6%± 2.9 | |
68 | Terminus 2 | GPT-5.1 | 2025-11-16 | Terminal-Bench | OpenAI | 47.6%± 2.8 | |
69 | Gemini CLI | Gemini 3 Flash | 2026-03-06 | | | 47.4%± 3.0 | |
70 | CAMEL-AI | Claude Sonnet 4.5 | 2025-12-24 | CAMEL-AI | Anthropic | 46.5%± 2.4 | |
71 | IndusAGI Coding Agent | MiniMax M2.7 | 2026-05-14 | Varun Israni (SoloVpx) | Minimax | 45.1%± N/A | |
72 | Codex CLI | GPT-5-Codex | 2025-11-04 | OpenAI | OpenAI | 44.3%± 2.7 | |
73 | OpenHands | GPT-5 | 2025-11-02 | OpenHands | OpenAI | 43.8%± 3.0 | |
74 | Harness Agent | MiniMax M2.7 Highspeed | 2026-05-14 | lazyFrogLOL | MiniMax | 43.8%± 2.9 | |
75 | Terminus 2 | GPT-5-Codex | 2025-10-31 | Terminal-Bench | OpenAI | 43.4%± 2.9 | |
76 | Terminus 2 | Kimi K2.5 | 2026-02-04 | Terminal-Bench | Kimi | 43.2%± 2.9 | |
77 | Goose | Claude Sonnet 4.5 | 2025-12-11 | Block | Anthropic | 43.1%± 2.6 | |
78 | Crux | GPT-5.1-Codex-Mini | 2025-11-17 | Roam | OpenAI | 43.1%± 3.0 | |
79 | Terminus 2 | Claude Sonnet 4.5 | 2025-10-31 | Terminal-Bench | Anthropic | 42.8%± 2.8 | |
80 | MAYA-V2 | Claude 4.5 Sonnet | 2026-01-04 | ADYA | Anthropic | 42.7%± N/A | |
81 | cchuter | minimax-m2.5 | 2026-03-30 | teamblobfish.com | minimax | 42.7%± 2.8 | |
82 | OpenHands | Claude Sonnet 4.5 | 2025-11-02 | OpenHands | Anthropic | 42.6%± 2.8 | |
83 | Mini-SWE-Agent | Claude Sonnet 4.5 | 2025-11-03 | Princeton | Anthropic | 42.5%± 2.8 | |
84 | Terminus 2 | Minimax m2.5 | 2026-02-23 | Terminal-Bench | Minimax | 42.2%± 2.6 | |
85 | Mini-SWE-Agent | GPT-5-Codex | 2025-11-03 | Princeton | OpenAI | 41.3%± 2.8 | |
86 | Claude Code | Claude Sonnet 4.5 | 2025-11-04 | Anthropic | Anthropic | 40.1%± 2.9 | |
87 | Terminus 2 | DeepSeek-V3.2 | 2026-02-10 | Terminal-Bench | DeepSeek | 39.6%± 2.8 | |
88 | Terminus 2 | Claude Opus 4.1 | 2025-10-31 | Terminal-Bench | Anthropic | 38.0%± 2.6 | |
89 | OpenHands | Claude Opus 4.1 | 2025-11-02 | OpenHands | Anthropic | 36.9%± 2.7 | |
90 | Terminus 2 | GPT-5.1-Codex | 2025-11-17 | Terminal-Bench | OpenAI | 36.9%± 3.2 | |
91 | Crux | MiniMax M2.1 | 2025-12-22 | Roam | MiniMax | 36.6%± 2.9 | |
92 | Terminus 2 | Kimi K2 Thinking | 2025-11-11 | Terminal-Bench | Moonshot AI | 35.7%± 2.8 | |
93 | Goose | Claude Haiku 4.5 | 2025-12-11 | Block | Anthropic | 35.5%± 2.9 | |
94 | Terminus 2 | GPT-5 | 2025-10-31 | Terminal-Bench | OpenAI | 35.2%± 3.1 | |
95 | Mini-SWE-Agent | Claude Opus 4.1 | 2025-11-03 | Princeton | Anthropic | 35.1%± 2.5 | |
96 | spoox-o-m | GPT-5-Mini | 2025-12-24 | TUM | OpenAI | 34.8%± 2.7 | |
97 | Claude Code | Claude Opus 4.1 | 2025-11-04 | Anthropic | Anthropic | 34.8%± 2.9 | |
98 | Mini-SWE-Agent | GPT-5 | 2025-11-03 | Princeton | OpenAI | 33.9%± 2.9 | |
99 | Terminus 2 | GLM 4.7 | 2026-01-28 | Terminal-Bench | Z-AI | 33.4%± 2.8 | |
100 | Crux | GLM 4.7 | 2026-02-08 | Roam | Z-AI | 33.3%± 2.5 | |
101 | Terminus 2 | Gemini 2.5 Pro | 2025-10-31 | Terminal-Bench | | 32.6%± 3.0 | |
102 | Codex CLI | GPT-5-Mini | 2025-11-04 | OpenAI | OpenAI | 31.9%± 3.0 | |
103 | Terminus 2 | MiniMax M2 | 2025-11-01 | Terminal-Bench | MiniMax | 30.0%± 2.7 | |
104 | Mini-SWE-Agent | Claude Haiku 4.5 | 2025-11-03 | Princeton | Anthropic | 29.8%± 2.5 | |
105 | OpenHands | GPT-5-Mini | 2025-11-02 | OpenHands | OpenAI | 29.2%± 2.8 | |
106 | Terminus 2 | MiniMax M2.1 | 2025-12-23 | Terminal-Bench | MiniMax | 29.2%± 2.9 | |
107 | Terminus 2 | Claude Haiku 4.5 | 2025-10-31 | Terminal-Bench | Anthropic | 28.3%± 2.9 | |
108 | Terminus 2 | Kimi K2 Instruct | 2025-11-01 | Terminal-Bench | Moonshot AI | 27.8%± 2.5 | |
109 | Claude Code | Claude Haiku 4.5 | 2025-11-04 | Anthropic | Anthropic | 27.5%± 2.8 | |
110 | Dakou Agent | Qwen 3 Coder 480B | 2025-12-28 | iflow | Alibaba | 27.2%± 2.6 | |
111 | OpenHands | Grok 4 | 2025-11-02 | OpenHands | xAI | 27.2%± 3.1 | |
112 | OpenHands | Kimi K2 Instruct | 2025-11-02 | OpenHands | Moonshot AI | 26.7%± 2.7 | |
113 | Mini-SWE-Agent | Gemini 2.5 Pro | 2025-11-03 | Princeton | | 26.1%± 2.5 | |
114 | Mini-SWE-Agent | Grok Code Fast 1 | 2025-11-03 | Princeton | xAI | 25.8%± 2.6 | |
115 | Mini-SWE-Agent | Grok 4 | 2025-11-03 | Princeton | xAI | 25.4%± 2.9 | |
116 | OpenHands | Qwen 3 Coder 480B | 2025-11-02 | OpenHands | Alibaba | 25.4%± 2.6 | |
117 | little-coder | Qwen3.6-35B-A3B | 2026-05-14 | Itay Inbar | Qwen | 24.6%± 3.2 | |
118 | Terminus 2 | GLM 4.6 | 2025-11-01 | Terminal-Bench | Z.ai | 24.5%± 2.4 | |
119 | Terminus 2 | GPT-5-Mini | 2025-10-31 | Terminal-Bench | OpenAI | 24.0%± 2.5 | |
120 | Terminus 2 | Qwen 3 Coder 480B | 2025-11-01 | Terminal-Bench | Alibaba | 23.9%± 2.8 | |
121 | Terminus 2 | Grok 4 | 2025-10-31 | Terminal-Bench | xAI | 23.1%± 2.9 | |
122 | little-coder | Qwen3.6-35B-A3B | 2026-05-14 | Itay Inbar | Qwen | 23.0%± N/A | |
123 | Mini-SWE-Agent | GPT-5-Mini | 2025-11-03 | Princeton | OpenAI | 22.2%± 2.6 | |
124 | spoox-o-m | GPT-5-Nano | 2026-05-15 | TUM | OpenAI | 21.8%± 2.8 | |
125 | Gemini CLI | Gemini 2.5 Pro | 2025-11-04 | | | 19.6%± 2.9 | |
126 | Bash Agent | TermiGen-32B | 2026-05-14 | UCSB-SURFI | Qwen | 19.3%± 2.0 | |
127 | Terminus 2 | GPT-OSS-120B | 2025-11-01 | Terminal-Bench | OpenAI | 18.7%± 2.7 | |
128 | Mini-SWE-Agent | Gemini 2.5 Flash | 2025-11-03 | Princeton | | 17.1%± 2.5 | |
129 | Terminus 2 | AfterQuery-GPT-OSS-20B | 2026-03-31 | Terminal-Bench | AfterQuery | 17.0%± 2.5 | |
130 | Terminus 2 | Gemini 2.5 Flash | 2025-10-31 | Terminal-Bench | | 16.9%± 2.4 | |
131 | OpenHands | Gemini 2.5 Pro | 2025-11-02 | OpenHands | | 16.4%± 2.8 | |
132 | OpenHands | Gemini 2.5 Flash | 2025-11-02 | OpenHands | | 16.4%± 2.4 | |
133 | Gemini CLI | Gemini 2.5 Flash | 2025-11-04 | | | 15.4%± 2.3 | |
134 | Mini-SWE-Agent | GPT-OSS-120B | 2025-11-03 | Princeton | OpenAI | 14.2%± 2.3 | |
135 | Terminus 2 | Grok Code Fast 1 | 2025-10-31 | Terminal-Bench | xAI | 14.2%± 2.5 | |
136 | OpenHands | Claude Haiku 4.5 | 2025-11-02 | OpenHands | Anthropic | 13.9%± 2.7 | |
137 | Codex CLI | GPT-5-Nano | 2025-11-04 | OpenAI | OpenAI | 11.5%± 2.3 | |
138 | OpenHands | GPT-5-Nano | 2025-11-02 | OpenHands | OpenAI | 9.9%± 2.1 | |
139 | little-coder | Qwen3.5-9B | 2026-05-14 | Itay Inbar | Qwen | 9.2%± 2.4 | |
140 | Terminus 2 | GPT-5-Nano | 2025-10-31 | Terminal-Bench | OpenAI | 7.9%± 1.9 | |
141 | Mini-SWE-Agent | GPT-5-Nano | 2025-11-03 | Princeton | OpenAI | 7.0%± 1.9 | |
142 | Mini-SWE-Agent | GPT-OSS-20B | 2025-11-03 | Princeton | OpenAI | 3.4%± 1.4 | |
143 | Terminus 2 | GPT-OSS-20B | 2025-11-01 | Terminal-Bench | OpenAI | 3.1%± 1.5 |
Results in this leaderboard correspond to terminal-bench@2.0.
Submission instructions can be found at harborframework/terminal-bench-2-leaderboard
A Terminal-Bench team member ran the evaluation and verified the results.
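
When comparing neighbouring entries, it can help to check whether their reported ranges overlap. The following is a minimal Python sketch, not part of the leaderboard or Harbor tooling: the names `Entry`, `parse_row`, and `ranges_overlap` are hypothetical, and it assumes the value after `±` is a symmetric margin around the accuracy (the page does not state what the margin represents). It locates the accuracy cell by pattern rather than by column position, so it also tolerates rows whose organization cells are empty.

```python
import re
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    rank: int
    agent: str
    model: str
    accuracy: float          # percent, e.g. 90.2
    margin: Optional[float]  # None when the table shows "N/A"


# Matches cells such as "90.2%± 2.1" or "66.9%± N/A".
ACCURACY = re.compile(r"(\d+(?:\.\d+)?)%\s*±\s*(N/A|\d+(?:\.\d+)?)")


def parse_row(line: str) -> Entry:
    """Parse one pipe-delimited leaderboard row into an Entry."""
    cells = [c.strip() for c in line.strip().lstrip("|").split("|")]
    match = ACCURACY.search(line)
    if match is None:
        raise ValueError(f"no accuracy cell found in row: {line!r}")
    margin = None if match.group(2) == "N/A" else float(match.group(2))
    return Entry(int(cells[0]), cells[1], cells[2], float(match.group(1)), margin)


def ranges_overlap(a: Entry, b: Entry) -> Optional[bool]:
    """Return True if the two ± ranges overlap, None if either margin is N/A.

    Assumption: the margin is symmetric around the accuracy.
    """
    if a.margin is None or b.margin is None:
        return None
    return abs(a.accuracy - b.accuracy) <= a.margin + b.margin


first = parse_row("1 | vix | Claude Opus 4.7 | 2026-05-15 | vix | Anthropic | 90.2%± 2.1 | |")
second = parse_row("2 | JJAgent | Multiple | 2026-05-15 | JJ | Multiple | 87.1%± 1.3 | |")
third = parse_row("3 | NexAU-AHE | GPT-5.5 | 2026-05-14 | china-qijizhifeng | OpenAI | 84.7%± 2.1 | |")

print(ranges_overlap(first, second))  # ranks 1 and 2
print(ranges_overlap(first, third))   # ranks 1 and 3
```

Under that assumption, an overlap only indicates that the ordering of two entries is less certain; it is not a statistical test.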