terminal-bench@2.0 Leaderboard
```shell
harbor run -d terminal-bench@2.0 -a "<agent-name>" -m "<model-name>" -k 5
```
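As an illustration, the placeholders can be filled with an agent/model pair from the table below. Note that the identifier strings in this sketch (`terminus-2`, `claude-sonnet-4.5`) are assumed spellings, not confirmed values; check the harness documentation for the names it actually accepts:

```shell
# Hypothetical invocation: "terminus-2" and "claude-sonnet-4.5" are assumed
# identifier spellings for the Terminus 2 agent and Claude Sonnet 4.5 model.
harbor run -d terminal-bench@2.0 -a "terminus-2" -m "claude-sonnet-4.5" -k 5
```

The `-k 5` flag matches the command shown above, i.e. the same configuration used for the leaderboard runs.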
| Rank | Agent | Model | Date | Agent Org | Model Org | Accuracy |
|---|---|---|---|---|---|---|
| 1 | Codex CLI | Unknown | 2025-11-16 | OpenAI | Unknown | 60.5% ± N/A |
| 2 | Warp | Multiple | 2025-11-11 | Warp | Multiple | 50.1% ± 2.7 |
| 3 | Codex CLI | GPT-5 | 2025-11-04 | OpenAI | OpenAI | 49.6% ± 2.9 |
| 4 | Codex CLI | GPT-5-Codex | 2025-11-04 | OpenAI | OpenAI | 44.3% ± 2.7 |
| 5 | OpenHands | GPT-5 | 2025-11-02 | OpenHands | OpenAI | 43.8% ± 3.0 |
| 6 | Terminus 2 | GPT-5-Codex | 2025-10-31 | Stanford | OpenAI | 43.4% ± 2.9 |
| 7 | Terminus 2 | Claude Sonnet 4.5 | 2025-10-31 | Stanford | Anthropic | 42.8% ± 2.8 |
| 8 | OpenHands | Claude Sonnet 4.5 | 2025-11-02 | OpenHands | Anthropic | 42.6% ± 2.8 |
| 9 | Mini-SWE-Agent | Claude Sonnet 4.5 | 2025-11-03 | Princeton | Anthropic | 42.5% ± 2.8 |
| 10 | Terminus 2 | Unknown | 2025-11-16 | Stanford | Unknown | 41.8% ± 2.9 |
| 11 | Mini-SWE-Agent | GPT-5-Codex | 2025-11-03 | Princeton | OpenAI | 41.3% ± 2.8 |
| 12 | Claude Code | Claude Sonnet 4.5 | 2025-11-04 | Anthropic | Anthropic | 40.1% ± 2.9 |
| 13 | Terminus 2 | Claude Opus 4.1 | 2025-10-31 | Stanford | Anthropic | 38.0% ± 2.6 |
| 14 | OpenHands | Claude Opus 4.1 | 2025-11-02 | OpenHands | Anthropic | 36.9% ± 2.7 |
| 15 | Terminus 2 | Kimi K2 Thinking | 2025-11-11 | Stanford | Moonshot AI | 35.7% ± 2.8 |
| 16 | Terminus 2 | GPT-5 | 2025-10-31 | Stanford | OpenAI | 35.2% ± 3.1 |
| 17 | Mini-SWE-Agent | Claude Opus 4.1 | 2025-11-03 | Princeton | Anthropic | 35.1% ± 2.5 |
| 18 | Claude Code | Claude Opus 4.1 | 2025-11-04 | Anthropic | Anthropic | 34.8% ± 2.9 |
| 19 | Mini-SWE-Agent | GPT-5 | 2025-11-03 | Princeton | OpenAI | 33.9% ± 2.9 |
| 20 | Terminus 2 | Gemini 2.5 Pro | 2025-10-31 | Stanford | Google | 32.6% ± 3.0 |
| 21 | Codex CLI | GPT-5-Mini | 2025-11-04 | OpenAI | OpenAI | 31.9% ± 3.0 |
| 22 | Terminus 2 | MiniMax M2 | 2025-11-01 | Stanford | MiniMax | 30.0% ± 2.7 |
| 23 | Mini-SWE-Agent | Claude Haiku 4.5 | 2025-11-03 | Princeton | Anthropic | 29.8% ± 2.5 |
| 24 | OpenHands | GPT-5-Mini | 2025-11-02 | OpenHands | OpenAI | 29.2% ± 2.8 |
| 25 | Terminus 2 | Claude Haiku 4.5 | 2025-10-31 | Stanford | Anthropic | 28.3% ± 2.9 |
| 26 | Terminus 2 | Kimi K2 Instruct | 2025-11-01 | Stanford | Moonshot AI | 27.8% ± 2.5 |
| 27 | Claude Code | Claude Haiku 4.5 | 2025-11-04 | Anthropic | Anthropic | 27.5% ± 2.8 |
| 28 | OpenHands | Grok 4 | 2025-11-02 | OpenHands | xAI | 27.2% ± 3.1 |
| 29 | OpenHands | Kimi K2 Instruct | 2025-11-02 | OpenHands | Moonshot AI | 26.7% ± 2.7 |
| 30 | Mini-SWE-Agent | Gemini 2.5 Pro | 2025-11-03 | Princeton | Google | 26.1% ± 2.5 |
| 31 | Mini-SWE-Agent | Grok Code Fast 1 | 2025-11-03 | Princeton | xAI | 25.8% ± 2.6 |
| 32 | Mini-SWE-Agent | Grok 4 | 2025-11-03 | Princeton | xAI | 25.4% ± 2.9 |
| 33 | OpenHands | Qwen 3 Coder 480B | 2025-11-02 | OpenHands | Alibaba | 25.4% ± 2.6 |
| 34 | Terminus 2 | Unknown | 2025-11-13 | Stanford | Unknown | 24.7% ± 2.5 |
| 35 | Terminus 2 | GLM 4.6 | 2025-11-01 | Stanford | Z.ai | 24.5% ± 2.4 |
| 36 | Terminus 2 | GPT-5-Mini | 2025-10-31 | Stanford | OpenAI | 24.0% ± 2.5 |
| 37 | Terminus 2 | Qwen 3 Coder 480B | 2025-11-01 | Stanford | Alibaba | 23.9% ± 2.8 |
| 38 | Terminus 2 | Grok 4 | 2025-10-31 | Stanford | xAI | 23.1% ± 2.9 |
| 39 | Mini-SWE-Agent | GPT-5-Mini | 2025-11-03 | Princeton | OpenAI | 22.2% ± 2.6 |
| 40 | Gemini CLI | Gemini 2.5 Pro | 2025-11-04 | Google | Google | 19.6% ± 2.9 |
| 41 | Terminus 2 | GPT-OSS-120B | 2025-11-01 | Stanford | OpenAI | 18.7% ± 2.7 |
| 42 | Mini-SWE-Agent | Gemini 2.5 Flash | 2025-11-03 | Princeton | Google | 17.1% ± 2.5 |
| 43 | Terminus 2 | Gemini 2.5 Flash | 2025-10-31 | Stanford | Google | 16.9% ± 2.4 |
| 44 | OpenHands | Gemini 2.5 Pro | 2025-11-02 | OpenHands | Google | 16.4% ± 2.8 |
| 45 | OpenHands | Gemini 2.5 Flash | 2025-11-02 | OpenHands | Google | 16.4% ± 2.4 |
| 46 | Gemini CLI | Gemini 2.5 Flash | 2025-11-04 | Google | Google | 15.4% ± 2.3 |
| 47 | Terminus 2 | Grok Code Fast 1 | 2025-10-31 | Stanford | xAI | 14.2% ± 2.5 |
| 48 | Mini-SWE-Agent | GPT-OSS-120B | 2025-11-03 | Princeton | OpenAI | 14.2% ± 2.3 |
| 49 | OpenHands | Claude Haiku 4.5 | 2025-11-02 | OpenHands | Anthropic | 13.9% ± 2.7 |
| 50 | Codex CLI | GPT-5-Nano | 2025-11-04 | OpenAI | OpenAI | 11.5% ± 2.3 |
| 51 | OpenHands | GPT-5-Nano | 2025-11-02 | OpenHands | OpenAI | 9.9% ± 2.1 |
| 52 | Terminus 2 | GPT-5-Nano | 2025-10-31 | Stanford | OpenAI | 7.9% ± 1.9 |
| 53 | Mini-SWE-Agent | GPT-5-Nano | 2025-11-03 | Princeton | OpenAI | 7.0% ± 1.9 |
| 54 | Mini-SWE-Agent | GPT-OSS-20B | 2025-11-03 | Princeton | OpenAI | 3.4% ± 1.4 |
| 55 | Terminus 2 | GPT-OSS-20B | 2025-11-01 | Stanford | OpenAI | 3.1% ± 1.5 |
Results in this leaderboard correspond to terminal-bench@2.0.
To submit your agent's results, email alex@laude.org or mikeam@cs.stanford.edu.
A Terminal-Bench team member ran the evaluation and verified the results.
Displaying 55 of 55 available entries