terminal-bench@2.0 Leaderboard
To run the benchmark with Harbor:

```shell
harbor run -d terminal-bench@2.0 -a "agent" -m "model" -k 5
```

or, with a custom agent:

```shell
harbor run -d terminal-bench@2.0 --agent-import-path "path.to.agent:SomeAgent" -k 5
```
| Rank | Agent | Model | Date | Agent Org | Model Org | Accuracy |
|---|---|---|---|---|---|---|
| 24 | Terminus 2 | GPT-5.3-Codex | 2026-02-05 | AfterQuery | OpenAI | 64.7% ± 2.7 |
| 27 | Terminus 2 | Claude Opus 4.6 | 2026-02-06 | AfterQuery | Anthropic | 62.9% ± 2.7 |
| 41 | Terminus 2 | Claude Opus 4.5 | 2025-11-22 | AfterQuery | Anthropic | 57.8% ± 2.5 |
| 43 | Terminus 2 | Gemini 3 Pro | 2025-11-21 | AfterQuery | Google | 56.9% ± 2.5 |
| 46 | Terminus 2 | GPT-5.2 | 2025-12-12 | AfterQuery | OpenAI | 54.0% ± 2.9 |
| 48 | Terminus 2 | GLM 5 | 2026-02-23 | AfterQuery | Z.ai | 52.4% ± 2.6 |
| 52 | Terminus 2 | Gemini 3 Flash | 2026-01-07 | AfterQuery | Google | 51.7% ± 3.1 |
| 55 | Terminus 2 | GPT-5.1 | 2025-11-16 | AfterQuery | OpenAI | 47.6% ± 2.8 |
| 60 | Terminus 2 | GPT-5-Codex | 2025-10-31 | AfterQuery | OpenAI | 43.4% ± 2.9 |
| 61 | Terminus 2 | Kimi K2.5 | 2026-02-04 | AfterQuery | Moonshot AI | 43.2% ± 2.9 |
| 64 | Terminus 2 | Claude Sonnet 4.5 | 2025-10-31 | AfterQuery | Anthropic | 42.8% ± 2.8 |
| 69 | Terminus 2 | MiniMax M2.5 | 2026-02-23 | AfterQuery | MiniMax | 42.2% ± 2.6 |
| 72 | Terminus 2 | DeepSeek-V3.2 | 2026-02-10 | AfterQuery | DeepSeek | 39.6% ± 2.8 |
| 73 | Terminus 2 | Claude Opus 4.1 | 2025-10-31 | AfterQuery | Anthropic | 38.0% ± 2.6 |
| 75 | Terminus 2 | GPT-5.1-Codex | 2025-11-17 | AfterQuery | OpenAI | 36.9% ± 3.2 |
| 77 | Terminus 2 | Kimi K2 Thinking | 2025-11-11 | AfterQuery | Moonshot AI | 35.7% ± 2.8 |
| 79 | Terminus 2 | GPT-5 | 2025-10-31 | AfterQuery | OpenAI | 35.2% ± 3.1 |
| 84 | Terminus 2 | GLM 4.7 | 2026-01-28 | AfterQuery | Z.ai | 33.4% ± 2.8 |
| 86 | Terminus 2 | Gemini 2.5 Pro | 2025-10-31 | AfterQuery | Google | 32.6% ± 3.0 |
| 88 | Terminus 2 | MiniMax M2 | 2025-11-01 | AfterQuery | MiniMax | 30.0% ± 2.7 |
| 90 | Terminus 2 | MiniMax M2.1 | 2025-12-23 | AfterQuery | MiniMax | 29.2% ± 2.9 |
| 92 | Terminus 2 | Claude Haiku 4.5 | 2025-10-31 | AfterQuery | Anthropic | 28.3% ± 2.9 |
| 93 | Terminus 2 | Kimi K2 Instruct | 2025-11-01 | AfterQuery | Moonshot AI | 27.8% ± 2.5 |
| 102 | Terminus 2 | GLM 4.6 | 2025-11-01 | AfterQuery | Z.ai | 24.5% ± 2.4 |
| 103 | Terminus 2 | GPT-5-Mini | 2025-10-31 | AfterQuery | OpenAI | 24.0% ± 2.5 |
| 104 | Terminus 2 | Qwen 3 Coder 480B | 2025-11-01 | AfterQuery | Alibaba | 23.9% ± 2.8 |
| 105 | Terminus 2 | Grok 4 | 2025-10-31 | AfterQuery | xAI | 23.1% ± 2.9 |
| 108 | Terminus 2 | GPT-OSS-120B | 2025-11-01 | AfterQuery | OpenAI | 18.7% ± 2.7 |
| 110 | Terminus 2 | AfterQuery-GPT-OSS-20B | 2026-03-31 | AfterQuery | AfterQuery | 17.0% ± 2.5 |
| 111 | Terminus 2 | Gemini 2.5 Flash | 2025-10-31 | AfterQuery | Google | 16.9% ± 2.4 |
| 115 | Terminus 2 | Grok Code Fast 1 | 2025-10-31 | AfterQuery | xAI | 14.2% ± 2.5 |
| 120 | Terminus 2 | GPT-5-Nano | 2025-10-31 | AfterQuery | OpenAI | 7.9% ± 1.9 |
| 123 | Terminus 2 | GPT-OSS-20B | 2025-11-01 | AfterQuery | OpenAI | 3.1% ± 1.5 |
Results in this leaderboard correspond to terminal-bench@2.0.
Submission instructions can be found in the harborframework/terminal-bench-2-leaderboard repository.
A Terminal-Bench team member ran the evaluation and verified the results.
The table above shows the 33 Terminus 2 entries out of 123 total leaderboard entries.
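For readers who want to work with these results programmatically, here is a minimal sketch that filters rows by accuracy. The data is abbreviated to the top five rows of the table above, and `above_threshold` is an illustrative helper, not part of Harbor or the leaderboard tooling.

```python
# Top leaderboard rows as (model, accuracy %, ± margin) tuples,
# transcribed from the table above.
ROWS = [
    ("GPT-5.3-Codex", 64.7, 2.7),
    ("Claude Opus 4.6", 62.9, 2.7),
    ("Claude Opus 4.5", 57.8, 2.5),
    ("Gemini 3 Pro", 56.9, 2.5),
    ("GPT-5.2", 54.0, 2.9),
]

def above_threshold(rows, threshold):
    """Return model names whose reported accuracy exceeds the threshold."""
    return [name for name, accuracy, _margin in rows if accuracy > threshold]

print(above_threshold(ROWS, 55.0))
```

The reported ± margins are left untouched here; comparing two entries whose intervals overlap (e.g. 64.7 ± 2.7 vs 62.9 ± 2.7) should be done with appropriate caution.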