# terminal-bench@1.0 Leaderboard
The results below were produced with the following evaluation command:

```shell
tb run -d terminal-bench-core==0.1.1 -a "<agent-name>" -m "<model-name>"
```
| Rank | Agent | Model | Date | Agent Org | Model Org | Accuracy |
|---|---|---|---|---|---|---|
| 1 | Apex2 | claude-4-5-sonnet | 2025-10-15 | Tian Jian Wang | Anthropic | 64.5% ± 1.1 |
| 2 | Abacus AI Desktop | Multiple | 2025-11-08 | Abacus.AI | Multiple | 62.3% ± 1.8 |
| 3 | Ante | claude-sonnet-4-5 | 2025-10-10 | Antigma Labs | Anthropic | 60.3% ± 2.1 |
| 4 | Droid | claude-opus-4-1 | 2025-09-24 | Factory | Anthropic | 58.8% ± 1.7 |
| 5 | Droid | claude-sonnet-4-5 | 2025-09-29 | Factory | Anthropic | 57.5% ± 1.5 |
| 6 | OB-1 | Multiple | 2025-09-10 | OpenBlock | Multiple | 56.7% ± 1.2 |
| 7 | Ante | claude-sonnet-4 | 2025-09-30 | Antigma Labs | Anthropic | 54.8% ± 2.9 |
| 8 | Droid | gpt-5 | 2025-09-24 | Factory | OpenAI | 52.5% ± 4.1 |
| 9 | Chaterm | claude-sonnet-4-5 | 2025-10-10 | Chaterm | Anthropic | 52.5% ± 1.0 |
| 10 | Warp | Multiple | 2025-06-23 | Warp | Anthropic | 52.0% ± 1.9 |
| 11 | Terminus 2 | claude-sonnet-4-5 | 2025-09-30 | Stanford | Anthropic | 51.0% ± 1.6 |
| 12 | Droid | claude-sonnet-4 | 2025-09-24 | Factory | Anthropic | 50.5% ± 2.6 |
| 13 | DeepAgent Desktop | claude-sonnet-4-5 | 2025-10-17 | Abacus.AI | Anthropic | 50.5% ± 1.0 |
| 14 | Apex2 | gpt-5 | 2025-10-15 | Tian Jian Wang | OpenAI | 49.3% ± 0.9 |
| 15 | Chaterm | claude-sonnet-4 | 2025-09-10 | Chaterm | Anthropic | 49.3% ± 2.5 |
| 16 | Chaterm | claude-4-5-sonnet | 2025-10-31 | Chaterm | Anthropic | 49.3% ± 2.5 |
| 17 | Goose | claude-opus-4 | 2025-09-03 | Block | Anthropic | 45.3% ± 2.9 |
| 18 | Engine Labs | claude-sonnet-4 | 2025-07-14 | Engine Labs | Anthropic | 44.8% ± 1.6 |
| 19 | Terminus 2 | claude-opus-4-1 | 2025-08-11 | Stanford | Anthropic | 43.8% ± 2.8 |
| 20 | Claude Code | claude-opus-4 | 2025-05-22 | Anthropic | Anthropic | 43.2% ± 2.5 |
| 21 | Codex CLI | gpt-5-codex | 2025-09-14 | OpenAI | OpenAI | 42.8% ± 4.2 |
| 22 | Letta | claude-sonnet-4 | 2025-08-04 | Letta | Anthropic | 42.5% ± 1.6 |
| 23 | Goose | claude-opus-4 | 2025-07-12 | Block | Anthropic | 42.0% ± 2.6 |
| 24 | iFlow CLI | minimax-m2 | 2025-11-11 | iFlow | MiniMax | 42.0% ± 1.6 |
| 25 | Terminus 2 | claude-haiku-4-5 | 2025-10-16 | Stanford | Anthropic | 41.8% ± 2.6 |
| 26 | OpenHands | claude-sonnet-4 | 2025-07-14 | OpenHands | Anthropic | 41.3% ± 1.3 |
| 27 | Terminus 2 | gpt-5 | 2025-08-11 | Stanford | OpenAI | 41.3% ± 2.2 |
| 28 | Goose | claude-sonnet-4 | 2025-09-03 | Block | Anthropic | 41.3% ± 2.5 |
| 29 | Orchestrator | claude-opus-4-1 | 2025-09-23 | Dan Austin | Anthropic | 40.5% ± 0.6 |
| 30 | Terminus 1 | GLM-4.5 | 2025-07-31 | Stanford | Z.ai | 39.9% ± 1.9 |
| 31 | iFlow CLI | qwen3-coder-480b-a35b-instruct | 2025-10-24 | iFlow | Alibaba | 39.0% ± 0.4 |
| 32 | Terminus 2 | claude-opus-4 | 2025-08-05 | Stanford | Anthropic | 39.0% ± 0.8 |
| 33 | Terminus 2 | grok-4 | 2025-10-14 | Stanford | xAI | 39.0% ± 3.2 |
| 34 | Alpha | claude-sonnet-4-5 | 2025-10-12 | Ataraxy Labs Inc. | Anthropic | 38.3% ± 2.1 |
| 35 | Orchestrator | claude-sonnet-4 | 2025-09-01 | Dan Austin | Anthropic | 37.0% ± 3.9 |
| 36 | Terminus 2 | claude-sonnet-4 | 2025-08-05 | Stanford | Anthropic | 36.4% ± 1.2 |
| 37 | Claude Code | claude-sonnet-4 | 2025-05-22 | Anthropic | Anthropic | 35.5% ± 2.0 |
| 38 | Terminus 1 | glaive-swe-v1 | 2025-08-14 | Stanford | OpenAI | 35.3% ± 1.4 |
| 39 | Claude Code | claude-3-7-sonnet | 2025-05-16 | Anthropic | Anthropic | 35.2% ± 2.6 |
| 40 | CAMEL | gpt-4.1 | 2025-10-31 | CAMEL-AI | OpenAI | 35.0% ± 7.8 |
| 41 | Goose | claude-sonnet-4 | 2025-07-12 | Block | Anthropic | 34.3% ± 1.9 |
| 42 | Terminus 2 | grok-4-fast | 2025-09-21 | Stanford | xAI | 31.3% ± 2.8 |
| 43 | Terminus 2 | gpt-5-mini | 2025-10-20 | Stanford | OpenAI | 30.8% ± 3.9 |
| 44 | Terminus 1 | claude-3-7-sonnet | 2025-05-16 | Stanford | Anthropic | 30.6% ± 3.8 |
| 45 | Terminus 1 | gpt-4.1 | 2025-05-15 | Stanford | OpenAI | 30.3% ± 4.2 |
| 46 | Terminus 1 | o3 | 2025-05-15 | Stanford | OpenAI | 30.2% ± 1.8 |
| 47 | Terminus 1 | gpt-5 | 2025-08-07 | Stanford | OpenAI | 30.0% ± 1.7 |
| 48 | Goose | o4-mini | 2025-05-18 | Block | OpenAI | 27.5% ± 2.5 |
| 49 | Terminus 1 | gemini-2.5-pro | 2025-05-15 | Stanford | Google | 25.3% ± 5.4 |
| 50 | Codex CLI | o4-mini | 2025-05-15 | OpenAI | OpenAI | 20.0% ± 2.9 |
| 51 | Orchestrator | Qwen3-Coder-480B | 2025-09-01 | Dan Austin | Alibaba | 19.7% ± 3.9 |
| 52 | Terminus 1 | o4-mini | 2025-05-15 | Stanford | OpenAI | 18.5% ± 2.7 |
| 53 | Terminus 1 | grok-3-beta | 2025-05-17 | Stanford | xAI | 17.5% ± 8.3 |
| 54 | Terminus 1 | gemini-2.5-flash | 2025-05-17 | Stanford | Google | 16.8% ± 2.5 |
| 55 | Terminus 1 | Llama-4-Maverick-17B | 2025-05-15 | Stanford | Meta | 15.5% ± 3.3 |
| 56 | TerminalAgent | Qwen3-32B | 2025-07-31 | Dan Austin | Alibaba | 15.5% ± 2.2 |
| 57 | Mini SWE-Agent | claude-sonnet-4 | 2025-08-23 | SWE-Agent | Anthropic | 12.8% ± 0.3 |
| 58 | Terminus 2 | gpt-5-nano | 2025-10-20 | Stanford | OpenAI | 12.2% ± 2.9 |
| 59 | Codex CLI | codex-mini-latest | 2025-05-18 | OpenAI | OpenAI | 11.3% ± 3.1 |
| 60 | Codex CLI | gpt-4.1 | 2025-05-15 | OpenAI | OpenAI | 8.3% ± 2.8 |
| 61 | Terminus 1 | Qwen3-235B | 2025-05-15 | Stanford | Alibaba | 6.6% ± 2.8 |
| 62 | Terminus 1 | DeepSeek-R1 | 2025-05-15 | Stanford | DeepSeek | 5.7% ± 1.4 |
Results in this leaderboard correspond to `terminal-bench-core==0.1.1`.
Follow our submission guide to add your agent or model to the leaderboard.
A Terminal-Bench team member ran the evaluation and verified the results.