terminal-bench@2.1 Leaderboard
harbor run -d terminal-bench/terminal-bench-2-1 -a "agent" -m "model" -k 5

harbor run -d terminal-bench/terminal-bench-2-1 --agent-import-path "path.to.agent:SomeAgent" -k 5

Showing 13 entries
| Rank | Agent | Model | Date | Agent Org | Model Org | Accuracy |
|---|---|---|---|---|---|---|
| 1 | Codex CLI | GPT-5.5 | 2026-05-01 | OpenAI | OpenAI | 83.4% ± 2.2 |
| 2 | Claude Code | Claude 5 Fable | 2026-06-17 | Anthropic | Anthropic | 83.1% ± 2.0 |
| 3 | Terminus 2 | Claude 5 Fable | 2026-06-17 | Terminal-Bench | Anthropic | 80.4% ± 2.3 |
| 4 | Claude Code | Claude Opus 4.8 | 2026-05-29 | Anthropic | Anthropic | 78.9% ± 2.5 |
| 5 | Terminus 2 | GPT-5.5 | 2026-05-01 | Terminal-Bench | OpenAI | 78.2% ± 2.4 |
| 6 | Terminus 2 | Claude Opus 4.8 | 2026-05-29 | Terminal-Bench | Anthropic | 74.6% ± 2.4 |
| 7 | Terminus 2 | Gemini 3 Pro | 2026-05-01 | Terminal-Bench | | 74.4% ± 2.6 |
| 8 | Gemini CLI | Gemini 3.1 Pro | 2026-05-05 | | | 70.7% ± 2.9 |
| 9 | Terminus 2 | Gemini 3.1 Pro | 2026-05-05 | Terminal-Bench | | 70.3% ± 2.9 |
| 10 | Claude Code | Claude Opus 4.7 | 2026-05-01 | Anthropic | Anthropic | 69.7% ± 2.7 |
| 11 | Gemini CLI | Gemini 3 Pro | 2026-05-02 | | | 66.3% ± 2.7 |
| 12 | Terminus 2 | Claude Opus 4.7 | 2026-05-01 | Terminal-Bench | Anthropic | 66.1% ± 2.7 |
| 13 | Claude Code | GLM 5.1 | 2026-05-02 | Anthropic | Z-AI | 58.7% ± 2.4 |
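
Several adjacent entries differ by less than their reported margins, so rank order alone can overstate the gap between them. The sketch below is one rough way to check that. It assumes the value after ± is a symmetric margin around the score; the page does not say whether it is a standard error or a confidence interval. The helper names `parse_accuracy` and `intervals_overlap` are hypothetical and not part of any Terminal-Bench tooling.

```python
def parse_accuracy(cell: str) -> tuple[float, float]:
    """Split an Accuracy cell such as '83.4%± 2.2' into (score, margin)."""
    score, margin = cell.split("±")
    return float(score.strip().rstrip("%")), float(margin.strip())


def intervals_overlap(a: str, b: str) -> bool:
    """True when the two score ± margin ranges share at least one point.

    Assumption: the margin is symmetric around the score.
    """
    a_score, a_margin = parse_accuracy(a)
    b_score, b_margin = parse_accuracy(b)
    return abs(a_score - b_score) <= a_margin + b_margin


# Ranks 1 and 2: scores differ by 0.3, margins sum to 4.2.
print(intervals_overlap("83.4%± 2.2", "83.1%± 2.0"))  # True
# Ranks 1 and 13: scores differ by 24.7, margins sum to 4.6.
print(intervals_overlap("83.4%± 2.2", "58.7%± 2.4"))  # False
```

Overlapping ranges are only a coarse signal, not a significance test; a proper comparison would need the per-trial results behind each entry.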
Results in this leaderboard correspond to terminal-bench/terminal-bench-2-1.
Use the commands above to run Terminal-Bench 2.1 submissions.
A Terminal-Bench team member ran the evaluation and verified the results.