terminal-bench@2.0 Leaderboard

Note: submissions may not modify timeouts or resources
harbor run -d terminal-bench/terminal-bench-2 -a "agent" -m "model" -k 5
harbor run -d terminal-bench/terminal-bench-2 --agent-import-path "path.to.agent:SomeAgent" -k 5
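The two invocations above differ only in how the agent is specified. A minimal sketch of assembling them programmatically follows, using only the flags that appear in the commands above (`-d`, `-a`, `-m`, `--agent-import-path`, `-k`). The helper name `build_harbor_command` is hypothetical, and treating `-a` and `--agent-import-path` as mutually exclusive is an assumption rather than documented behavior. The sketch only builds the argument list; it does not run anything.

```python
import shlex
from typing import List, Optional


def build_harbor_command(
    dataset: str = "terminal-bench/terminal-bench-2",
    agent: Optional[str] = None,
    model: Optional[str] = None,
    agent_import_path: Optional[str] = None,
    k: int = 5,
) -> List[str]:
    """Assemble an argv list for `harbor run` (hypothetical helper).

    Only flags shown in the commands above are emitted.
    """
    # Assumption: exactly one way of naming the agent is given per run.
    if (agent is None) == (agent_import_path is None):
        raise ValueError("pass exactly one of agent or agent_import_path")

    cmd = ["harbor", "run", "-d", dataset]
    if agent is not None:
        cmd += ["-a", agent]
    else:
        cmd += ["--agent-import-path", agent_import_path]
    if model is not None:
        cmd += ["-m", model]
    cmd += ["-k", str(k)]
    return cmd


if __name__ == "__main__":
    # Print shell-quoted command lines instead of executing them.
    print(shlex.join(build_harbor_command(agent="agent", model="model")))
    print(shlex.join(build_harbor_command(agent_import_path="path.to.agent:SomeAgent")))
```

Since submissions may not modify timeouts or resources, the helper deliberately exposes no parameters for either. To actually launch a run, the returned list could be passed to `subprocess.run`.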

Showing 11 entries (verified only)
Rank  Agent                  Model          Date        Agent Org                   Model Org  Accuracy
2     LemonHarness           Multiple       2026-05-14  LR AILab of Lenovo CTO Org  Multiple   84.5% ± 2.6
9     SageAgent              GPT-5.3-Codex  2026-03-13  OpenSage                    OpenAI     78.4% ± 2.2
10    Droid                  GPT-5.3-Codex  2026-02-24  Factory                     OpenAI     77.3% ± 2.2
12    CodeBrain-1.5          GPT-5.3-Codex  2026-02-10  Feeling AI                  OpenAI     75.8% ± 2.0
13    Codelia                GPT-5.3-Codex  2026-05-14  kousw                       OpenAI     75.7% ± 2.2
15    Simple Codex           GPT-5.3-Codex  2026-02-06  OpenAI                      OpenAI     75.1% ± 2.4
18    Mux                    GPT-5.3-Codex  2026-03-06  Coder                       OpenAI     74.6% ± 2.5
21    spoox-o-m              GPT-5.3-Codex  2026-05-15  TUM                         OpenAI     71.5% ± 2.5
22    Junie CLI              Multiple       2026-03-07  JetBrains                   Multiple   71.0% ± 2.9
25    IndusAGI Coding Agent  GPT-5.3-Codex  2026-03-18  Varun Israni (SoloVpx)      OpenAI     69.1% ± 2.3
32    Terminus 2             GPT-5.3-Codex  2026-02-05  Terminal-Bench              OpenAI     64.7% ± 2.7

Results in this leaderboard correspond to terminal-bench@2.0.

Submission instructions can be found at harborframework/terminal-bench-2-leaderboard

Verified: a Terminal-Bench team member ran the evaluation and verified the results.

Displaying 11 of 142 available entries