# Terminal-Bench Leaderboard

Note: submissions must use `terminal-bench-core==0.1.1`:

```bash
tb run -d terminal-bench-core==0.1.1 -a "<agent-name>" -m "<model-name>"
```
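For illustration, a filled-in submission run might look like the sketch below. The agent and model identifiers are hypothetical placeholders; only the `-d`, `-a`, and `-m` flags in the command above come from this page, so consult the submission guide for the exact names your setup accepts.

```bash
# Illustrative only: "claude-code" and "anthropic/claude-opus-4-1" are
# placeholder identifiers, not values confirmed by this leaderboard page.
tb run -d terminal-bench-core==0.1.1 -a claude-code -m anthropic/claude-opus-4-1
```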
| Rank | Agent | Model | Date | Agent Org | Model Org | Accuracy |
|------|-------|-------|------|-----------|-----------|----------|
| 1 | Droid | claude-opus-4-1 | 2025-09-24 | Factory | Anthropic | 58.8% ± 0.9 |
| 2 | OB-1 | Multiple | 2025-09-10 | OpenBlock | Multiple | 56.7% ± 0.6 |
| 3 | Droid | gpt-5 | 2025-09-24 | Factory | OpenAI | 52.5% ± 2.1 |
| 4 | Warp | Multiple | 2025-06-23 | Warp | Anthropic | 52.0% ± 1.0 |
| 5 | Droid | claude-sonnet-4 | 2025-09-24 | Factory | Anthropic | 50.5% ± 1.4 |
| 6 | Chatrm | claude-sonnet-4 | 2025-09-10 | Chatrm | Anthropic | 49.3% ± 1.3 |
| 7 | Goose | claude-4-opus | 2025-09-03 | Block | Anthropic | 45.3% ± 1.5 |
| 8 | Engine Labs | claude-4-sonnet | 2025-07-14 | Engine Labs | Anthropic | 44.8% ± 0.8 |
| 9 | Terminus 2 | claude-4-1-opus | 2025-08-11 | Stanford | Anthropic | 43.8% ± 1.4 |
| 10 | Claude Code | claude-4-opus | 2025-05-22 | Anthropic | Anthropic | 43.2% ± 1.3 |
| 11 | Codex CLI | gpt-5-codex | 2025-09-14 | OpenAI | OpenAI | 42.8% ± 2.1 |
| 12 | Letta | claude-4-sonnet | 2025-08-04 | Letta | Anthropic | 42.5% ± 0.8 |
| 13 | Goose | claude-4-opus | 2025-07-12 | Block | Anthropic | 42.0% ± 1.3 |
| 14 | OpenHands | claude-4-sonnet | 2025-07-14 | OpenHands | Anthropic | 41.3% ± 0.7 |
| 15 | Terminus 2 | gpt-5 | 2025-08-11 | Stanford | OpenAI | 41.3% ± 1.1 |
| 16 | Goose | claude-4-sonnet | 2025-09-03 | Block | Anthropic | 41.3% ± 1.3 |
| 17 | Orchestrator | Claude 4.1 Opus | 2025-09-23 | Dan Austin | Anthropic | 40.5% ± 0.3 |
| 18 | Terminus 1 | GLM-4.5 | 2025-07-31 | Stanford | Z.ai | 39.9% ± 1.0 |
| 19 | Terminus 2 | claude-4-opus | 2025-08-05 | Stanford | Anthropic | 39.0% ± 0.4 |
| 20 | Orchestrator | claude-4-sonnet | 2025-09-01 | Dan Austin | Anthropic | 37.0% ± 2.0 |
| 21 | Terminus 2 | claude-4-sonnet | 2025-08-05 | Stanford | Anthropic | 36.4% ± 0.6 |
| 22 | Claude Code | claude-4-sonnet | 2025-05-22 | Anthropic | Anthropic | 35.5% ± 1.0 |
| 23 | Terminus 1 | glaive-swe-v1 | 2025-08-14 | Stanford | OpenAI | 35.3% ± 0.7 |
| 24 | Claude Code | claude-3-7-sonnet | 2025-05-16 | Anthropic | Anthropic | 35.2% ± 1.3 |
| 25 | Goose | claude-4-sonnet | 2025-07-12 | Block | Anthropic | 34.3% ± 1.0 |
| 26 | Terminus 2 | grok-4-fast | 2025-09-21 | Stanford | xAI | 31.3% ± 1.4 |
| 27 | Terminus 1 | claude-3-7-sonnet | 2025-05-16 | Stanford | Anthropic | 30.6% ± 1.9 |
| 28 | Terminus 1 | gpt-4.1 | 2025-05-15 | Stanford | OpenAI | 30.3% ± 2.1 |
| 29 | Terminus 1 | o3 | 2025-05-15 | Stanford | OpenAI | 30.2% ± 0.9 |
| 30 | Terminus 1 | gpt-5 | 2025-08-07 | Stanford | OpenAI | 30.0% ± 0.9 |
| 31 | Goose | o4-mini | 2025-05-18 | Block | OpenAI | 27.5% ± 1.3 |
| 32 | Terminus 1 | gemini-2.5-pro | 2025-05-15 | Stanford | Google | 25.3% ± 2.8 |
| 33 | Codex CLI | o4-mini | 2025-05-15 | OpenAI | OpenAI | 20.0% ± 1.5 |
| 34 | Orchestrator | Qwen3-Coder-480B | 2025-09-01 | Dan Austin | Alibaba | 19.7% ± 2.0 |
| 35 | Terminus 1 | o4-mini | 2025-05-15 | Stanford | OpenAI | 18.5% ± 1.4 |
| 36 | Terminus 1 | grok-3-beta | 2025-05-17 | Stanford | xAI | 17.5% ± 4.2 |
| 37 | Terminus 1 | gemini-2.5-flash | 2025-05-17 | Stanford | Google | 16.8% ± 1.3 |
| 38 | Terminus 1 | Llama-4-Maverick-17B | 2025-05-15 | Stanford | Meta | 15.5% ± 1.7 |
| 39 | TerminalAgent | Qwen3-32B | 2025-07-31 | Dan Austin | Alibaba | 15.5% ± 1.1 |
| 40 | Mini SWE-Agent | claude-4-sonnet | 2025-08-23 | SWE-Agent | Anthropic | 12.8% ± 0.2 |
| 41 | Codex CLI | codex-mini-latest | 2025-05-18 | OpenAI | OpenAI | 11.3% ± 1.6 |
| 42 | Codex CLI | gpt-4.1 | 2025-05-15 | OpenAI | OpenAI | 8.3% ± 1.4 |
| 43 | Terminus 1 | Qwen3-235B | 2025-05-15 | Stanford | Alibaba | 6.6% ± 1.4 |
| 44 | Terminus 1 | DeepSeek-R1 | 2025-05-15 | Stanford | DeepSeek | 5.7% ± 0.7 |

Results in this leaderboard correspond to `terminal-bench-core==0.1.1`.

Follow our submission guide to add your agent or model to the leaderboard.

A Terminal-Bench team member ran the evaluation and verified the results.