terminal-bench@1.0 Leaderboard

Note: submissions must use terminal-bench-core==0.1.1. Run the benchmark with:
tb run -d terminal-bench-core==0.1.1 -a "<agent-name>" -m "<model-name>"

Showing 62 entries

Rank | Agent | Model | Date | Agent Org | Model Org | Accuracy
1 | Apex2 | claude-4-5-sonnet | 2025-10-15 | Tian Jian Wang | Anthropic | 64.5% ± 1.1
2 | Abacus AI Desktop | Multiple | 2025-11-08 | Abacus.AI | Multiple | 62.3% ± 1.8
3 | Ante | claude-sonnet-4-5 | 2025-10-10 | Antigma Labs | Anthropic | 60.3% ± 2.1
4 | Droid | claude-opus-4-1 | 2025-09-24 | Factory | Anthropic | 58.8% ± 1.7
5 | Droid | claude-sonnet-4-5 | 2025-09-29 | Factory | Anthropic | 57.5% ± 1.5
6 | OB-1 | Multiple | 2025-09-10 | OpenBlock | Multiple | 56.7% ± 1.2
7 | Ante | claude-sonnet-4 | 2025-09-30 | Antigma Labs | Anthropic | 54.8% ± 2.9
8 | Droid | gpt-5 | 2025-09-24 | Factory | OpenAI | 52.5% ± 4.1
9 | Chaterm | claude-sonnet-4-5 | 2025-10-10 | Chaterm | Anthropic | 52.5% ± 1.0
10 | Warp | Multiple | 2025-06-23 | Warp | Anthropic | 52.0% ± 1.9
11 | Terminus 2 | claude-sonnet-4-5 | 2025-09-30 | Stanford | Anthropic | 51.0% ± 1.6
12 | Droid | claude-sonnet-4 | 2025-09-24 | Factory | Anthropic | 50.5% ± 2.6
13 | DeepAgent Desktop | claude-sonnet-4-5 | 2025-10-17 | Abacus.AI | Anthropic | 50.5% ± 1.0
14 | Apex2 | gpt-5 | 2025-10-15 | Tian Jian Wang | OpenAI | 49.3% ± 0.9
15 | Chaterm | claude-sonnet-4 | 2025-09-10 | Chaterm | Anthropic | 49.3% ± 2.5
16 | Chaterm | claude-4-5-sonnet | 2025-10-31 | Chaterm | Anthropic | 49.3% ± 2.5
17 | Goose | claude-opus-4 | 2025-09-03 | Block | Anthropic | 45.3% ± 2.9
18 | Engine Labs | claude-sonnet-4 | 2025-07-14 | Engine Labs | Anthropic | 44.8% ± 1.6
19 | Terminus 2 | claude-opus-4-1 | 2025-08-11 | Stanford | Anthropic | 43.8% ± 2.8
20 | Claude Code | claude-opus-4 | 2025-05-22 | Anthropic | Anthropic | 43.2% ± 2.5
21 | Codex CLI | gpt-5-codex | 2025-09-14 | OpenAI | OpenAI | 42.8% ± 4.2
22 | Letta | claude-sonnet-4 | 2025-08-04 | Letta | Anthropic | 42.5% ± 1.6
23 | Goose | claude-opus-4 | 2025-07-12 | Block | Anthropic | 42.0% ± 2.6
24 | iFlow CLI | minimax-m2 | 2025-11-11 | iFlow | Alibaba | 42.0% ± 1.6
25 | Terminus 2 | claude-haiku-4-5 | 2025-10-16 | Stanford | Anthropic | 41.8% ± 2.6
26 | OpenHands | claude-sonnet-4 | 2025-07-14 | OpenHands | Anthropic | 41.3% ± 1.3
27 | Terminus 2 | gpt-5 | 2025-08-11 | Stanford | OpenAI | 41.3% ± 2.2
28 | Goose | claude-sonnet-4 | 2025-09-03 | Block | Anthropic | 41.3% ± 2.5
29 | Orchestrator | claude-opus-4-1 | 2025-09-23 | Dan Austin | Anthropic | 40.5% ± 0.6
30 | Terminus 1 | GLM-4.5 | 2025-07-31 | Stanford | Z.ai | 39.9% ± 1.9
31 | iFlow CLI | qwen3-coder-480b-a35b-instruct | 2025-10-24 | iFlow | Alibaba | 39.0% ± 0.4
32 | Terminus 2 | claude-opus-4 | 2025-08-05 | Stanford | Anthropic | 39.0% ± 0.8
33 | Terminus 2 | grok-4 | 2025-10-14 | Stanford | xAI | 39.0% ± 3.2
34 | Alpha | claude-sonnet-4-5 | 2025-10-12 | Ataraxy Labs Inc. | Anthropic | 38.3% ± 2.1
35 | Orchestrator | claude-sonnet-4 | 2025-09-01 | Dan Austin | Anthropic | 37.0% ± 3.9
36 | Terminus 2 | claude-sonnet-4 | 2025-08-05 | Stanford | Anthropic | 36.4% ± 1.2
37 | Claude Code | claude-sonnet-4 | 2025-05-22 | Anthropic | Anthropic | 35.5% ± 2.0
38 | Terminus 1 | glaive-swe-v1 | 2025-08-14 | Stanford | OpenAI | 35.3% ± 1.4
39 | Claude Code | claude-3-7-sonnet | 2025-05-16 | Anthropic | Anthropic | 35.2% ± 2.6
40 | CAMEL | gpt-4.1 | 2025-10-31 | CAMEL-AI | OpenAI | 35.0% ± 7.8
41 | Goose | claude-sonnet-4 | 2025-07-12 | Block | Anthropic | 34.3% ± 1.9
42 | Terminus 2 | grok-4-fast | 2025-09-21 | Stanford | xAI | 31.3% ± 2.8
43 | Terminus 2 | gpt-5-mini | 2025-10-20 | Stanford | OpenAI | 30.8% ± 3.9
44 | Terminus 1 | claude-3-7-sonnet | 2025-05-16 | Stanford | Anthropic | 30.6% ± 3.8
45 | Terminus 1 | gpt-4.1 | 2025-05-15 | Stanford | OpenAI | 30.3% ± 4.2
46 | Terminus 1 | o3 | 2025-05-15 | Stanford | OpenAI | 30.2% ± 1.8
47 | Terminus 1 | gpt-5 | 2025-08-07 | Stanford | OpenAI | 30.0% ± 1.7
48 | Goose | o4-mini | 2025-05-18 | Block | OpenAI | 27.5% ± 2.5
49 | Terminus 1 | gemini-2.5-pro | 2025-05-15 | Stanford | Google | 25.3% ± 5.4
50 | Codex CLI | o4-mini | 2025-05-15 | OpenAI | OpenAI | 20.0% ± 2.9
51 | Orchestrator | Qwen3-Coder-480B | 2025-09-01 | Dan Austin | Alibaba | 19.7% ± 3.9
52 | Terminus 1 | o4-mini | 2025-05-15 | Stanford | OpenAI | 18.5% ± 2.7
53 | Terminus 1 | grok-3-beta | 2025-05-17 | Stanford | xAI | 17.5% ± 8.3
54 | Terminus 1 | gemini-2.5-flash | 2025-05-17 | Stanford | Google | 16.8% ± 2.5
55 | Terminus 1 | Llama-4-Maverick-17B | 2025-05-15 | Stanford | Meta | 15.5% ± 3.3
56 | TerminalAgent | Qwen3-32B | 2025-07-31 | Dan Austin | Alibaba | 15.5% ± 2.2
57 | Mini SWE-Agent | claude-sonnet-4 | 2025-08-23 | SWE-Agent | Anthropic | 12.8% ± 0.3
58 | Terminus 2 | gpt-5-nano | 2025-10-20 | Stanford | OpenAI | 12.2% ± 2.9
59 | Codex CLI | codex-mini-latest | 2025-05-18 | OpenAI | OpenAI | 11.3% ± 3.1
60 | Codex CLI | gpt-4.1 | 2025-05-15 | OpenAI | OpenAI | 8.3% ± 2.8
61 | Terminus 1 | Qwen3-235B | 2025-05-15 | Stanford | Alibaba | 6.6% ± 2.8
62 | Terminus 1 | DeepSeek-R1 | 2025-05-15 | Stanford | DeepSeek | 5.7% ± 1.4

Results in this leaderboard correspond to terminal-bench-core==0.1.1.
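
When reading the accuracy column, it can help to check whether two entries' ± ranges overlap before treating a rank difference as meaningful. The sketch below is only an illustration: the leaderboard does not say what the ± value represents, so the code assumes it is a symmetric margin in percentage points, and the helper names `interval` and `overlaps` are hypothetical.

```python
# Minimal sketch. ASSUMPTION: the "±" value is a symmetric margin in
# percentage points around the reported accuracy (not stated by the source).

def interval(accuracy, margin):
    """Return the (low, high) range implied by accuracy ± margin."""
    return (accuracy - margin, accuracy + margin)

def overlaps(a, b):
    """True if the ± ranges of two (accuracy, margin) entries intersect."""
    lo_a, hi_a = interval(*a)
    lo_b, hi_b = interval(*b)
    return lo_a <= hi_b and lo_b <= hi_a

# Values copied from the leaderboard rows for ranks 1, 2, and 4.
rank_1 = (64.5, 1.1)  # Apex2 / claude-4-5-sonnet
rank_2 = (62.3, 1.8)  # Abacus AI Desktop / Multiple
rank_4 = (58.8, 1.7)  # Droid / claude-opus-4-1

print(overlaps(rank_1, rank_2))  # ranges 63.4-65.6 and 60.5-64.1 intersect
print(overlaps(rank_1, rank_4))  # ranges 63.4-65.6 and 57.1-60.5 do not
```

Under that assumption, ranks 1 and 2 are not cleanly separated, while ranks 1 and 4 are.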

Follow our submission guide to add your agent or model to the leaderboard.

A Terminal-Bench team member ran the evaluation and verified the results.
