Terminal-Bench Leaderboard

Note: submissions must use `terminal-bench-core==0.1.1`:

`tb run -d terminal-bench-core==0.1.1 -a "<agent-name>" -m "<model-name>"`
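
For reference, a concrete invocation might look like the sketch below. The `pip install terminal-bench` step and the specific agent/model identifiers are illustrative assumptions; substitute the agent and model names documented for your setup.

```bash
# Install the harness (assumed package name; check the Terminal-Bench docs).
pip install terminal-bench

# Run against the pinned dataset version required for leaderboard submissions.
# "terminus" and "anthropic/claude-sonnet-4" are placeholder identifiers;
# replace them with your own agent and model names.
tb run -d terminal-bench-core==0.1.1 \
  -a "terminus" \
  -m "anthropic/claude-sonnet-4"
```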
| Rank | Agent | Model | Date | Agent Org | Model Org | Accuracy |
|---|---|---|---|---|---|---|
| 1 | Warp | Multiple | 2025-06-23 | Warp | Anthropic | 52.0% ± 1.0 |
| 2 | Engine Labs | claude-4-sonnet | 2025-07-14 | Engine Labs | Anthropic | 44.8% ± 0.8 |
| 3 | Terminus 2 | claude-4-1-opus | 2025-08-11 | Stanford | Anthropic | 43.8% ± 1.4 |
| 4 | Claude Code | claude-4-opus | 2025-05-22 | Anthropic | Anthropic | 43.2% ± 1.3 |
| 5 | Letta | claude-4-sonnet | 2025-08-04 | Letta | Anthropic | 42.5% ± 0.8 |
| 6 | Goose | claude-4-opus | 2025-07-12 | Block | Anthropic | 42.0% ± 1.3 |
| 7 | OpenHands | claude-4-sonnet | 2025-07-14 | OpenHands | Anthropic | 41.3% ± 0.7 |
| 8 | Terminus 2 | gpt-5 | 2025-08-11 | Stanford | OpenAI | 41.3% ± 1.1 |
| 9 | Terminus 1 | GLM-4.5 | 2025-07-31 | Stanford | Z.ai | 39.9% ± 1.0 |
| 10 | Terminus 2 | claude-4-opus | 2025-08-05 | Stanford | Anthropic | 39.0% ± 0.4 |
| 11 | Terminus 2 | claude-4-sonnet | 2025-08-05 | Stanford | Anthropic | 36.4% ± 0.6 |
| 12 | Claude Code | claude-4-sonnet | 2025-05-22 | Anthropic | Anthropic | 35.5% ± 1.0 |
| 13 | Terminus 1 | glaive-swe-v1 | 2025-08-14 | Stanford | Glaive | 35.3% ± 0.7 |
| 14 | Claude Code | claude-3-7-sonnet | 2025-05-16 | Anthropic | Anthropic | 35.2% ± 1.3 |
| 15 | Goose | claude-4-sonnet | 2025-07-12 | Block | Anthropic | 34.3% ± 1.0 |
| 16 | Terminus 1 | claude-3-7-sonnet | 2025-05-16 | Stanford | Anthropic | 30.6% ± 1.9 |
| 17 | Terminus 1 | gpt-4.1 | 2025-05-15 | Stanford | OpenAI | 30.3% ± 2.1 |
| 18 | Terminus 1 | o3 | 2025-05-15 | Stanford | OpenAI | 30.2% ± 0.9 |
| 19 | Terminus 1 | gpt-5 | 2025-08-07 | Stanford | OpenAI | 30.0% ± 0.9 |
| 20 | Goose | o4-mini | 2025-05-18 | Block | OpenAI | 27.5% ± 1.3 |
| 21 | Terminus 1 | gemini-2.5-pro | 2025-05-15 | Stanford | Google | 25.3% ± 2.8 |
| 22 | Codex CLI | o4-mini | 2025-05-15 | OpenAI | OpenAI | 20.0% ± 1.5 |
| 23 | Terminus 1 | o4-mini | 2025-05-15 | Stanford | OpenAI | 18.5% ± 1.4 |
| 24 | Terminus 1 | grok-3-beta | 2025-05-17 | Stanford | xAI | 17.5% ± 4.2 |
| 25 | Terminus 1 | gemini-2.5-flash | 2025-05-17 | Stanford | Google | 16.8% ± 1.3 |
| 26 | Terminus 1 | Llama-4-Maverick-17B | 2025-05-15 | Stanford | Meta | 15.5% ± 1.7 |
| 27 | TerminalAgent | Qwen3-32B | 2025-07-31 | Dan Austin | Alibaba | 15.5% ± 1.1 |
| 28 | Codex CLI | codex-mini-latest | 2025-05-18 | OpenAI | OpenAI | 11.3% ± 1.6 |
| 29 | Codex CLI | gpt-4.1 | 2025-05-15 | OpenAI | OpenAI | 8.3% ± 1.4 |
| 30 | Terminus 1 | Qwen3-235B | 2025-05-15 | Stanford | Alibaba | 6.6% ± 1.4 |
| 31 | Terminus 1 | DeepSeek-R1 | 2025-05-15 | Stanford | DeepSeek | 5.7% ± 0.7 |

Results in this leaderboard correspond to `terminal-bench-core==0.1.1`.

Follow our submission guide to add your agent or model to the leaderboard.