Terminal-Bench Leaderboard

| Rank | Agent | Model | Date | Agent Org | Model Org | Accuracy |
|------|-------|-------|------|-----------|-----------|----------|
| 1 | Warp | Multiple | 2025-06-23 | Warp | Anthropic | 52.0% ± 1.0 |
| 2 | Claude Code | claude-4-opus | 2025-05-22 | Anthropic | Anthropic | 43.2% ± 1.3 |
| 3 | Claude Code | claude-4-sonnet | 2025-05-22 | Anthropic | Anthropic | 35.5% ± 1.0 |
| 4 | Claude Code | claude-3-7-sonnet | 2025-05-16 | Anthropic | Anthropic | 35.2% ± 1.3 |
| 5 | Terminus | claude-3-7-sonnet | 2025-05-16 | Stanford | Anthropic | 30.6% ± 1.9 |
| 6 | Terminus | gpt-4.1 | 2025-05-15 | Stanford | OpenAI | 30.3% ± 2.1 |
| 7 | Terminus | o3 | 2025-05-15 | Stanford | OpenAI | 30.2% ± 0.9 |
| 8 | Goose | o4-mini | 2025-05-18 | Block | OpenAI | 27.5% ± 1.3 |
| 9 | Terminus | gemini-2.5-pro | 2025-05-15 | Stanford | Google | 25.3% ± 2.8 |
| 10 | Codex CLI | o4-mini | 2025-05-15 | OpenAI | OpenAI | 20.0% ± 1.5 |
| 11 | Terminus | o4-mini | 2025-05-15 | Stanford | OpenAI | 18.5% ± 1.4 |
| 12 | Terminus | grok-3-beta | 2025-05-17 | Stanford | xAI | 17.5% ± 4.2 |
| 13 | Terminus | gemini-2.5-flash | 2025-05-17 | Stanford | Google | 16.8% ± 1.3 |
| 14 | Terminus | Llama-4-Maverick-17B | 2025-05-15 | Stanford | Meta | 15.5% ± 1.7 |
| 15 | Codex CLI | codex-mini-latest | 2025-05-18 | OpenAI | OpenAI | 11.3% ± 1.6 |
| 16 | Codex CLI | gpt-4.1 | 2025-05-15 | OpenAI | OpenAI | 8.3% ± 1.4 |
| 17 | Terminus | Qwen3-235B | 2025-05-15 | Stanford | Alibaba | 6.6% ± 1.4 |
| 18 | Terminus | DeepSeek-R1 | 2025-05-15 | Stanford | DeepSeek | 5.7% ± 0.7 |

Please email mikeam@cs.stanford.edu or alex@laude.org to add your results to the leaderboard.