# terminal-bench@2.0 Leaderboard

Note: submissions must use terminal-bench@2.0:

```sh
harbor run -d terminal-bench@2.0 -a "<agent-name>" -m "<model-name>" -k 5
```
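For reference, a filled-in invocation might look like the sketch below. The agent and model identifiers are hypothetical placeholders loosely based on entries from this leaderboard; consult the harbor documentation for the exact names your setup registers.

```sh
# Hypothetical submission run: the agent/model identifiers below are
# illustrative placeholders, not confirmed harbor names.
harbor run -d terminal-bench@2.0 \
  -a "terminus-2" \
  -m "gpt-5" \
  -k 5  # -k 5 copied verbatim from the required command above
```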


| Rank | Agent | Model | Date | Agent Org | Model Org | Accuracy |
|------|-------|-------|------|-----------|-----------|----------|
| 1 | Codex CLI | Unknown | 2025-11-16 | OpenAI | Unknown | 60.5% ± N/A |
| 2 | Warp | Multiple | 2025-11-11 | Warp | Multiple | 50.1% ± 2.7 |
| 3 | Codex CLI | GPT-5 | 2025-11-04 | OpenAI | OpenAI | 49.6% ± 2.9 |
| 4 | Codex CLI | GPT-5-Codex | 2025-11-04 | OpenAI | OpenAI | 44.3% ± 2.7 |
| 5 | OpenHands | GPT-5 | 2025-11-02 | OpenHands | OpenAI | 43.8% ± 3.0 |
| 6 | Terminus 2 | GPT-5-Codex | 2025-10-31 | Stanford | OpenAI | 43.4% ± 2.9 |
| 7 | Terminus 2 | Claude Sonnet 4.5 | 2025-10-31 | Stanford | Anthropic | 42.8% ± 2.8 |
| 8 | OpenHands | Claude Sonnet 4.5 | 2025-11-02 | OpenHands | Anthropic | 42.6% ± 2.8 |
| 9 | Mini-SWE-Agent | Claude Sonnet 4.5 | 2025-11-03 | Princeton | Anthropic | 42.5% ± 2.8 |
| 10 | Terminus 2 | Unknown | 2025-11-16 | Stanford | Unknown | 41.8% ± 2.9 |
| 11 | Mini-SWE-Agent | GPT-5-Codex | 2025-11-03 | Princeton | OpenAI | 41.3% ± 2.8 |
| 12 | Claude Code | Claude Sonnet 4.5 | 2025-11-04 | Anthropic | Anthropic | 40.1% ± 2.9 |
| 13 | Terminus 2 | Claude Opus 4.1 | 2025-10-31 | Stanford | Anthropic | 38.0% ± 2.6 |
| 14 | OpenHands | Claude Opus 4.1 | 2025-11-02 | OpenHands | Anthropic | 36.9% ± 2.7 |
| 15 | Terminus 2 | Kimi K2 Thinking | 2025-11-11 | Stanford | Moonshot AI | 35.7% ± 2.8 |
| 16 | Terminus 2 | GPT-5 | 2025-10-31 | Stanford | OpenAI | 35.2% ± 3.1 |
| 17 | Mini-SWE-Agent | Claude Opus 4.1 | 2025-11-03 | Princeton | Anthropic | 35.1% ± 2.5 |
| 18 | Claude Code | Claude Opus 4.1 | 2025-11-04 | Anthropic | Anthropic | 34.8% ± 2.9 |
| 19 | Mini-SWE-Agent | GPT-5 | 2025-11-03 | Princeton | OpenAI | 33.9% ± 2.9 |
| 20 | Terminus 2 | Gemini 2.5 Pro | 2025-10-31 | Stanford | Google | 32.6% ± 3.0 |
| 21 | Codex CLI | GPT-5-Mini | 2025-11-04 | OpenAI | OpenAI | 31.9% ± 3.0 |
| 22 | Terminus 2 | MiniMax M2 | 2025-11-01 | Stanford | MiniMax | 30.0% ± 2.7 |
| 23 | Mini-SWE-Agent | Claude Haiku 4.5 | 2025-11-03 | Princeton | Anthropic | 29.8% ± 2.5 |
| 24 | OpenHands | GPT-5-Mini | 2025-11-02 | OpenHands | OpenAI | 29.2% ± 2.8 |
| 25 | Terminus 2 | Claude Haiku 4.5 | 2025-10-31 | Stanford | Anthropic | 28.3% ± 2.9 |
| 26 | Terminus 2 | Kimi K2 Instruct | 2025-11-01 | Stanford | Moonshot AI | 27.8% ± 2.5 |
| 27 | Claude Code | Claude Haiku 4.5 | 2025-11-04 | Anthropic | Anthropic | 27.5% ± 2.8 |
| 28 | OpenHands | Grok 4 | 2025-11-02 | OpenHands | xAI | 27.2% ± 3.1 |
| 29 | OpenHands | Kimi K2 Instruct | 2025-11-02 | OpenHands | Moonshot AI | 26.7% ± 2.7 |
| 30 | Mini-SWE-Agent | Gemini 2.5 Pro | 2025-11-03 | Princeton | Google | 26.1% ± 2.5 |
| 31 | Mini-SWE-Agent | Grok Code Fast 1 | 2025-11-03 | Princeton | xAI | 25.8% ± 2.6 |
| 32 | Mini-SWE-Agent | Grok 4 | 2025-11-03 | Princeton | xAI | 25.4% ± 2.9 |
| 33 | OpenHands | Qwen 3 Coder 480B | 2025-11-02 | OpenHands | Alibaba | 25.4% ± 2.6 |
| 34 | Terminus 2 | Unknown | 2025-11-13 | Stanford | Unknown | 24.7% ± 2.5 |
| 35 | Terminus 2 | GLM 4.6 | 2025-11-01 | Stanford | Z.ai | 24.5% ± 2.4 |
| 36 | Terminus 2 | GPT-5-Mini | 2025-10-31 | Stanford | OpenAI | 24.0% ± 2.5 |
| 37 | Terminus 2 | Qwen 3 Coder 480B | 2025-11-01 | Stanford | Alibaba | 23.9% ± 2.8 |
| 38 | Terminus 2 | Grok 4 | 2025-10-31 | Stanford | xAI | 23.1% ± 2.9 |
| 39 | Mini-SWE-Agent | GPT-5-Mini | 2025-11-03 | Princeton | OpenAI | 22.2% ± 2.6 |
| 40 | Gemini CLI | Gemini 2.5 Pro | 2025-11-04 | Google | Google | 19.6% ± 2.9 |
| 41 | Terminus 2 | GPT-OSS-120B | 2025-11-01 | Stanford | OpenAI | 18.7% ± 2.7 |
| 42 | Mini-SWE-Agent | Gemini 2.5 Flash | 2025-11-03 | Princeton | Google | 17.1% ± 2.5 |
| 43 | Terminus 2 | Gemini 2.5 Flash | 2025-10-31 | Stanford | Google | 16.9% ± 2.4 |
| 44 | OpenHands | Gemini 2.5 Pro | 2025-11-02 | OpenHands | Google | 16.4% ± 2.8 |
| 45 | OpenHands | Gemini 2.5 Flash | 2025-11-02 | OpenHands | Google | 16.4% ± 2.4 |
| 46 | Gemini CLI | Gemini 2.5 Flash | 2025-11-04 | Google | Google | 15.4% ± 2.3 |
| 47 | Terminus 2 | Grok Code Fast 1 | 2025-10-31 | Stanford | xAI | 14.2% ± 2.5 |
| 48 | Mini-SWE-Agent | GPT-OSS-120B | 2025-11-03 | Princeton | OpenAI | 14.2% ± 2.3 |
| 49 | OpenHands | Claude Haiku 4.5 | 2025-11-02 | OpenHands | Anthropic | 13.9% ± 2.7 |
| 50 | Codex CLI | GPT-5-Nano | 2025-11-04 | OpenAI | OpenAI | 11.5% ± 2.3 |
| 51 | OpenHands | GPT-5-Nano | 2025-11-02 | OpenHands | OpenAI | 9.9% ± 2.1 |
| 52 | Terminus 2 | GPT-5-Nano | 2025-10-31 | Stanford | OpenAI | 7.9% ± 1.9 |
| 53 | Mini-SWE-Agent | GPT-5-Nano | 2025-11-03 | Princeton | OpenAI | 7.0% ± 1.9 |
| 54 | Mini-SWE-Agent | GPT-OSS-20B | 2025-11-03 | Princeton | OpenAI | 3.4% ± 1.4 |
| 55 | Terminus 2 | GPT-OSS-20B | 2025-11-01 | Stanford | OpenAI | 3.1% ± 1.5 |

Results in this leaderboard correspond to terminal-bench@2.0.

To submit your agent's results, email alex@laude.org or mikeam@cs.stanford.edu.

For each entry, a Terminal-Bench team member ran the evaluation and verified the results.
