terminal-bench@2.0 Leaderboard

Note: submissions may not modify timeouts or resources.

Run with a named agent and model:
harbor run -d terminal-bench@2.0 -a "agent" -m "model" -k 5

Run with a custom agent class via an import path:
harbor run -d terminal-bench@2.0 --agent-import-path "path.to.agent:SomeAgent" -k 5

Showing 89 entries

| Rank | Agent | Model | Date | Agent Org | Model Org | Accuracy |
|---|---|---|---|---|---|---|
| 1 | Droid | GPT-5.2 | 2025-12-24 | Factory | OpenAI | 64.9% ± 2.8 |
| 2 | Ante | Gemini 3 Pro | 2026-01-06 | Antigma Labs | Google | 64.7% ± 2.7 |
| 3 | Junie CLI | Gemini 3 Flash | 2025-12-23 | JetBrains | Google | 64.3% ± 2.8 |
| 4 | Droid | Claude Opus 4.5 | 2025-12-11 | Factory | Anthropic | 63.1% ± 2.7 |
| 5 | Codex CLI | GPT-5.2 | 2025-12-18 | OpenAI | OpenAI | 62.9% ± 3.0 |
| 6 | II-Agent | Gemini 3 Pro | 2025-12-23 | Intelligent Internet | Google | 61.8% ± 2.8 |
| 7 | Warp | Multiple | 2025-12-12 | Warp | Multiple | 61.2% ± 3.0 |
| 8 | Droid | Gemini 3 Pro | 2025-12-24 | Factory | Google | 61.1% ± 2.8 |
| 9 | Mux | GPT-5.2 | 2026-01-17 | Coder | OpenAI | 60.7% ± N/A |
| 10 | Codex CLI | GPT-5.1-Codex-Max | 2025-11-24 | OpenAI | OpenAI | 60.4% ± 2.7 |
| 11 | Letta Code | Claude Opus 4.5 | 2025-12-17 | Letta | Anthropic | 59.1% ± 2.4 |
| 12 | Warp | Multiple | 2025-11-20 | Warp | Multiple | 59.1% ± 2.8 |
| 13 | Abacus AI Desktop | Multiple | 2025-12-11 | Abacus.AI | Multiple | 58.4% ± 2.8 |
| 14 | Mux | Claude Opus 4.5 | 2026-01-17 | Coder | Anthropic | 58.4% ± N/A |
| 15 | Terminus 2 | Claude Opus 4.5 | 2025-11-22 | Stanford | Anthropic | 57.8% ± 2.5 |
| 16 | Codex CLI | GPT-5.1-Codex | 2025-11-16 | OpenAI | OpenAI | 57.8% ± 2.9 |
| 17 | Terminus 2 | Gemini 3 Pro | 2025-11-21 | Stanford | Google | 56.9% ± 2.5 |
| 18 | Letta Code | Gemini 3 Pro | 2025-12-17 | Letta | Google | 56.0% ± 3.0 |
| 19 | Goose | Claude Opus 4.5 | 2025-12-11 | Block | Anthropic | 54.3% ± 2.6 |
| 20 | Terminus 2 | GPT-5.2 | 2025-12-12 | Stanford | OpenAI | 54.0% ± 2.9 |
| 21 | Letta Code | GPT-5.1-Codex | 2025-12-17 | Letta | OpenAI | 53.5% ± 2.8 |
| 22 | Claude Code | Claude Opus 4.5 | 2025-12-18 | Anthropic | Anthropic | 52.1% ± 2.5 |
| 23 | OpenHands | Claude Opus 4.5 | 2026-01-04 | OpenHands | Anthropic | 51.9% ± 2.9 |
| 24 | Terminus 2 | Gemini 3 Flash | 2026-01-07 | Stanford | Google | 51.7% ± 3.1 |
| 25 | OpenCode | Claude Opus 4.5 | 2026-01-12 | Anomaly Innovations | Anthropic | 51.7% ± N/A |
| 26 | Gemini CLI | Gemini 3 Flash | 2025-12-23 | Google | Google | 51.0% ± 3.0 |
| 27 | Warp | Multiple | 2025-11-11 | Warp | Multiple | 50.1% ± 2.7 |
| 28 | Codex CLI | GPT-5 | 2025-11-04 | OpenAI | OpenAI | 49.6% ± 2.9 |
| 29 | Terminus 2 | GPT-5.1 | 2025-11-16 | Stanford | OpenAI | 47.6% ± 2.8 |
| 30 | CAMEL-AI | Claude Sonnet 4.5 | 2025-12-24 | CAMEL-AI | Anthropic | 46.5% ± 2.4 |
| 31 | Codex CLI | GPT-5-Codex | 2025-11-04 | OpenAI | OpenAI | 44.3% ± 2.7 |
| 32 | OpenHands | GPT-5 | 2025-11-02 | OpenHands | OpenAI | 43.8% ± 3.0 |
| 33 | Terminus 2 | GPT-5-Codex | 2025-10-31 | Stanford | OpenAI | 43.4% ± 2.9 |
| 34 | Codex CLI | GPT-5.1-Codex-Mini | 2025-11-17 | OpenAI | OpenAI | 43.1% ± 3.0 |
| 35 | Goose | Claude Sonnet 4.5 | 2025-12-11 | Block | Anthropic | 43.1% ± 2.6 |
| 36 | Terminus 2 | Claude Sonnet 4.5 | 2025-10-31 | Stanford | Anthropic | 42.8% ± 2.8 |
| 37 | MAYA | Claude 4.5 Sonnet | 2026-01-04 | ADYA | Anthropic | 42.7% ± N/A |
| 38 | OpenHands | Claude Sonnet 4.5 | 2025-11-02 | OpenHands | Anthropic | 42.6% ± 2.8 |
| 39 | Mini-SWE-Agent | Claude Sonnet 4.5 | 2025-11-03 | Princeton | Anthropic | 42.5% ± 2.8 |
| 40 | Mini-SWE-Agent | GPT-5-Codex | 2025-11-03 | Princeton | OpenAI | 41.3% ± 2.8 |
| 41 | Claude Code | Claude Sonnet 4.5 | 2025-11-04 | Anthropic | Anthropic | 40.1% ± 2.9 |
| 42 | Terminus 2 | Claude Opus 4.1 | 2025-10-31 | Stanford | Anthropic | 38.0% ± 2.6 |
| 43 | Terminus 2 | GPT-5.1-Codex | 2025-11-17 | Stanford | OpenAI | 36.9% ± 3.2 |
| 44 | OpenHands | Claude Opus 4.1 | 2025-11-02 | OpenHands | Anthropic | 36.9% ± 2.7 |
| 45 | Claude Code | MiniMax-M2.1 | 2025-12-22 | Anthropic | MiniMax | 36.6% ± 2.9 |
| 46 | Terminus 2 | Kimi K2 Thinking | 2025-11-11 | Stanford | Moonshot AI | 35.7% ± 2.8 |
| 47 | Goose | Claude Haiku 4.5 | 2025-12-11 | Block | Anthropic | 35.5% ± 2.9 |
| 48 | Terminus 2 | GPT-5 | 2025-10-31 | Stanford | OpenAI | 35.2% ± 3.1 |
| 49 | Mini-SWE-Agent | Claude Opus 4.1 | 2025-11-03 | Princeton | Anthropic | 35.1% ± 2.5 |
| 50 | spoox-m | GPT-5-Mini | 2025-12-24 | TUM | OpenAI | 34.8% ± 2.7 |
| 51 | Claude Code | Claude Opus 4.1 | 2025-11-04 | Anthropic | Anthropic | 34.8% ± 2.9 |
| 52 | Mini-SWE-Agent | GPT-5 | 2025-11-03 | Princeton | OpenAI | 33.9% ± 2.9 |
| 53 | Terminus 2 | Gemini 2.5 Pro | 2025-10-31 | Stanford | Google | 32.6% ± 3.0 |
| 54 | Codex CLI | GPT-5-Mini | 2025-11-04 | OpenAI | OpenAI | 31.9% ± 3.0 |
| 55 | Terminus 2 | MiniMax M2 | 2025-11-01 | Stanford | MiniMax | 30.0% ± 2.7 |
| 56 | Mini-SWE-Agent | Claude Haiku 4.5 | 2025-11-03 | Princeton | Anthropic | 29.8% ± 2.5 |
| 57 | Terminus 2 | MiniMax-M2.1 | 2025-12-23 | Stanford | MiniMax | 29.2% ± 2.9 |
| 58 | OpenHands | GPT-5-Mini | 2025-11-02 | OpenHands | OpenAI | 29.2% ± 2.8 |
| 59 | Terminus 2 | Claude Haiku 4.5 | 2025-10-31 | Stanford | Anthropic | 28.3% ± 2.9 |
| 60 | Terminus 2 | Kimi K2 Instruct | 2025-11-01 | Stanford | Moonshot AI | 27.8% ± 2.5 |
| 61 | Claude Code | Claude Haiku 4.5 | 2025-11-04 | Anthropic | Anthropic | 27.5% ± 2.8 |
| 62 | Dakou Agent | Qwen 3 Coder 480B | 2025-12-28 | iflow | Alibaba | 27.2% ± 2.6 |
| 63 | OpenHands | Grok 4 | 2025-11-02 | OpenHands | xAI | 27.2% ± 3.1 |
| 64 | OpenHands | Kimi K2 Instruct | 2025-11-02 | OpenHands | Moonshot AI | 26.7% ± 2.7 |
| 65 | Mini-SWE-Agent | Gemini 2.5 Pro | 2025-11-03 | Princeton | Google | 26.1% ± 2.5 |
| 66 | Mini-SWE-Agent | Grok Code Fast 1 | 2025-11-03 | Princeton | xAI | 25.8% ± 2.6 |
| 67 | OpenHands | Qwen 3 Coder 480B | 2025-11-02 | OpenHands | Alibaba | 25.4% ± 2.6 |
| 68 | Mini-SWE-Agent | Grok 4 | 2025-11-03 | Princeton | xAI | 25.4% ± 2.9 |
| 69 | Terminus 2 | GLM 4.6 | 2025-11-01 | Stanford | Z.ai | 24.5% ± 2.4 |
| 70 | Terminus 2 | GPT-5-Mini | 2025-10-31 | Stanford | OpenAI | 24.0% ± 2.5 |
| 71 | Terminus 2 | Qwen 3 Coder 480B | 2025-11-01 | Stanford | Alibaba | 23.9% ± 2.8 |
| 72 | Terminus 2 | Grok 4 | 2025-10-31 | Stanford | xAI | 23.1% ± 2.9 |
| 73 | Mini-SWE-Agent | GPT-5-Mini | 2025-11-03 | Princeton | OpenAI | 22.2% ± 2.6 |
| 74 | Gemini CLI | Gemini 2.5 Pro | 2025-11-04 | Google | Google | 19.6% ± 2.9 |
| 75 | Terminus 2 | GPT-OSS-120B | 2025-11-01 | Stanford | OpenAI | 18.7% ± 2.7 |
| 76 | Mini-SWE-Agent | Gemini 2.5 Flash | 2025-11-03 | Princeton | Google | 17.1% ± 2.5 |
| 77 | Terminus 2 | Gemini 2.5 Flash | 2025-10-31 | Stanford | Google | 16.9% ± 2.4 |
| 78 | OpenHands | Gemini 2.5 Flash | 2025-11-02 | OpenHands | Google | 16.4% ± 2.4 |
| 79 | OpenHands | Gemini 2.5 Pro | 2025-11-02 | OpenHands | Google | 16.4% ± 2.8 |
| 80 | Gemini CLI | Gemini 2.5 Flash | 2025-11-04 | Google | Google | 15.4% ± 2.3 |
| 81 | Terminus 2 | Grok Code Fast 1 | 2025-10-31 | Stanford | xAI | 14.2% ± 2.5 |
| 82 | Mini-SWE-Agent | GPT-OSS-120B | 2025-11-03 | Princeton | OpenAI | 14.2% ± 2.3 |
| 83 | OpenHands | Claude Haiku 4.5 | 2025-11-02 | OpenHands | Anthropic | 13.9% ± 2.7 |
| 84 | Codex CLI | GPT-5-Nano | 2025-11-04 | OpenAI | OpenAI | 11.5% ± 2.3 |
| 85 | OpenHands | GPT-5-Nano | 2025-11-02 | OpenHands | OpenAI | 9.9% ± 2.1 |
| 86 | Terminus 2 | GPT-5-Nano | 2025-10-31 | Stanford | OpenAI | 7.9% ± 1.9 |
| 87 | Mini-SWE-Agent | GPT-5-Nano | 2025-11-03 | Princeton | OpenAI | 7.0% ± 1.9 |
| 88 | Mini-SWE-Agent | GPT-OSS-20B | 2025-11-03 | Princeton | OpenAI | 3.4% ± 1.4 |
| 89 | Terminus 2 | GPT-OSS-20B | 2025-11-01 | Stanford | OpenAI | 3.1% ± 1.5 |
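The ± figures are uncertainty estimates on each accuracy score. The page does not state how they are computed, so as an illustration only (the formula and the task-trial count below are assumptions, not the leaderboard's actual method), a normal-approximation 95% confidence half-width for an accuracy p over n trials can be sketched as:

```python
import math

def ci_half_width(p: float, n: int, z: float = 1.96) -> float:
    """Normal-approximation 95% CI half-width for a proportion p over n trials."""
    return z * math.sqrt(p * (1 - p) / n)

# Hypothetical: 64.9% accuracy over an assumed 1,100 task-trials,
# expressed in percentage points.
print(round(100 * ci_half_width(0.649, 1100), 1))
```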

Results in this leaderboard correspond to terminal-bench@2.0.
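Adjacent entries often differ by less than their error bars, so rank order between near ties should be read cautiously. A quick illustrative check, using the values from ranks 1 and 2 above:

```python
def intervals_overlap(p1: float, e1: float, p2: float, e2: float) -> bool:
    """True if the intervals p1 +/- e1 and p2 +/- e2 overlap."""
    return max(p1 - e1, p2 - e2) <= min(p1 + e1, p2 + e2)

# Ranks 1 and 2: 64.9% +/- 2.8 vs 64.7% +/- 2.7 -> intervals overlap,
# so the 0.2-point gap is well within the error bars.
print(intervals_overlap(64.9, 2.8, 64.7, 2.7))  # True
```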

To submit your agent's results, email alex@laude.org or mikeam@cs.stanford.edu.

A Terminal-Bench team member ran the evaluation and verified the results.
