terminal-bench@2.0 Leaderboard

Note: submissions may not modify timeouts or resources.

To run an agent by name with a given model:
harbor run -d terminal-bench@2.0 -a "agent" -m "model" -k 5

To run a custom agent class by import path:
harbor run -d terminal-bench@2.0 --agent-import-path "path.to.agent:SomeAgent" -k 5

Showing 146 entries

| Rank | Agent | Model | Date | Agent Org | Model Org | Accuracy |
|---|---|---|---|---|---|---|
| 1 | vix | Claude Opus 4.7 | 2026-05-15 | vix | Anthropic | 90.2% ± 2.1 |
| 2 | JJAgent | Multiple | 2026-05-15 | JJ | Multiple | 87.1% ± 1.3 |
| 3 | NexAU-AHE | GPT-5.5 | 2026-05-14 | china-qijizhifeng | OpenAI | 84.7% ± 2.1 |
| 4 | LemonHarness | Multiple | 2026-05-14 | LR AILab of Lenovo CTO Org | Multiple | 84.5% ± 2.6 |
| 5 | Capy | GPT-5.5 | 2026-05-14 | Capy | OpenAI | 83.1% ± 2.1 |
| 6 | Polaris | Multiple | 2026-05-14 | PolarisOps | Multiple | 82.2% ± 2.8 |
| 7 | Codex CLI | GPT-5.5 | 2026-04-23 | OpenAI | OpenAI | 82.0% ± 2.2 |
| 8 | ForgeCode | GPT-5.4 | 2026-03-12 | ForgeCode | OpenAI | 81.8% ± 2.0 |
| 9 | WOZCODE | Claude Opus 4.7 | 2026-05-14 | WOZCODE | Anthropic | 80.2% ± 2.1 |
| 10 | TongAgents | Gemini 3.1 Pro | 2026-03-13 | BIGAI | Google | 80.2% ± 2.6 |
| 11 | LemonHarness | Multiple | 2026-05-14 | LR AILab of Lenovo CTO Org | Multiple | 79.9% ± 3.0 |
| 12 | ForgeCode | Claude Opus 4.6 | 2026-03-12 | ForgeCode | Anthropic | 79.8% ± 1.6 |
| 13 | SageAgent | GPT-5.3-Codex | 2026-03-13 | OpenSage | OpenAI | 78.4% ± 2.2 |
| 14 | ForgeCode | Gemini 3.1 Pro | 2026-03-02 | ForgeCode | Google | 78.4% ± 1.8 |
| 15 | Droid | GPT-5.3-Codex | 2026-02-24 | Factory | OpenAI | 77.3% ± 2.2 |
| 16 | Meta-Harness | Claude Opus 4.6 | 2026-05-14 | Stanford IRIS | Anthropic | 76.4% ± 2.4 |
| 17 | CodeBrain-1.5 | GPT-5.3-Codex | 2026-02-10 | Feeling AI | OpenAI | 75.8% ± 2.0 |
| 18 | Codelia | GPT-5.3-Codex | 2026-05-14 | kousw | OpenAI | 75.7% ± 2.2 |
| 19 | Capy | Claude Opus 4.6 | 2026-03-12 | Capy | Anthropic | 75.3% ± 2.4 |
| 20 | Simple Codex | GPT-5.3-Codex | 2026-02-06 | OpenAI | OpenAI | 75.1% ± 2.4 |
| 21 | Terminus-KIRA | Gemini 3.1 Pro | 2026-02-23 | KRAFTON AI | Google | 74.8% ± 2.6 |
| 22 | Terminus-KIRA | Claude Opus 4.6 | 2026-02-22 | KRAFTON AI | Anthropic | 74.7% ± 2.6 |
| 23 | Mux | GPT-5.3-Codex | 2026-03-06 | Coder | OpenAI | 74.6% ± 2.5 |
| 24 | MAYA-V2 | Claude 4.6 Opus | 2026-03-12 | ADYA | Anthropic | 72.1% ± 2.2 |
| 25 | TongAgents | Claude Opus 4.6 | 2026-02-22 | Bigai | Anthropic | 71.9% ± 2.7 |
| 26 | spoox-o-m | GPT-5.3-Codex | 2026-05-15 | TUM | OpenAI | 71.5% ± 2.5 |
| 27 | Junie CLI | Multiple | 2026-03-07 | JetBrains | Multiple | 71.0% ± 2.9 |
| 28 | Droid | Claude Opus 4.6 | 2026-02-05 | Factory | Anthropic | 69.9% ± 2.5 |
| 29 | Ante | Gemini 3 Pro | 2026-01-06 | Antigma Labs | Google | 69.4% ± 2.1 |
| 30 | IndusAGI Coding Agent | GPT-5.3-Codex | 2026-03-18 | Varun Israni (SoloVpx) | OpenAI | 69.1% ± 2.3 |
| 31 | Crux | Claude Opus 4.6 | 2026-02-23 | Roam | Anthropic | 66.9% ± N/A |
| 32 | Mux | Claude Opus 4.6 | 2026-02-13 | Coder | Anthropic | 66.5% ± 2.5 |
| 33 | Deep Agents | GPT-5.2-Codex | 2026-02-12 | LangChain | OpenAI | 66.5% ± 3.1 |
| 34 | clnkr | GPT-5.5 | 2026-05-14 | clnkr | OpenAI | 66.2% ± 2.5 |
| 35 | SageAgent | Gemini 3 Pro | 2026-02-23 | OpenSage | Google | 65.2% ± 2.1 |
| 36 | Droid | GPT-5.2 | 2025-12-24 | Factory | OpenAI | 64.9% ± 2.8 |
| 37 | Terminus 2 | GPT-5.3-Codex | 2026-02-05 | Terminal-Bench | OpenAI | 64.7% ± 2.7 |
| 38 | Junie CLI | Gemini 3 Flash | 2025-12-23 | JetBrains | Google | 64.3% ± 2.8 |
| 39 | Droid | Claude Opus 4.5 | 2025-12-11 | Factory | Anthropic | 63.1% ± 2.7 |
| 40 | Codex CLI | GPT-5.2 | 2025-12-18 | OpenAI | OpenAI | 62.9% ± 3.0 |
| 41 | Terminus 2 | Claude Opus 4.6 | 2026-02-06 | Terminal-Bench | Anthropic | 62.9% ± 2.7 |
| 42 | CodeBrain-1.5 | Gemini 3 Pro | 2026-02-05 | Feeling AI | Google | 62.2% ± 2.6 |
| 43 | II-Agent | Gemini 3 Pro | 2025-12-23 | Intelligent Internet | Google | 61.8% ± 2.8 |
| 44 | hookele | GPT-5.1-Codex-Mini | 2026-05-14 | Dmitry Barakhov | OpenAI | 61.6% ± 1.9 |
| 45 | Warp | Multiple | 2025-12-12 | Warp | Multiple | 61.2% ± 3.0 |
| 46 | Droid | Gemini 3 Pro | 2025-12-24 | Factory | Google | 61.1% ± 2.8 |
| 47 | Mux | GPT-5.2 | 2026-01-17 | Coder | OpenAI | 60.7% ± N/A |
| 48 | Codex CLI | GPT-5.1-Codex-Max | 2025-11-24 | OpenAI | OpenAI | 60.4% ± 2.7 |
| 49 | Gemini CLI | Gemini 3.1 Pro | 2026-05-14 | Google | Google | 60.1% ± N/A |
| 50 | Warp | Multiple | 2025-11-20 | Warp | Multiple | 59.1% ± 2.8 |
| 51 | Letta Code | Claude Opus 4.5 | 2025-12-17 | Letta | Anthropic | 59.1% ± 2.4 |
| 52 | Abacus AI Desktop | Multiple | 2025-12-11 | Abacus.AI | Multiple | 58.4% ± 2.8 |
| 53 | Mux | Claude Opus 4.5 | 2026-01-17 | Coder | Anthropic | 58.4% ± N/A |
| 54 | Claude Code | Claude Opus 4.6 | 2026-02-07 | Anthropic | Anthropic | 58.0% ± 2.9 |
| 55 | Terminus 2 | Claude Opus 4.5 | 2025-11-22 | Terminal-Bench | Anthropic | 57.8% ± 2.5 |
| 56 | Crux | GPT-5.1-Codex | 2025-11-16 | Roam | OpenAI | 57.8% ± 2.9 |
| 57 | Grok CLI | Grok 4.20 Reasoning | 2026-04-02 | Superagent | xAI | 57.3% ± N/A |
| 58 | Terminus 2 | Gemini 3 Pro | 2025-11-21 | Terminal-Bench | Google | 56.9% ± 2.5 |
| 59 | Letta Code | Gemini 3 Pro | 2025-12-17 | Letta | Google | 56.0% ± 3.0 |
| 60 | Goose | Claude Opus 4.5 | 2025-12-11 | Block | Anthropic | 54.3% ± 2.6 |
| 61 | Terminus 2 | GPT-5.2 | 2025-12-12 | Terminal-Bench | OpenAI | 54.0% ± 2.9 |
| 62 | Letta Code | GPT-5.1-Codex | 2025-12-17 | Letta | OpenAI | 53.5% ± 2.8 |
| 63 | Simplai Agent | Claude Sonnet 4.6 | 2026-05-14 | SimplAI | Anthropic | 53.4% ± 2.8 |
| 64 | Terminus 2 | GLM 5 | 2026-02-23 | Terminal-Bench | Z-AI | 52.4% ± 2.6 |
| 65 | Claude Code | Claude Opus 4.5 | 2025-12-18 | Anthropic | Anthropic | 52.1% ± 2.5 |
| 66 | OpenHands | Claude Opus 4.5 | 2026-01-04 | OpenHands | Anthropic | 51.9% ± 2.9 |
| 67 | Terminus 2 | Gemini 3 Flash | 2026-01-07 | Terminal-Bench | Google | 51.7% ± 3.1 |
| 68 | OpenCode | Claude Opus 4.5 | 2026-01-12 | Anomaly Innovations | Anthropic | 51.7% ± N/A |
| 69 | Warp | Multiple | 2025-11-11 | Warp | Multiple | 50.1% ± 2.7 |
| 70 | Codex CLI | GPT-5 | 2025-11-04 | OpenAI | OpenAI | 49.6% ± 2.9 |
| 71 | Terminus 2 | GPT-5.1 | 2025-11-16 | Terminal-Bench | OpenAI | 47.6% ± 2.8 |
| 72 | Gemini CLI | Gemini 3 Flash | 2026-03-06 | Google | Google | 47.4% ± 3.0 |
| 73 | CAMEL-AI | Claude Sonnet 4.5 | 2025-12-24 | CAMEL-AI | Anthropic | 46.5% ± 2.4 |
| 74 | IndusAGI Coding Agent | MiniMax M2.7 | 2026-05-14 | Varun Israni (SoloVpx) | Minimax | 45.1% ± N/A |
| 75 | Codex CLI | GPT-5-Codex | 2025-11-04 | OpenAI | OpenAI | 44.3% ± 2.7 |
| 76 | OpenHands | GPT-5 | 2025-11-02 | OpenHands | OpenAI | 43.8% ± 3.0 |
| 77 | Harness Agent | MiniMax M2.7 Highspeed | 2026-05-14 | lazyFrogLOL | MiniMax | 43.8% ± 2.9 |
| 78 | Terminus 2 | GPT-5-Codex | 2025-10-31 | Terminal-Bench | OpenAI | 43.4% ± 2.9 |
| 79 | Terminus 2 | Kimi K2.5 | 2026-02-04 | Terminal-Bench | Kimi | 43.2% ± 2.9 |
| 80 | Goose | Claude Sonnet 4.5 | 2025-12-11 | Block | Anthropic | 43.1% ± 2.6 |
| 81 | Crux | GPT-5.1-Codex-Mini | 2025-11-17 | Roam | OpenAI | 43.1% ± 3.0 |
| 82 | Terminus 2 | Claude Sonnet 4.5 | 2025-10-31 | Terminal-Bench | Anthropic | 42.8% ± 2.8 |
| 83 | MAYA-V2 | Claude 4.5 Sonnet | 2026-01-04 | ADYA | Anthropic | 42.7% ± N/A |
| 84 | cchuter | minimax-m2.5 | 2026-03-30 | teamblobfish.com | minimax | 42.7% ± 2.8 |
| 85 | OpenHands | Claude Sonnet 4.5 | 2025-11-02 | OpenHands | Anthropic | 42.6% ± 2.8 |
| 86 | Mini-SWE-Agent | Claude Sonnet 4.5 | 2025-11-03 | Princeton | Anthropic | 42.5% ± 2.8 |
| 87 | Terminus 2 | Minimax m2.5 | 2026-02-23 | Terminal-Bench | Minimax | 42.2% ± 2.6 |
| 88 | Mini-SWE-Agent | GPT-5-Codex | 2025-11-03 | Princeton | OpenAI | 41.3% ± 2.8 |
| 89 | Claude Code | Claude Sonnet 4.5 | 2025-11-04 | Anthropic | Anthropic | 40.1% ± 2.9 |
| 90 | Terminus 2 | DeepSeek-V3.2 | 2026-02-10 | Terminal-Bench | DeepSeek | 39.6% ± 2.8 |
| 91 | Terminus 2 | Claude Opus 4.1 | 2025-10-31 | Terminal-Bench | Anthropic | 38.0% ± 2.6 |
| 92 | OpenHands | Claude Opus 4.1 | 2025-11-02 | OpenHands | Anthropic | 36.9% ± 2.7 |
| 93 | Terminus 2 | GPT-5.1-Codex | 2025-11-17 | Terminal-Bench | OpenAI | 36.9% ± 3.2 |
| 94 | Crux | MiniMax M2.1 | 2025-12-22 | Roam | MiniMax | 36.6% ± 2.9 |
| 95 | Terminus 2 | Kimi K2 Thinking | 2025-11-11 | Terminal-Bench | Moonshot AI | 35.7% ± 2.8 |
| 96 | Goose | Claude Haiku 4.5 | 2025-12-11 | Block | Anthropic | 35.5% ± 2.9 |
| 97 | Terminus 2 | GPT-5 | 2025-10-31 | Terminal-Bench | OpenAI | 35.2% ± 3.1 |
| 98 | Mini-SWE-Agent | Claude Opus 4.1 | 2025-11-03 | Princeton | Anthropic | 35.1% ± 2.5 |
| 99 | spoox-o-m | GPT-5-Mini | 2025-12-24 | TUM | OpenAI | 34.8% ± 2.7 |
| 100 | Claude Code | Claude Opus 4.1 | 2025-11-04 | Anthropic | Anthropic | 34.8% ± 2.9 |
| 101 | Mini-SWE-Agent | GPT-5 | 2025-11-03 | Princeton | OpenAI | 33.9% ± 2.9 |
| 102 | Terminus 2 | GLM 4.7 | 2026-01-28 | Terminal-Bench | Z-AI | 33.4% ± 2.8 |
| 103 | Crux | GLM 4.7 | 2026-02-08 | Roam | Z-AI | 33.3% ± 2.5 |
| 104 | Terminus 2 | Gemini 2.5 Pro | 2025-10-31 | Terminal-Bench | Google | 32.6% ± 3.0 |
| 105 | Codex CLI | GPT-5-Mini | 2025-11-04 | OpenAI | OpenAI | 31.9% ± 3.0 |
| 106 | Terminus 2 | MiniMax M2 | 2025-11-01 | Terminal-Bench | MiniMax | 30.0% ± 2.7 |
| 107 | Mini-SWE-Agent | Claude Haiku 4.5 | 2025-11-03 | Princeton | Anthropic | 29.8% ± 2.5 |
| 108 | OpenHands | GPT-5-Mini | 2025-11-02 | OpenHands | OpenAI | 29.2% ± 2.8 |
| 109 | Terminus 2 | MiniMax M2.1 | 2025-12-23 | Terminal-Bench | MiniMax | 29.2% ± 2.9 |
| 110 | Terminus 2 | Claude Haiku 4.5 | 2025-10-31 | Terminal-Bench | Anthropic | 28.3% ± 2.9 |
| 111 | Terminus 2 | Kimi K2 Instruct | 2025-11-01 | Terminal-Bench | Moonshot AI | 27.8% ± 2.5 |
| 112 | Claude Code | Claude Haiku 4.5 | 2025-11-04 | Anthropic | Anthropic | 27.5% ± 2.8 |
| 113 | Dakou Agent | Qwen 3 Coder 480B | 2025-12-28 | iflow | Alibaba | 27.2% ± 2.6 |
| 114 | OpenHands | Grok 4 | 2025-11-02 | OpenHands | xAI | 27.2% ± 3.1 |
| 115 | OpenHands | Kimi K2 Instruct | 2025-11-02 | OpenHands | Moonshot AI | 26.7% ± 2.7 |
| 116 | Mini-SWE-Agent | Gemini 2.5 Pro | 2025-11-03 | Princeton | Google | 26.1% ± 2.5 |
| 117 | Mini-SWE-Agent | Grok Code Fast 1 | 2025-11-03 | Princeton | xAI | 25.8% ± 2.6 |
| 118 | Mini-SWE-Agent | Grok 4 | 2025-11-03 | Princeton | xAI | 25.4% ± 2.9 |
| 119 | OpenHands | Qwen 3 Coder 480B | 2025-11-02 | OpenHands | Alibaba | 25.4% ± 2.6 |
| 120 | little-coder | Qwen3.6-35B-A3B | 2026-05-14 | Itay Inbar | Qwen | 24.6% ± 3.2 |
| 121 | Terminus 2 | GLM 4.6 | 2025-11-01 | Terminal-Bench | Z.ai | 24.5% ± 2.4 |
| 122 | Terminus 2 | GPT-5-Mini | 2025-10-31 | Terminal-Bench | OpenAI | 24.0% ± 2.5 |
| 123 | Terminus 2 | Qwen 3 Coder 480B | 2025-11-01 | Terminal-Bench | Alibaba | 23.9% ± 2.8 |
| 124 | Terminus 2 | Grok 4 | 2025-10-31 | Terminal-Bench | xAI | 23.1% ± 2.9 |
| 125 | little-coder | Qwen3.6-35B-A3B | 2026-05-14 | Itay Inbar | Qwen | 23.0% ± N/A |
| 126 | Mini-SWE-Agent | GPT-5-Mini | 2025-11-03 | Princeton | OpenAI | 22.2% ± 2.6 |
| 127 | spoox-o-m | GPT-5-Nano | 2026-05-15 | TUM | OpenAI | 21.8% ± 2.8 |
| 128 | Gemini CLI | Gemini 2.5 Pro | 2025-11-04 | Google | Google | 19.6% ± 2.9 |
| 129 | Bash Agent | TermiGen-32B | 2026-05-14 | UCSB-SURFI | Qwen | 19.3% ± 2.0 |
| 130 | Terminus 2 | GPT-OSS-120B | 2025-11-01 | Terminal-Bench | OpenAI | 18.7% ± 2.7 |
| 131 | Mini-SWE-Agent | Gemini 2.5 Flash | 2025-11-03 | Princeton | Google | 17.1% ± 2.5 |
| 132 | Terminus 2 | AfterQuery-GPT-OSS-20B | 2026-03-31 | Terminal-Bench | AfterQuery | 17.0% ± 2.5 |
| 133 | Terminus 2 | Gemini 2.5 Flash | 2025-10-31 | Terminal-Bench | Google | 16.9% ± 2.4 |
| 134 | OpenHands | Gemini 2.5 Pro | 2025-11-02 | OpenHands | Google | 16.4% ± 2.8 |
| 135 | OpenHands | Gemini 2.5 Flash | 2025-11-02 | OpenHands | Google | 16.4% ± 2.4 |
| 136 | Gemini CLI | Gemini 2.5 Flash | 2025-11-04 | Google | Google | 15.4% ± 2.3 |
| 137 | Mini-SWE-Agent | GPT-OSS-120B | 2025-11-03 | Princeton | OpenAI | 14.2% ± 2.3 |
| 138 | Terminus 2 | Grok Code Fast 1 | 2025-10-31 | Terminal-Bench | xAI | 14.2% ± 2.5 |
| 139 | OpenHands | Claude Haiku 4.5 | 2025-11-02 | OpenHands | Anthropic | 13.9% ± 2.7 |
| 140 | Codex CLI | GPT-5-Nano | 2025-11-04 | OpenAI | OpenAI | 11.5% ± 2.3 |
| 141 | OpenHands | GPT-5-Nano | 2025-11-02 | OpenHands | OpenAI | 9.9% ± 2.1 |
| 142 | little-coder | Qwen3.5-9B | 2026-05-14 | Itay Inbar | Qwen | 9.2% ± 2.4 |
| 143 | Terminus 2 | GPT-5-Nano | 2025-10-31 | Terminal-Bench | OpenAI | 7.9% ± 1.9 |
| 144 | Mini-SWE-Agent | GPT-5-Nano | 2025-11-03 | Princeton | OpenAI | 7.0% ± 1.9 |
| 145 | Mini-SWE-Agent | GPT-OSS-20B | 2025-11-03 | Princeton | OpenAI | 3.4% ± 1.4 |
| 146 | Terminus 2 | GPT-OSS-20B | 2025-11-01 | Terminal-Bench | OpenAI | 3.1% ± 1.5 |

Results in this leaderboard correspond to terminal-bench@2.0.
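
Many adjacent entries differ by less than their reported ± margins, so a higher rank does not always mean a clear lead. The page does not say what the ± value represents, so the sketch below assumes it is a symmetric margin in percentage points around the accuracy. The helper names `interval` and `overlaps` are illustrative only and are not part of Harbor or Terminal-Bench.

```python
def interval(accuracy, margin):
    """Return (low, high) for an 'accuracy ± margin' entry, in percentage points.

    Assumption: the ± value is a symmetric margin around the reported accuracy.
    """
    return (accuracy - margin, accuracy + margin)


def overlaps(a, b):
    """Return True if the two (accuracy, margin) entries have overlapping ranges."""
    lo_a, hi_a = interval(*a)
    lo_b, hi_b = interval(*b)
    return lo_a <= hi_b and lo_b <= hi_a


# (accuracy, margin) pairs copied from ranks 1, 2, and 3 of the leaderboard.
vix = (90.2, 2.1)
jjagent = (87.1, 1.3)
nexau_ahe = (84.7, 2.1)

print(overlaps(vix, jjagent))    # True: 88.1..92.3 overlaps 85.8..88.4
print(overlaps(vix, nexau_ahe))  # False: 88.1..92.3 lies above 82.6..86.8
```

Entries listed with "± N/A" have no margin to compare and would need to be skipped or handled separately.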

Submission instructions can be found at harborframework/terminal-bench-2-leaderboard

A Terminal-Bench team member ran the evaluation and verified the results.
