terminal-bench@2.0 Leaderboard

Note: submissions may not modify timeouts or resources.

Run a named agent with a given model:
harbor run -d terminal-bench/terminal-bench-2 -a "agent" -m "model" -k 5

Run a custom agent by import path:
harbor run -d terminal-bench/terminal-bench-2 --agent-import-path "path.to.agent:SomeAgent" -k 5

Showing 73 entries (verified only)
| Rank | Agent | Model | Date | Agent Org | Model Org | Accuracy |
|---|---|---|---|---|---|---|
| 4 | Codex CLI | GPT-5.5 | 2026-04-23 | OpenAI | OpenAI | 82.2% ± 2.2 |
| 15 | Simple Codex | GPT-5.3-Codex | 2026-02-06 | OpenAI | OpenAI | 75.1% ± 2.4 |
| 24 | Ante | Gemini 3 Pro | 2026-01-06 | Antigma Labs | Google | 69.4% ± 2.1 |
| 32 | Terminus 2 | GPT-5.3-Codex | 2026-02-05 | Terminal-Bench | OpenAI | 64.7% ± 2.7 |
| 35 | Codex CLI | GPT-5.2 | 2025-12-18 | OpenAI | OpenAI | 62.9% ± 3.0 |
| 36 | Terminus 2 | Claude Opus 4.6 | 2026-02-06 | Terminal-Bench | Anthropic | 62.9% ± 2.7 |
| 44 | Codex CLI | GPT-5.1-Codex-Max | 2025-11-24 | OpenAI | OpenAI | 60.4% ± 2.7 |
| 50 | Claude Code | Claude Opus 4.6 | 2026-02-07 | Anthropic | Anthropic | 58.0% ± 2.9 |
| 51 | Crux | GPT-5.1-Codex | 2025-11-16 | Roam | OpenAI | 57.8% ± 2.9 |
| 52 | Terminus 2 | Claude Opus 4.5 | 2025-11-22 | Terminal-Bench | Anthropic | 57.8% ± 2.5 |
| 54 | Terminus 2 | Gemini 3 Pro | 2025-11-21 | Terminal-Bench | Google | 56.9% ± 2.5 |
| 57 | Terminus 2 | GPT-5.2 | 2025-12-12 | Terminal-Bench | OpenAI | 54.0% ± 2.9 |
| 61 | Claude Code | Claude Opus 4.5 | 2025-12-18 | Anthropic | Anthropic | 52.1% ± 2.5 |
| 62 | OpenHands | Claude Opus 4.5 | 2026-01-04 | OpenHands | Anthropic | 51.9% ± 2.9 |
| 63 | Terminus 2 | Gemini 3 Flash | 2026-01-07 | Terminal-Bench | Google | 51.7% ± 3.1 |
| 66 | Codex CLI | GPT-5 | 2025-11-04 | OpenAI | OpenAI | 49.6% ± 2.9 |
| 67 | Terminus 2 | GPT-5.1 | 2025-11-16 | Terminal-Bench | OpenAI | 47.6% ± 2.8 |
| 71 | Codex CLI | GPT-5-Codex | 2025-11-04 | OpenAI | OpenAI | 44.3% ± 2.7 |
| 72 | OpenHands | GPT-5 | 2025-11-02 | OpenHands | OpenAI | 43.8% ± 3.0 |
| 73 | Terminus 2 | GPT-5-Codex | 2025-10-31 | Terminal-Bench | OpenAI | 43.4% ± 2.9 |
| 74 | Terminus 2 | Kimi K2.5 | 2026-02-04 | Terminal-Bench | Kimi | 43.2% ± 2.9 |
| 76 | Crux | GPT-5.1-Codex-Mini | 2025-11-17 | Roam | OpenAI | 43.1% ± 3.0 |
| 78 | Terminus 2 | Claude Sonnet 4.5 | 2025-10-31 | Terminal-Bench | Anthropic | 42.8% ± 2.8 |
| 81 | OpenHands | Claude Sonnet 4.5 | 2025-11-02 | OpenHands | Anthropic | 42.6% ± 2.8 |
| 82 | Mini-SWE-Agent | Claude Sonnet 4.5 | 2025-11-03 | Princeton | Anthropic | 42.5% ± 2.8 |
| 84 | Mini-SWE-Agent | GPT-5-Codex | 2025-11-03 | Princeton | OpenAI | 41.3% ± 2.8 |
| 85 | Claude Code | Claude Sonnet 4.5 | 2025-11-04 | Anthropic | Anthropic | 40.1% ± 2.9 |
| 87 | Terminus 2 | Claude Opus 4.1 | 2025-10-31 | Terminal-Bench | Anthropic | 38.0% ± 2.6 |
| 88 | OpenHands | Claude Opus 4.1 | 2025-11-02 | OpenHands | Anthropic | 36.9% ± 2.7 |
| 89 | Terminus 2 | GPT-5.1-Codex | 2025-11-17 | Terminal-Bench | OpenAI | 36.9% ± 3.2 |
| 90 | Crux | MiniMax M2.1 | 2025-12-22 | Roam | MiniMax | 36.6% ± 2.9 |
| 91 | Terminus 2 | Kimi K2 Thinking | 2025-11-11 | Terminal-Bench | Moonshot AI | 35.7% ± 2.8 |
| 93 | Terminus 2 | GPT-5 | 2025-10-31 | Terminal-Bench | OpenAI | 35.2% ± 3.1 |
| 94 | Mini-SWE-Agent | Claude Opus 4.1 | 2025-11-03 | Princeton | Anthropic | 35.1% ± 2.5 |
| 95 | Claude Code | Claude Opus 4.1 | 2025-11-04 | Anthropic | Anthropic | 34.8% ± 2.9 |
| 97 | Mini-SWE-Agent | GPT-5 | 2025-11-03 | Princeton | OpenAI | 33.9% ± 2.9 |
| 98 | Terminus 2 | GLM 4.7 | 2026-01-28 | Terminal-Bench | Z-AI | 33.4% ± 2.8 |
| 100 | Terminus 2 | Gemini 2.5 Pro | 2025-10-31 | Terminal-Bench | Google | 32.6% ± 3.0 |
| 101 | Codex CLI | GPT-5-Mini | 2025-11-04 | OpenAI | OpenAI | 31.9% ± 3.0 |
| 102 | Terminus 2 | MiniMax M2 | 2025-11-01 | Terminal-Bench | MiniMax | 30.0% ± 2.7 |
| 103 | Mini-SWE-Agent | Claude Haiku 4.5 | 2025-11-03 | Princeton | Anthropic | 29.8% ± 2.5 |
| 104 | Terminus 2 | MiniMax M2.1 | 2025-12-23 | Terminal-Bench | MiniMax | 29.2% ± 2.9 |
| 105 | OpenHands | GPT-5-Mini | 2025-11-02 | OpenHands | OpenAI | 29.2% ± 2.8 |
| 106 | Terminus 2 | Claude Haiku 4.5 | 2025-10-31 | Terminal-Bench | Anthropic | 28.3% ± 2.9 |
| 107 | Terminus 2 | Kimi K2 Instruct | 2025-11-01 | Terminal-Bench | Moonshot AI | 27.8% ± 2.5 |
| 108 | Claude Code | Claude Haiku 4.5 | 2025-11-04 | Anthropic | Anthropic | 27.5% ± 2.8 |
| 109 | OpenHands | Grok 4 | 2025-11-02 | OpenHands | xAI | 27.2% ± 3.1 |
| 111 | OpenHands | Kimi K2 Instruct | 2025-11-02 | OpenHands | Moonshot AI | 26.7% ± 2.7 |
| 112 | Mini-SWE-Agent | Gemini 2.5 Pro | 2025-11-03 | Princeton | Google | 26.1% ± 2.5 |
| 113 | Mini-SWE-Agent | Grok Code Fast 1 | 2025-11-03 | Princeton | xAI | 25.8% ± 2.6 |
| 114 | Mini-SWE-Agent | Grok 4 | 2025-11-03 | Princeton | xAI | 25.4% ± 2.9 |
| 115 | OpenHands | Qwen 3 Coder 480B | 2025-11-02 | OpenHands | Alibaba | 25.4% ± 2.6 |
| 117 | Terminus 2 | GLM 4.6 | 2025-11-01 | Terminal-Bench | Z.ai | 24.5% ± 2.4 |
| 118 | Terminus 2 | GPT-5-Mini | 2025-10-31 | Terminal-Bench | OpenAI | 24.0% ± 2.5 |
| 119 | Terminus 2 | Qwen 3 Coder 480B | 2025-11-01 | Terminal-Bench | Alibaba | 23.9% ± 2.8 |
| 120 | Terminus 2 | Grok 4 | 2025-10-31 | Terminal-Bench | xAI | 23.1% ± 2.9 |
| 122 | Mini-SWE-Agent | GPT-5-Mini | 2025-11-03 | Princeton | OpenAI | 22.2% ± 2.6 |
| 124 | Gemini CLI | Gemini 2.5 Pro | 2025-11-04 | Google | Google | 19.6% ± 2.9 |
| 126 | Terminus 2 | GPT-OSS-120B | 2025-11-01 | Terminal-Bench | OpenAI | 18.7% ± 2.7 |
| 127 | Mini-SWE-Agent | Gemini 2.5 Flash | 2025-11-03 | Princeton | Google | 17.1% ± 2.5 |
| 129 | Terminus 2 | Gemini 2.5 Flash | 2025-10-31 | Terminal-Bench | Google | 16.9% ± 2.4 |
| 130 | OpenHands | Gemini 2.5 Flash | 2025-11-02 | OpenHands | Google | 16.4% ± 2.4 |
| 131 | OpenHands | Gemini 2.5 Pro | 2025-11-02 | OpenHands | Google | 16.4% ± 2.8 |
| 132 | Gemini CLI | Gemini 2.5 Flash | 2025-11-04 | Google | Google | 15.4% ± 2.3 |
| 133 | Mini-SWE-Agent | GPT-OSS-120B | 2025-11-03 | Princeton | OpenAI | 14.2% ± 2.3 |
| 134 | Terminus 2 | Grok Code Fast 1 | 2025-10-31 | Terminal-Bench | xAI | 14.2% ± 2.5 |
| 135 | OpenHands | Claude Haiku 4.5 | 2025-11-02 | OpenHands | Anthropic | 13.9% ± 2.7 |
| 136 | Codex CLI | GPT-5-Nano | 2025-11-04 | OpenAI | OpenAI | 11.5% ± 2.3 |
| 137 | OpenHands | GPT-5-Nano | 2025-11-02 | OpenHands | OpenAI | 9.9% ± 2.1 |
| 139 | Terminus 2 | GPT-5-Nano | 2025-10-31 | Terminal-Bench | OpenAI | 7.9% ± 1.9 |
| 140 | Mini-SWE-Agent | GPT-5-Nano | 2025-11-03 | Princeton | OpenAI | 7.0% ± 1.9 |
| 141 | Mini-SWE-Agent | GPT-OSS-20B | 2025-11-03 | Princeton | OpenAI | 3.4% ± 1.4 |
| 142 | Terminus 2 | GPT-OSS-20B | 2025-11-01 | Terminal-Bench | OpenAI | 3.1% ± 1.5 |

Results in this leaderboard correspond to terminal-bench@2.0.

Submission instructions can be found at harborframework/terminal-bench-2-leaderboard

Verified: a Terminal-Bench team member ran the evaluation and verified the results.

Displaying 73 of 142 available entries
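
Many adjacent entries differ by less than their reported "±" values, so rank order alone can overstate the gap between them. The sketch below is a hypothetical helper, not part of Harbor or the leaderboard tooling. It parses an accuracy cell such as "62.9% ± 3.0" and checks whether two entries' ranges overlap. The page does not say what the "±" value represents (standard error, confidence interval, or something else), so the code assumes only that it is a symmetric margin in percentage points. The names `Score`, `parse_score`, and `intervals_overlap` are illustrative.

```python
import re
from typing import NamedTuple


class Score(NamedTuple):
    accuracy: float  # accuracy in percent, e.g. 62.9
    margin: float    # the "±" value in percentage points (meaning assumed, see above)


# Accepts both "62.9% ± 3.0" and the unspaced form "62.9%± 3.0".
_SCORE_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)%\s*±\s*(\d+(?:\.\d+)?)\s*$")


def parse_score(cell: str) -> Score:
    """Parse one accuracy cell into a Score, or raise ValueError."""
    match = _SCORE_RE.match(cell)
    if match is None:
        raise ValueError(f"unrecognized accuracy cell: {cell!r}")
    return Score(float(match.group(1)), float(match.group(2)))


def intervals_overlap(a: Score, b: Score) -> bool:
    """True if [a - margin, a + margin] and [b - margin, b + margin] intersect."""
    return abs(a.accuracy - b.accuracy) <= a.margin + b.margin


if __name__ == "__main__":
    rank_4 = parse_score("82.2% ± 2.2")
    rank_15 = parse_score("75.1% ± 2.4")
    rank_35 = parse_score("62.9% ± 3.0")
    rank_36 = parse_score("62.9% ± 2.7")
    print(intervals_overlap(rank_4, rank_15))   # ranges are disjoint
    print(intervals_overlap(rank_35, rank_36))  # ranges overlap
```

Overlapping ranges are only a rough signal that two entries may not be distinguishable. A proper comparison would need the per-task results and the definition of the margin, neither of which appears on this page.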