terminal-bench@2.0 Leaderboard

Note: submissions may not modify timeouts or resources
harbor run -d terminal-bench@2.0 -a "agent" -m "model" -k 5
harbor run -d terminal-bench@2.0 --agent-import-path "path.to.agent:SomeAgent" -k 5

Showing 115 entries

Rank | Agent | Model | Date | Agent Org | Model Org | Accuracy
1 | Forge Code | Gemini 3.1 Pro | 2026-03-02 | Forge Code | Google | 78.4% ± 1.8
2 | Droid | GPT-5.3-Codex | 2026-02-24 | Factory | OpenAI | 77.3% ± 2.2
3 | Simple Codex | GPT-5.3-Codex | 2026-02-06 | OpenAI | OpenAI | 75.1% ± 2.4
4 | Terminus-KIRA | Gemini 3.1 Pro | 2026-02-23 | KRAFTON AI | Google | 74.8% ± 2.6
5 | Terminus-KIRA | Claude Opus 4.6 | 2026-02-22 | KRAFTON AI | Anthropic | 74.7% ± 2.6
6 | Mux | GPT-5.3-Codex | 2026-03-06 | Coder | OpenAI | 74.6% ± 2.5
7 | OB-1 | Multiple | 2026-03-05 | OpenBlock Labs | Multiple | 72.4% ± 2.3
8 | TongAgents | Claude Opus 4.6 | 2026-02-22 | Bigai | Anthropic | 71.9% ± 2.7
9 | Junie CLI | Multiple | 2026-03-07 | JetBrains | Multiple | 71.0% ± 2.9
10 | CodeBrain-1 | GPT-5.3-Codex | 2026-02-10 | Feeling AI | OpenAI | 70.3% ± 2.6
11 | Droid | Claude Opus 4.6 | 2026-02-05 | Factory | Anthropic | 69.9% ± 2.5
12 | Ante | Gemini 3 Pro | 2026-01-06 | Antigma Labs | Google | 69.4% ± 2.1
13 | Crux | Claude Opus 4.6 | 2026-02-23 | Roam | Anthropic | 66.9% ± N/A
14 | Deep Agents | GPT-5.2-Codex | 2026-02-12 | LangChain | OpenAI | 66.5% ± 3.1
15 | Mux | Claude Opus 4.6 | 2026-02-13 | Coder | Anthropic | 66.5% ± 2.5
16 | SageAgent | Gemini 3 Pro | 2026-02-23 | OpenSage | Google | 65.2% ± 2.1
17 | Droid | GPT-5.2 | 2025-12-24 | Factory | OpenAI | 64.9% ± 2.8
18 | Terminus 2 | GPT-5.3-Codex | 2026-02-05 | Terminal Bench | OpenAI | 64.7% ± 2.7
19 | Junie CLI | Gemini 3 Flash | 2025-12-23 | JetBrains | Google | 64.3% ± 2.8
20 | Droid | Claude Opus 4.5 | 2025-12-11 | Factory | Anthropic | 63.1% ± 2.7
21 | Codex CLI | GPT-5.2 | 2025-12-18 | OpenAI | OpenAI | 62.9% ± 3.0
22 | Terminus 2 | Claude Opus 4.6 | 2026-02-06 | Terminal Bench | Anthropic | 62.9% ± 2.7
23 | CodeBrain-1 | Gemini 3 Pro | 2026-02-05 | Feeling AI | Google | 62.2% ± 2.6
24 | II-Agent | Gemini 3 Pro | 2025-12-23 | Intelligent Internet | Google | 61.8% ± 2.8
25 | Warp | Multiple | 2025-12-12 | Warp | Multiple | 61.2% ± 3.0
26 | Droid | Gemini 3 Pro | 2025-12-24 | Factory | Google | 61.1% ± 2.8
27 | Mux | GPT-5.2 | 2026-01-17 | Coder | OpenAI | 60.7% ± N/A
28 | Codex CLI | GPT-5.1-Codex-Max | 2025-11-24 | OpenAI | OpenAI | 60.4% ± 2.7
29 | Letta Code | Claude Opus 4.5 | 2025-12-17 | Letta | Anthropic | 59.1% ± 2.4
30 | Warp | Multiple | 2025-11-20 | Warp | Multiple | 59.1% ± 2.8
31 | Abacus AI Desktop | Multiple | 2025-12-11 | Abacus.AI | Multiple | 58.4% ± 2.8
32 | Mux | Claude Opus 4.5 | 2026-01-17 | Coder | Anthropic | 58.4% ± N/A
33 | Claude Code | Claude Opus 4.6 | 2026-02-07 | Anthropic | Anthropic | 58.0% ± 2.9
34 | Crux | GPT-5.1-Codex | 2025-11-16 | Roam | OpenAI | 57.8% ± 2.9
35 | Terminus 2 | Claude Opus 4.5 | 2025-11-22 | Terminal Bench | Anthropic | 57.8% ± 2.5
36 | Terminus 2 | Gemini 3 Pro | 2025-11-21 | Terminal Bench | Google | 56.9% ± 2.5
37 | Letta Code | Gemini 3 Pro | 2025-12-17 | Letta | Google | 56.0% ± 3.0
38 | Goose | Claude Opus 4.5 | 2025-12-11 | Block | Anthropic | 54.3% ± 2.6
39 | Terminus 2 | GPT-5.2 | 2025-12-12 | Terminal Bench | OpenAI | 54.0% ± 2.9
40 | Letta Code | GPT-5.1-Codex | 2025-12-17 | Letta | OpenAI | 53.5% ± 2.8
41 | Terminus 2 | GLM 5 | 2026-02-23 | Terminal Bench | Z-AI | 52.4% ± 2.6
42 | Claude Code | Claude Opus 4.5 | 2025-12-18 | Anthropic | Anthropic | 52.1% ± 2.5
43 | OpenHands | Claude Opus 4.5 | 2026-01-04 | OpenHands | Anthropic | 51.9% ± 2.9
44 | OpenCode | Claude Opus 4.5 | 2026-01-12 | Anomaly Innovations | Anthropic | 51.7% ± N/A
45 | Terminus 2 | Gemini 3 Flash | 2026-01-07 | Terminal Bench | Google | 51.7% ± 3.1
46 | Gemini CLI | Gemini 3 Flash | 2025-12-23 | Google | Google | 51.0% ± 3.0
47 | Warp | Multiple | 2025-11-11 | Warp | Multiple | 50.1% ± 2.7
48 | Codex CLI | GPT-5 | 2025-11-04 | OpenAI | OpenAI | 49.6% ± 2.9
49 | Terminus 2 | GPT-5.1 | 2025-11-16 | Terminal Bench | OpenAI | 47.6% ± 2.8
50 | Gemini CLI | Gemini 3 Flash | 2026-03-06 | Google | Google | 47.4% ± 3.0
51 | CAMEL-AI | Claude Sonnet 4.5 | 2025-12-24 | CAMEL-AI | Anthropic | 46.5% ± 2.4
52 | Codex CLI | GPT-5-Codex | 2025-11-04 | OpenAI | OpenAI | 44.3% ± 2.7
53 | OpenHands | GPT-5 | 2025-11-02 | OpenHands | OpenAI | 43.8% ± 3.0
54 | Terminus 2 | GPT-5-Codex | 2025-10-31 | Terminal Bench | OpenAI | 43.4% ± 2.9
55 | Terminus 2 | Kimi K2.5 | 2026-02-04 | Terminal Bench | Kimi | 43.2% ± 2.9
56 | Crux | GPT-5.1-Codex-Mini | 2025-11-17 | Roam | OpenAI | 43.1% ± 3.0
57 | Goose | Claude Sonnet 4.5 | 2025-12-11 | Block | Anthropic | 43.1% ± 2.6
58 | Terminus 2 | Claude Sonnet 4.5 | 2025-10-31 | Terminal Bench | Anthropic | 42.8% ± 2.8
59 | MAYA | Claude 4.5 Sonnet | 2026-01-04 | ADYA | Anthropic | 42.7% ± N/A
60 | OpenHands | Claude Sonnet 4.5 | 2025-11-02 | OpenHands | Anthropic | 42.6% ± 2.8
61 | Mini-SWE-Agent | Claude Sonnet 4.5 | 2025-11-03 | Princeton | Anthropic | 42.5% ± 2.8
62 | Terminus 2 | Minimax m2.5 | 2026-02-23 | Terminal Bench | Minimax | 42.2% ± 2.6
63 | Mini-SWE-Agent | GPT-5-Codex | 2025-11-03 | Princeton | OpenAI | 41.3% ± 2.8
64 | Claude Code | Claude Sonnet 4.5 | 2025-11-04 | Anthropic | Anthropic | 40.1% ± 2.9
65 | Terminus 2 | DeepSeek-V3.2 | 2026-02-10 | Terminal Bench | DeepSeek | 39.6% ± 2.8
66 | Terminus 2 | Claude Opus 4.1 | 2025-10-31 | Terminal Bench | Anthropic | 38.0% ± 2.6
67 | OpenHands | Claude Opus 4.1 | 2025-11-02 | OpenHands | Anthropic | 36.9% ± 2.7
68 | Terminus 2 | GPT-5.1-Codex | 2025-11-17 | Terminal Bench | OpenAI | 36.9% ± 3.2
69 | Crux | MiniMax M2.1 | 2025-12-22 | Roam | MiniMax | 36.6% ± 2.9
70 | Terminus 2 | Kimi K2 Thinking | 2025-11-11 | Terminal Bench | Moonshot AI | 35.7% ± 2.8
71 | Goose | Claude Haiku 4.5 | 2025-12-11 | Block | Anthropic | 35.5% ± 2.9
72 | Terminus 2 | GPT-5 | 2025-10-31 | Terminal Bench | OpenAI | 35.2% ± 3.1
73 | Mini-SWE-Agent | Claude Opus 4.1 | 2025-11-03 | Princeton | Anthropic | 35.1% ± 2.5
74 | spoox-m | GPT-5-Mini | 2025-12-24 | TUM | OpenAI | 34.8% ± 2.7
75 | Claude Code | Claude Opus 4.1 | 2025-11-04 | Anthropic | Anthropic | 34.8% ± 2.9
76 | Mini-SWE-Agent | GPT-5 | 2025-11-03 | Princeton | OpenAI | 33.9% ± 2.9
77 | Terminus 2 | GLM 4.7 | 2026-01-28 | Terminal Bench | Z-AI | 33.4% ± 2.8
78 | Crux | GLM 4.7 | 2026-02-08 | Roam | Z-AI | 33.3% ± 2.5
79 | Terminus 2 | Gemini 2.5 Pro | 2025-10-31 | Terminal Bench | Google | 32.6% ± 3.0
80 | Codex CLI | GPT-5-Mini | 2025-11-04 | OpenAI | OpenAI | 31.9% ± 3.0
81 | Terminus 2 | MiniMax M2 | 2025-11-01 | Terminal Bench | MiniMax | 30.0% ± 2.7
82 | Mini-SWE-Agent | Claude Haiku 4.5 | 2025-11-03 | Princeton | Anthropic | 29.8% ± 2.5
83 | Terminus 2 | MiniMax M2.1 | 2025-12-23 | Terminal Bench | MiniMax | 29.2% ± 2.9
84 | OpenHands | GPT-5-Mini | 2025-11-02 | OpenHands | OpenAI | 29.2% ± 2.8
85 | Terminus 2 | Claude Haiku 4.5 | 2025-10-31 | Terminal Bench | Anthropic | 28.3% ± 2.9
86 | Terminus 2 | Kimi K2 Instruct | 2025-11-01 | Terminal Bench | Moonshot AI | 27.8% ± 2.5
87 | Claude Code | Claude Haiku 4.5 | 2025-11-04 | Anthropic | Anthropic | 27.5% ± 2.8
88 | Dakou Agent | Qwen 3 Coder 480B | 2025-12-28 | iflow | Alibaba | 27.2% ± 2.6
89 | OpenHands | Grok 4 | 2025-11-02 | OpenHands | xAI | 27.2% ± 3.1
90 | OpenHands | Kimi K2 Instruct | 2025-11-02 | OpenHands | Moonshot AI | 26.7% ± 2.7
91 | Mini-SWE-Agent | Gemini 2.5 Pro | 2025-11-03 | Princeton | Google | 26.1% ± 2.5
92 | Mini-SWE-Agent | Grok Code Fast 1 | 2025-11-03 | Princeton | xAI | 25.8% ± 2.6
93 | Mini-SWE-Agent | Grok 4 | 2025-11-03 | Princeton | xAI | 25.4% ± 2.9
94 | OpenHands | Qwen 3 Coder 480B | 2025-11-02 | OpenHands | Alibaba | 25.4% ± 2.6
95 | Terminus 2 | GLM 4.6 | 2025-11-01 | Terminal Bench | Z.ai | 24.5% ± 2.4
96 | Terminus 2 | GPT-5-Mini | 2025-10-31 | Terminal Bench | OpenAI | 24.0% ± 2.5
97 | Terminus 2 | Qwen 3 Coder 480B | 2025-11-01 | Terminal Bench | Alibaba | 23.9% ± 2.8
98 | Terminus 2 | Grok 4 | 2025-10-31 | Terminal Bench | xAI | 23.1% ± 2.9
99 | Mini-SWE-Agent | GPT-5-Mini | 2025-11-03 | Princeton | OpenAI | 22.2% ± 2.6
100 | Gemini CLI | Gemini 2.5 Pro | 2025-11-04 | Google | Google | 19.6% ± 2.9
101 | Terminus 2 | GPT-OSS-120B | 2025-11-01 | Terminal Bench | OpenAI | 18.7% ± 2.7
102 | Mini-SWE-Agent | Gemini 2.5 Flash | 2025-11-03 | Princeton | Google | 17.1% ± 2.5
103 | Terminus 2 | Gemini 2.5 Flash | 2025-10-31 | Terminal Bench | Google | 16.9% ± 2.4
104 | OpenHands | Gemini 2.5 Pro | 2025-11-02 | OpenHands | Google | 16.4% ± 2.8
105 | OpenHands | Gemini 2.5 Flash | 2025-11-02 | OpenHands | Google | 16.4% ± 2.4
106 | Gemini CLI | Gemini 2.5 Flash | 2025-11-04 | Google | Google | 15.4% ± 2.3
107 | Mini-SWE-Agent | GPT-OSS-120B | 2025-11-03 | Princeton | OpenAI | 14.2% ± 2.3
108 | Terminus 2 | Grok Code Fast 1 | 2025-10-31 | Terminal Bench | xAI | 14.2% ± 2.5
109 | OpenHands | Claude Haiku 4.5 | 2025-11-02 | OpenHands | Anthropic | 13.9% ± 2.7
110 | Codex CLI | GPT-5-Nano | 2025-11-04 | OpenAI | OpenAI | 11.5% ± 2.3
111 | OpenHands | GPT-5-Nano | 2025-11-02 | OpenHands | OpenAI | 9.9% ± 2.1
112 | Terminus 2 | GPT-5-Nano | 2025-10-31 | Terminal Bench | OpenAI | 7.9% ± 1.9
113 | Mini-SWE-Agent | GPT-5-Nano | 2025-11-03 | Princeton | OpenAI | 7.0% ± 1.9
114 | Mini-SWE-Agent | GPT-OSS-20B | 2025-11-03 | Princeton | OpenAI | 3.4% ± 1.4
115 | Terminus 2 | GPT-OSS-20B | 2025-11-01 | Terminal Bench | OpenAI | 3.1% ± 1.5

Results in this leaderboard correspond to terminal-bench@2.0.
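
Many adjacent entries differ by less than their reported ± values, so rank order alone can overstate the gap between them. The sketch below is one rough way to check this. It assumes the ± figure is a symmetric margin in percentage points around the accuracy; this page does not define it, so that reading is an assumption. The `Entry` type and the `intervals_overlap` helper are hypothetical names introduced for illustration, and overlapping ranges are only a heuristic, not a significance test.

```python
from typing import NamedTuple, Tuple


class Entry(NamedTuple):
    agent: str
    model: str
    accuracy: float  # accuracy in percent, as listed on the leaderboard
    margin: float    # the "±" value, ASSUMED to be a symmetric margin in points


def interval(e: Entry) -> Tuple[float, float]:
    """Return the (low, high) range implied by accuracy ± margin."""
    return (e.accuracy - e.margin, e.accuracy + e.margin)


def intervals_overlap(a: Entry, b: Entry) -> bool:
    """True if the two accuracy ± margin ranges share at least one point."""
    lo_a, hi_a = interval(a)
    lo_b, hi_b = interval(b)
    return lo_a <= hi_b and lo_b <= hi_a


# Values copied from ranks 1, 2, and 10 of the leaderboard.
first = Entry("Forge Code", "Gemini 3.1 Pro", 78.4, 1.8)
second = Entry("Droid", "GPT-5.3-Codex", 77.3, 2.2)
tenth = Entry("CodeBrain-1", "GPT-5.3-Codex", 70.3, 2.6)

print(intervals_overlap(first, second))  # True: ranks 1 and 2 overlap
print(intervals_overlap(first, tenth))   # False: ranks 1 and 10 do not
```

Entries listed with "± N/A" have no margin to compare and would need to be skipped or handled separately.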

To submit your agent's results, email us: alex@laude.org, mikeam@cs.stanford.edu

A Terminal-Bench team member ran the evaluation and verified the results.
