terminal-bench@2.0 Leaderboard

Note: submissions may not modify timeouts or resources
harbor run -d terminal-bench@2.0 -a "agent" -m "model" -k 5
harbor run -d terminal-bench@2.0 --agent-import-path "path.to.agent:SomeAgent" -k 5

Showing 33 entries

Rank  Agent       Model                   Date        Agent Org   Model Org    Accuracy
24    Terminus 2  GPT-5.3-Codex           2026-02-05  AfterQuery  OpenAI       64.7% ± 2.7
27    Terminus 2  Claude Opus 4.6         2026-02-06  AfterQuery  Anthropic    62.9% ± 2.7
41    Terminus 2  Claude Opus 4.5         2025-11-22  AfterQuery  Anthropic    57.8% ± 2.5
43    Terminus 2  Gemini 3 Pro            2025-11-21  AfterQuery  Google       56.9% ± 2.5
46    Terminus 2  GPT-5.2                 2025-12-12  AfterQuery  OpenAI       54.0% ± 2.9
48    Terminus 2  GLM 5                   2026-02-23  AfterQuery  Z-AI         52.4% ± 2.6
52    Terminus 2  Gemini 3 Flash          2026-01-07  AfterQuery  Google       51.7% ± 3.1
55    Terminus 2  GPT-5.1                 2025-11-16  AfterQuery  OpenAI       47.6% ± 2.8
60    Terminus 2  GPT-5-Codex             2025-10-31  AfterQuery  OpenAI       43.4% ± 2.9
61    Terminus 2  Kimi K2.5               2026-02-04  AfterQuery  Kimi         43.2% ± 2.9
64    Terminus 2  Claude Sonnet 4.5       2025-10-31  AfterQuery  Anthropic    42.8% ± 2.8
69    Terminus 2  Minimax m2.5            2026-02-23  AfterQuery  Minimax      42.2% ± 2.6
72    Terminus 2  DeepSeek-V3.2           2026-02-10  AfterQuery  DeepSeek     39.6% ± 2.8
73    Terminus 2  Claude Opus 4.1         2025-10-31  AfterQuery  Anthropic    38.0% ± 2.6
75    Terminus 2  GPT-5.1-Codex           2025-11-17  AfterQuery  OpenAI       36.9% ± 3.2
77    Terminus 2  Kimi K2 Thinking        2025-11-11  AfterQuery  Moonshot AI  35.7% ± 2.8
79    Terminus 2  GPT-5                   2025-10-31  AfterQuery  OpenAI       35.2% ± 3.1
84    Terminus 2  GLM 4.7                 2026-01-28  AfterQuery  Z-AI         33.4% ± 2.8
86    Terminus 2  Gemini 2.5 Pro          2025-10-31  AfterQuery  Google       32.6% ± 3.0
88    Terminus 2  MiniMax M2              2025-11-01  AfterQuery  MiniMax      30.0% ± 2.7
90    Terminus 2  MiniMax M2.1            2025-12-23  AfterQuery  MiniMax      29.2% ± 2.9
92    Terminus 2  Claude Haiku 4.5        2025-10-31  AfterQuery  Anthropic    28.3% ± 2.9
93    Terminus 2  Kimi K2 Instruct        2025-11-01  AfterQuery  Moonshot AI  27.8% ± 2.5
102   Terminus 2  GLM 4.6                 2025-11-01  AfterQuery  Z.ai         24.5% ± 2.4
103   Terminus 2  GPT-5-Mini              2025-10-31  AfterQuery  OpenAI       24.0% ± 2.5
104   Terminus 2  Qwen 3 Coder 480B       2025-11-01  AfterQuery  Alibaba      23.9% ± 2.8
105   Terminus 2  Grok 4                  2025-10-31  AfterQuery  xAI          23.1% ± 2.9
108   Terminus 2  GPT-OSS-120B            2025-11-01  AfterQuery  OpenAI       18.7% ± 2.7
110   Terminus 2  AfterQuery-GPT-OSS-20B  2026-03-31  AfterQuery  AfterQuery   17.0% ± 2.5
111   Terminus 2  Gemini 2.5 Flash        2025-10-31  AfterQuery  Google       16.9% ± 2.4
115   Terminus 2  Grok Code Fast 1        2025-10-31  AfterQuery  xAI          14.2% ± 2.5
120   Terminus 2  GPT-5-Nano              2025-10-31  AfterQuery  OpenAI        7.9% ± 1.9
123   Terminus 2  GPT-OSS-20B             2025-11-01  AfterQuery  OpenAI        3.1% ± 1.5
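
When comparing neighbouring entries, it can help to check whether their reported ranges overlap. The sketch below is illustrative only: the leaderboard does not say what the ± value represents, so the code assumes it is a symmetric margin around the accuracy, and the `Entry` class and `intervals_overlap` helper are hypothetical names, not part of Harbor or Terminal-Bench. The figures are copied from three of the rows above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """One leaderboard row (hypothetical helper type)."""
    model: str
    accuracy: float  # accuracy in percent
    margin: float    # the "±" value, ASSUMED to be a symmetric margin in points

    @property
    def interval(self) -> tuple[float, float]:
        # Lower and upper bound of the reported range.
        return (self.accuracy - self.margin, self.accuracy + self.margin)


def intervals_overlap(a: Entry, b: Entry) -> bool:
    """Return True if the two reported ranges share at least one point."""
    a_lo, a_hi = a.interval
    b_lo, b_hi = b.interval
    return a_lo <= b_hi and b_lo <= a_hi


# Values taken from the table above.
codex = Entry("GPT-5.3-Codex", 64.7, 2.7)
opus46 = Entry("Claude Opus 4.6", 62.9, 2.7)
opus45 = Entry("Claude Opus 4.5", 57.8, 2.5)

print(intervals_overlap(codex, opus46))  # True: 62.0-67.4 overlaps 60.2-65.6
print(intervals_overlap(codex, opus45))  # False: 62.0-67.4 lies above 55.3-60.3
```

Overlapping ranges suggest the ordering of two entries may not be robust. This is a rough heuristic, not a significance test.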

Results in this leaderboard correspond to terminal-bench@2.0.

Submission instructions can be found at harborframework/terminal-bench-2-leaderboard

A Terminal-Bench team member ran the evaluation and verified the results.

Displaying 33 of 123 available entries