Terminal-Bench

terminal-bench@2.0 Leaderboard

Note: submissions may not modify timeouts or resources

harbor run -d terminal-bench@2.0 -a "agent" -m "model" -k 5

harbor run -d terminal-bench@2.0 --agent-import-path "path.to.agent:SomeAgent" -k 5
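
The `--agent-import-path` value above has the shape `module.path:ClassName`. As a rough illustration of how a string of that shape is conventionally resolved in Python, here is a minimal sketch. It is an assumption about the general convention, not Harbor's actual loader, and the helper name `load_from_import_path` is hypothetical.

```python
import importlib


def load_from_import_path(import_path: str):
    """Resolve a 'package.module:AttributeName' string to the named object.

    Hypothetical helper for illustration only; Harbor's own resolution
    logic is not shown in this page and may differ.
    """
    module_name, sep, attr_name = import_path.partition(":")
    if not sep or not module_name or not attr_name:
        raise ValueError(f"expected 'module:Name', got {import_path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attr_name)


# Demonstrated with a standard-library class, because
# "path.to.agent:SomeAgent" is a placeholder rather than a real module.
ordered_dict_cls = load_from_import_path("collections:OrderedDict")
print(ordered_dict_cls.__name__)
```

Under this convention, the part before the colon must be importable from the environment where the command runs, and the part after it must be a name defined in that module.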

Model filter: Claude Opus 4.6

Rank	Agent	Model	Date	Agent Org	Model Org	Accuracy
1	ForgeCode	Claude Opus 4.6	2026-03-12	ForgeCode	Anthropic	81.8% ± 1.7
7	Capy	Claude Opus 4.6	2026-03-12	Capy	Anthropic	75.3% ± 2.4
10	Terminus-KIRA	Claude Opus 4.6	2026-02-22	KRAFTON AI	Anthropic	74.7% ± 2.6
13	TongAgents	Claude Opus 4.6	2026-02-22	Bigai	Anthropic	71.9% ± 2.7
14	Junie CLI	Multiple	2026-03-07	JetBrains	Multiple	71.0% ± 2.9
16	Droid	Claude Opus 4.6	2026-02-05	Factory	Anthropic	69.9% ± 2.5
19	Crux	Claude Opus 4.6	2026-02-23	Roam	Anthropic	66.9% ± N/A
20	Mux	Claude Opus 4.6	2026-02-13	Coder	Anthropic	66.5% ± 2.5
28	Terminus 2	Claude Opus 4.6	2026-02-06	AfterQuery	Anthropic	62.9% ± 2.7
39	Claude Code	Claude Opus 4.6	2026-02-07	Anthropic	Anthropic	58.0% ± 2.9
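
For readers who want to work with these rows programmatically, the following is a minimal sketch that parses tab-separated rows from the table into typed records. The `Entry` class, `parse_row` function, and field names are hypothetical and chosen for illustration. The page does not state what the ± value represents, so the sketch only stores it (as `margin`) without interpreting it.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Three rows copied from the table above, tab-separated.
ROWS = """\
1\tForgeCode\tClaude Opus 4.6\t2026-03-12\tForgeCode\tAnthropic\t81.8% ± 1.7
19\tCrux\tClaude Opus 4.6\t2026-02-23\tRoam\tAnthropic\t66.9% ± N/A
39\tClaude Code\tClaude Opus 4.6\t2026-02-07\tAnthropic\tAnthropic\t58.0% ± 2.9
"""

# Tolerates optional whitespace around the ± sign and an "N/A" margin.
ACCURACY_RE = re.compile(r"([\d.]+)%\s*±\s*(N/A|[\d.]+)")


@dataclass
class Entry:
    rank: int
    agent: str
    model: str
    date: str
    agent_org: str
    model_org: str
    accuracy: float          # percentage, e.g. 81.8
    margin: Optional[float]  # value after ±, or None when listed as N/A


def parse_row(line: str) -> Entry:
    """Split one tab-separated row and convert the numeric fields."""
    rank, agent, model, date, agent_org, model_org, acc = line.split("\t")
    match = ACCURACY_RE.fullmatch(acc.strip())
    if match is None:
        raise ValueError(f"unrecognized accuracy cell: {acc!r}")
    margin = None if match.group(2) == "N/A" else float(match.group(2))
    return Entry(int(rank), agent, model, date, agent_org, model_org,
                 float(match.group(1)), margin)


entries = [parse_row(line) for line in ROWS.splitlines()]
for entry in entries:
    print(entry.rank, entry.agent, entry.accuracy, entry.margin)
```

Keeping the margin as `None` for the Crux row avoids inventing a number where the leaderboard reports N/A.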

Results in this leaderboard correspond to terminal-bench@2.0.

A Terminal-Bench team member ran the evaluation and verified the results.

Displaying 10 of 122 available entries