Terminal-Bench

terminal-bench@2.0 Leaderboard

Note: submissions may not modify timeouts or resources

harbor run -d terminal-bench@2.0 -a "agent" -m "model" -k 5

harbor run -d terminal-bench@2.0 --agent-import-path "path.to.agent:SomeAgent" -k 5

Rank	Agent	Model	Date	Agent Org	Model Org	Accuracy
1	vix	Claude Opus 4.7	2026-05-15	vix	Anthropic	90.2% ± 2.1
6	Polaris	Multiple	2026-05-14	PolarisOps	Multiple	82.2% ± 2.8
9	WOZCODE	Claude Opus 4.7	2026-05-14	WOZCODE	Anthropic	80.2% ± 2.1

Results in this leaderboard correspond to terminal-bench@2.0.

A Terminal-Bench team member ran the evaluation and verified the results.

Displaying 3 of 143 available entries
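
For readers who want to work with the rows above programmatically, here is a minimal sketch of parsing the tab-separated table into typed records. The `Entry` type and `parse_row` helper are hypothetical names introduced for illustration; they are not part of Terminal-Bench or harbor. The page does not say whether the ± figure is a standard error or a confidence interval, so the sketch only treats it as a symmetric margin in percentage points.

```python
import re
from typing import NamedTuple

# Rows copied from the leaderboard table (tab-separated columns).
ROWS = (
    "1\tvix\tClaude Opus 4.7\t2026-05-15\tvix\tAnthropic\t90.2% ± 2.1\n"
    "6\tPolaris\tMultiple\t2026-05-14\tPolarisOps\tMultiple\t82.2% ± 2.8\n"
    "9\tWOZCODE\tClaude Opus 4.7\t2026-05-14\tWOZCODE\tAnthropic\t80.2% ± 2.1\n"
)


class Entry(NamedTuple):
    """One leaderboard row (hypothetical record type for this sketch)."""
    rank: int
    agent: str
    model: str
    date: str
    agent_org: str
    model_org: str
    accuracy: float  # percent
    margin: float    # the "±" value, in percentage points


# Accepts both "90.2%± 2.1" and "90.2% ± 2.1".
ACCURACY = re.compile(r"([\d.]+)%\s*±\s*([\d.]+)")


def parse_row(line: str) -> Entry:
    """Split one tab-separated row and convert the numeric fields."""
    rank, agent, model, date, agent_org, model_org, acc = line.split("\t")
    match = ACCURACY.fullmatch(acc.strip())
    if match is None:
        raise ValueError(f"unrecognized accuracy field: {acc!r}")
    return Entry(int(rank), agent, model, date, agent_org, model_org,
                 float(match.group(1)), float(match.group(2)))


entries = [parse_row(line) for line in ROWS.splitlines() if line.strip()]

for e in entries:
    low, high = e.accuracy - e.margin, e.accuracy + e.margin
    print(f"{e.rank:>2}  {e.agent:<8} {low:.1f} to {high:.1f}")
```

Printing the low and high ends of each margin makes it easier to see which entries have overlapping ranges, which matters when comparing closely ranked submissions.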