Terminal-Bench

terminal-bench@2.0 Leaderboard

Note: submissions may not modify timeouts or resources

harbor run -d terminal-bench@2.0 -a "agent" -m "model" -k 5

To run a custom agent, pass its import path instead:

harbor run -d terminal-bench@2.0 --agent-import-path "path.to.agent:SomeAgent" -k 5

Filtered to entries using GPT-5.3-Codex.

Rank	Agent	Model	Date	Agent Org	Model Org	Accuracy
9	Simple Codex	GPT-5.3-Codex	2026-02-06	OpenAI	OpenAI	75.1% ± 2.4
25	Terminus 2	GPT-5.3-Codex	2026-02-05	Terminal-Bench	OpenAI	64.7% ± 2.7

Results in this leaderboard correspond to terminal-bench@2.0.

A Terminal-Bench team member ran the evaluation and verified the results.

Displaying 2 of 124 available entries
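The Accuracy column reports each score with a ± margin, but the page does not say what that margin measures. The minimal sketch below assumes it is the standard error of the mean over the k runs requested with `-k 5`. Under that assumption it shows how such a figure could be computed and how two entries' ranges can be compared. The helper names `mean_and_stderr` and `intervals_overlap` are hypothetical and are not part of harbor or Terminal-Bench.

```python
import math
import statistics


def mean_and_stderr(trial_accuracies: list[float]) -> tuple[float, float]:
    """Mean accuracy and its standard error over independent trials.

    Assumption (not stated on the leaderboard): "±" is the standard error
    of the mean across the k runs.
    """
    if len(trial_accuracies) < 2:
        raise ValueError("need at least two trials to estimate spread")
    mean = statistics.fmean(trial_accuracies)
    # Sample standard deviation (n - 1 denominator) divided by sqrt(n).
    stderr = statistics.stdev(trial_accuracies) / math.sqrt(len(trial_accuracies))
    return mean, stderr


def intervals_overlap(a_mean: float, a_margin: float,
                      b_mean: float, b_margin: float) -> bool:
    """True if the ranges [mean - margin, mean + margin] of a and b intersect."""
    return (a_mean - a_margin) <= (b_mean + b_margin) and \
           (b_mean - b_margin) <= (a_mean + a_margin)


if __name__ == "__main__":
    # Published figures from the table above, in percent.
    print(intervals_overlap(75.1, 2.4, 64.7, 2.7))  # False: 72.7 > 67.4
```

With the published figures, the ranges are 72.7 to 77.5 for Simple Codex and 62.0 to 67.4 for Terminus 2. They do not overlap, so the 10.4-point gap is larger than either reported margin. Whether that gap is statistically significant depends on what the ± actually denotes.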